As a precursor to this (excellent, in-depth) post I also recommend Martin Kleppmann's Designing Data-Intensive Applications, which is (to date) the definitive 101 book on the topic.
Good post! It might sound more credible if you had mentioned that YugaByte started as a fork of Apache Kudu, and that some of the design decisions and technical challenges you mention were addressed by code that was already there before your fork.
I am the CTO/Founder of YugaByte and author of the above post. Thanks for your comments, glad you liked the post! You make a great point about YugaByte DB using Apache Kudu to start out, but wanted to clarify a few things.
* This is a post about our architectural decisions in building a distributed SQL database, and less so about how we actually implement it. Hence there is no mention of any starting points as far as the codebase.
I genuinely never thought I'd see a fully distributed db. This is the magic bullet people have been looking for in terms of having a simple-ish db compared to complex data layers.
Incredible work. And supporting full postgresql.
This does sound a lot like a marketing post, but I don't really care. This is pretty awesome; I'm rarely this hyped for a technology.
Before you let the hyperbole train run its course, there are many distributed databases out there: Vitess+MySQL (which runs YouTube), CockroachDB, YugaByte, FaunaDB, BedrockDB, etc.
I always thought it was one of Google's pursuits, but NoSQL was all the rage and I was never taken seriously ... until F1, that is.
Funnily enough, all of what I would call the "NoSQL papers" are very well known, yet F1 is hardly ever talked about. As if some people would not like to admit they might have been wrong ;-)
happy that you liked the post. as you could infer, this simplicity is by no means a magic bullet. there are always trade-offs depending on the database type you use as baseline for comparison. we highlight them here: https://docs.yugabyte.com/latest/introduction/#what-are-the-...
regarding the tone of the post, we are always open to feedback on how we can do better. let us know through any means possible incl. github and slack.
YugaByte DB product manager here. we have compared the Spanner and Calvin architectures in depth previously (https://blog.yugabyte.com/google-spanner-vs-calvin-global-co...). one key difference comes from the fact that Calvin's deterministic locking protocol requires advance knowledge of all transactions' read/write sets before transaction execution can begin -- this leads to repeated transaction restarts if the secondary index is on a column whose value is changing frequently, as well as no support for client-initiated session transactions. in other words, Calvin is better suited for a transactional NoSQL database as opposed to a true distributed SQL database.
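The restart behavior described above can be sketched in a few lines. This is a hypothetical, simplified illustration (the function and variable names are mine, not Calvin's): a reconnaissance pass resolves a secondary-index lookup into a concrete read/write set, and if a concurrent write has moved the index entries by the time the transaction is scheduled, the transaction must restart with the new set.

```python
# Hypothetical sketch of Calvin-style read/write-set prediction: a recon pass
# computes the row set behind a secondary-index lookup, and the transaction
# restarts whenever the set has changed by execution time.

def recon_read_write_set(index, key):
    """Reconnaissance pass: resolve a secondary-index key to concrete row ids."""
    return frozenset(index.get(key, ()))

def execute_deterministically(index, key, predicted_set, max_restarts=10):
    """Re-run recon until the predicted set still matches at execution time."""
    for attempt in range(max_restarts):
        actual = recon_read_write_set(index, key)
        if actual == predicted_set:
            return attempt  # sets agree: the transaction can be scheduled
        predicted_set = actual  # the index moved under us: restart with new set
    raise RuntimeError("too many restarts under contention")

index = {"status=open": {1, 2, 3}}
predicted = recon_read_write_set(index, "status=open")
index["status=open"] = {1, 2, 4}  # a concurrent write changes the index
restarts = execute_deterministically(index, "status=open", predicted)
```

On a hot column the set keeps changing between recon and execution, which is exactly the repeated-restart pattern the comment points at.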
Under contention, a Calvin-based system will behave similarly to others which use optimistic locking schemes for Serializable isolation such as Postgres, or YB itself. There are advantages to the Calvin approach as well. For example, under Calvin, the system doesn't have to write out speculative intents to all data replicas in order to detect contention: The only writes to data storage are successful ones. The original paper only describes this briefly, but you can read about how FaunaDB has implemented it in more detail: https://fauna.com/blog/consistency-without-clocks-faunadb-tr...
It's also not a stretch to see how the protocol described in that post can be extended to support session transactions: Rather than executing a transaction per request, the transaction context is maintained for the life of the session and then dispatched to Calvin on commit. (This is in fact how we are implementing it in our SQL API feature.)
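The session-transaction extension described above can be sketched roughly as follows. This is an illustrative assumption of mine, not FaunaDB's actual implementation: reads and writes are buffered client-side for the life of the session, and only the accumulated read/write set is dispatched to the (assumed) deterministic log at commit.

```python
# Hypothetical sketch: a session transaction buffers its reads and writes
# locally, then dispatches the whole read/write set to a Calvin-style
# sequencer in a single request at commit time.

class SessionTransaction:
    def __init__(self):
        self.reads = set()
        self.writes = {}

    def read(self, store, key):
        self.reads.add(key)
        return self.writes.get(key, store.get(key))  # read-your-own-writes

    def write(self, key, value):
        self.writes[key] = value

    def commit(self, dispatch):
        # dispatch stands in for submitting to the deterministic log
        return dispatch(frozenset(self.reads), dict(self.writes))

store = {"a": 1}
txn = SessionTransaction()
txn.write("a", txn.read(store, "a") + 1)
log = []
txn.commit(lambda rs, ws: log.append((rs, ws)))
```

The point of the sketch is that nothing reaches the sequencer until commit, so Calvin still sees the full read/write set up front.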
I would instead say that one of the more significant differences between Calvin and Spanner is the much stricter requirements the latter places on its hardware (i.e. clock accuracy) in order to maintain its correctness guarantees; a weakness its variants also share.
If only things were that simple :) Calvin avoids the need to track clock accuracy by making every transaction go through a single consensus leader, which inherently becomes a bottleneck for performance and availability. Spanner and its derivatives, including YugaByte DB, chose not to introduce such a bottleneck in their architecture, using per-shard consensus instead -- this means caring about clock skews for multi-shard transactions and not allowing such transactions to proceed on nodes that exhibit skew greater than the configured maximum. The big question is which is more acceptable: lower performance & availability on the normal path, OR handling offending nodes with high clock skew on the exceptions path?
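The exception-path handling described above can be sketched as a simple admission check. The names and the bound are illustrative assumptions on my part, not YugaByte DB's actual API: a multi-shard transaction is refused on a node whose measured clock skew exceeds the configured maximum, while single-shard transactions are unaffected.

```python
# Hypothetical sketch: refuse multi-shard transactions on a node whose clock
# skew exceeds a configured bound; single-shard txns don't need the check.

MAX_CLOCK_SKEW_MS = 500  # assumed configurable bound, value is illustrative

def admit_transaction(shards_touched, measured_skew_ms):
    if len(shards_touched) <= 1:
        return True  # single-shard: correctness doesn't depend on cross-node clocks
    if measured_skew_ms > MAX_CLOCK_SKEW_MS:
        return False  # offending node: reject rather than risk inconsistency
    return True
```

This is the trade-off in miniature: the check only ever fires on the exception path, whereas a single global sequencer pays its cost on every transaction.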
Sorry, I can't let this go unchallenged. Again, you are inventing an architectural deficiency where there is none. The log in Calvin is partitioned and does not require all transactions to go through a single physical consensus leader. There is no single node in a Calvin cluster which must handle the entire global stream of transactions. The Calvin paper itself covers how this works in detail: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...
The article makes mention of Aurora and Spanner as inspiration; did CosmosDB not exist yet? I realize Cosmos is really more of a document database but given its ability to be queried with SQL in addition to massive horizontal scale and global replication seems like it should be more or less in the discussion.
YugaByte DB product manager here. Yes, CosmosDB and its underlying architecture were indeed not yet publicly available when we started the YugaByte DB project in early 2016. However, classifying CosmosDB as a distributed SQL database is a bit of a stretch given no support for SQL on the write path (the write path uses a custom document API). This means no support for multi-row/multi-shard ACID transactions, as well as features such as client-initiated session transactions.
Thanks for sharing these. Hadn't heard of the first 2, so looking forward to reading them. We did review the Amazon Dynamo architecture (which is used in Dynomite) in depth but found it to be lacking for supporting even single-row linearizability. Multi-row ACID with serializable isolation builds on top of that property. These are must-haves in a storage architecture to power a distributed SQL API.
assume you are referring to https://en.wikipedia.org/wiki/NonStop_SQL -- was not aware of it so thanks for bringing it to attention. first impression is that it was a technology way ahead of its time. in the late '80s (and some would argue even recently), the majority of apps could be run easily on monolithic/single-node SQL databases. they were not generating data at a volume that requires multiple nodes (in this case parallel computers), and user perception could indeed tolerate high latencies, manual failover and repair. we are now in a different era when it comes to data volume and user perception.
From a technical perspective, I'd guess that the reason was because it makes local transactions trickier to do. Spanner has a philosophy of avoiding work that isn't required, and knows how to optimize transactions that don't require 2PC-ing data on different servers.
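That optimization can be sketched in a couple of lines. This is a generic illustration of the idea, not Spanner's actual implementation: when every key a transaction touches lives on one participant, the coordinator can commit in a single consensus round and skip two-phase commit entirely.

```python
# Illustrative sketch: commit locally when all keys map to one shard,
# and fall back to two-phase commit only for multi-participant transactions.

def commit(keys, shard_of):
    participants = {shard_of(k) for k in keys}
    if len(participants) == 1:
        return "single-shard commit"  # one consensus round, no 2PC needed
    return f"2PC across {len(participants)} participants"
```

The win is that a well-partitioned workload mostly hits the cheap branch, so the 2PC machinery is paid for only when a transaction genuinely spans servers.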
He also did these awesome Tolkien-esque maps of the database engine ecosystem: https://martin.kleppmann.com/2017/03/15/map-distributed-data...
Anyway, I inject this sort of stuff directly into my veins, so thanks very much for the post!