As a precursor to this (excellent, in-depth) post I also recommend Martin Kleppmann's Designing Data-Intensive Applications, which is (to date) the definitive 101 book on the topic.
Good post! It might sound more credible if you had mentioned that YugaByte started as a fork of Apache Kudu, and that some of the design decisions and technical challenges you mention were addressed by code that was already there before your fork.
I am the CTO/Founder of YugaByte and author of the above post. Thanks for your comments, glad you liked the post! You make a great point about YugaByte DB using Apache Kudu to start out, but wanted to clarify a few things.
* This is a post about our architectural decisions in building a distributed SQL database, and less so about how we actually implement it. Hence there is no mention of any starting points as far as the codebase.
I genuinely never thought I'd see a fully distributed db. This is the magic bullet people have been looking for in terms of having a simple-ish db compared to complex data layers.
Incredible work. And supporting full postgresql.
This does sound a lot like a marketing post, but I don't really care. This is pretty awesome; I'm rarely this hyped for a technology.
Before you let the hyperbole train run its course, there are many distributed databases out there: Vitess+MySQL (which runs YouTube), CockroachDB, YugaByte, FaunaDB, BedrockDB, etc.
I always thought it was one of Google's pursuits, but NoSQL was all the rage and I was never taken seriously ... until F1, that is.
Funnily enough, all of what I would call the "NoSQL papers" are very well known, yet F1 is hardly ever talked about. As if some people would not like to admit they might have been wrong ;-)
happy that you liked the post. as you could infer, this simplicity is by no means a magic bullet. there are always trade-offs depending on the database type you use as baseline for comparison. we highlight them here: https://docs.yugabyte.com/latest/introduction/#what-are-the-...
regarding the tone of the post, we are always open to feedback on how we can do better. let us know through any means possible incl. github and slack.
YugaByte DB product manager here. we have compared the Spanner and Calvin architectures in depth previously (https://blog.yugabyte.com/google-spanner-vs-calvin-global-co...). one key difference comes from the fact that Calvin's deterministic locking protocol requires advance knowledge of all transactions' read/write sets before transaction execution can begin -- this leads to repeated transaction restarts if the secondary index is on a column whose value is changing frequently, as well as no support for client-initiated session transactions. in other words, Calvin is better suited for a transactional NoSQL database as opposed to a true distributed SQL database.
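The restart behavior described above can be sketched in a few lines. This is a hypothetical, simplified illustration (the function and variable names are mine, not Calvin's): a reconnaissance pass resolves a secondary-index lookup into a concrete read/write set, and if a concurrent write has moved the index entries by the time the transaction is scheduled, the transaction must restart with the new set.

```python
# Hypothetical sketch of Calvin-style read/write-set prediction: a recon pass
# computes the row set behind a secondary-index lookup, and the transaction
# restarts whenever the set has changed by execution time.

def recon_read_write_set(index, key):
    """Reconnaissance pass: resolve a secondary-index key to concrete row ids."""
    return frozenset(index.get(key, ()))

def execute_deterministically(index, key, predicted_set, max_restarts=10):
    """Re-run recon until the predicted set still matches at execution time."""
    for attempt in range(max_restarts):
        actual = recon_read_write_set(index, key)
        if actual == predicted_set:
            return attempt  # sets agree: the transaction can be scheduled
        predicted_set = actual  # the index moved under us: restart with new set
    raise RuntimeError("too many restarts under contention")

index = {"status=open": {1, 2, 3}}
predicted = recon_read_write_set(index, "status=open")
index["status=open"] = {1, 2, 4}  # a concurrent write changes the index
restarts = execute_deterministically(index, "status=open", predicted)
```

On a hot column the set keeps changing between recon and execution, which is exactly the repeated-restart pattern the comment points at.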
Under contention, a Calvin-based system will behave similarly to others which use optimistic locking schemes for Serializable isolation such as Postgres, or YB itself. There are advantages to the Calvin approach as well. For example, under Calvin, the system doesn't have to write out speculative intents to all data replicas in order to detect contention: The only writes to data storage are successful ones. The original paper only describes this briefly, but you can read about how FaunaDB has implemented it in more detail: https://fauna.com/blog/consistency-without-clocks-faunadb-tr...
It's also not a stretch to see how the protocol described in that post can be extended to support session transactions: Rather than executing a transaction per request, the transaction context is maintained for the life of the session and then dispatched to Calvin on commit. (This is in fact how we are implementing it in our SQL API feature.)
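The session-transaction extension described above can be sketched roughly as follows. This is an illustrative assumption of mine, not FaunaDB's actual implementation: reads and writes are buffered client-side for the life of the session, and only the accumulated read/write set is dispatched to the (assumed) deterministic log at commit.

```python
# Hypothetical sketch: a session transaction buffers its reads and writes
# locally, then dispatches the whole read/write set to a Calvin-style
# sequencer in a single request at commit time.

class SessionTransaction:
    def __init__(self):
        self.reads = set()
        self.writes = {}

    def read(self, store, key):
        self.reads.add(key)
        return self.writes.get(key, store.get(key))  # read-your-own-writes

    def write(self, key, value):
        self.writes[key] = value

    def commit(self, dispatch):
        # dispatch stands in for submitting to the deterministic log
        return dispatch(frozenset(self.reads), dict(self.writes))

store = {"a": 1}
txn = SessionTransaction()
txn.write("a", txn.read(store, "a") + 1)
log = []
txn.commit(lambda rs, ws: log.append((rs, ws)))
```

The point of the sketch is that nothing reaches the sequencer until commit, so Calvin still sees the full read/write set up front.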
I would instead say that one of the more significant differences between Calvin and Spanner is the much stricter requirements the latter places on its hardware (i.e. clock accuracy) in order to maintain its correctness guarantees; a weakness its variants also share.
If only things were that simple :) Calvin avoids the need to track clock accuracy by making every transaction go through a single consensus leader, which inherently becomes a bottleneck for performance and availability. Spanner and its derivatives, including YugaByte DB, chose not to introduce such a bottleneck in their architecture, using per-shard consensus instead -- this means caring about clock skews for multi-shard transactions and not allowing such transactions to proceed on nodes that exhibit skew greater than the configured maximum. The big question is which is more acceptable: lower performance & availability on the normal path, OR handling offending nodes with high clock skew on the exceptions path?
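The exception-path handling described above can be sketched as a simple admission check. The names and the bound are illustrative assumptions on my part, not YugaByte DB's actual API: a multi-shard transaction is refused on a node whose measured clock skew exceeds the configured maximum, while single-shard transactions are unaffected.

```python
# Hypothetical sketch: refuse multi-shard transactions on a node whose clock
# skew exceeds a configured bound; single-shard txns don't need the check.

MAX_CLOCK_SKEW_MS = 500  # assumed configurable bound, value is illustrative

def admit_transaction(shards_touched, measured_skew_ms):
    if len(shards_touched) <= 1:
        return True  # single-shard: correctness doesn't depend on cross-node clocks
    if measured_skew_ms > MAX_CLOCK_SKEW_MS:
        return False  # offending node: reject rather than risk inconsistency
    return True
```

This is the trade-off in miniature: the check only ever fires on the exception path, whereas a single global sequencer pays its cost on every transaction.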
Sorry, I can't let this go unchallenged. Again, you are inventing an architectural deficiency where there is none. The log in Calvin is partitioned and does not require all transactions to go through a single physical consensus leader. There is no single node in a Calvin cluster which must handle the entire global stream of transactions. The Calvin paper itself covers how this works in detail: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...
The article makes mention of Aurora and Spanner as inspiration; did CosmosDB not exist yet? I realize Cosmos is really more of a document database but given its ability to be queried with SQL in addition to massive horizontal scale and global replication seems like it should be more or less in the discussion.
YugaByte DB product manager here. Yes, CosmosDB and its underlying architecture were indeed not yet publicly available when we started the YugaByte DB project in early 2016. However, classifying CosmosDB as a distributed SQL database is a bit of a stretch given no support for SQL on the write path (the write path uses a custom document API). This means no support for multi-row/multi-shard ACID transactions, as well as features such as client-initiated session transactions.
Thanks for sharing these. Hadn't heard of the first 2, so looking forward to reading them. We did review the Amazon Dynamo architecture (which is used in Dynomite) in depth but found it to be lacking for supporting even single-row linearizability. Multi-row ACID with serializable isolation builds on top of that property. These are must-haves in a storage architecture to power a distributed SQL API.
assume you are referring to https://en.wikipedia.org/wiki/NonStop_SQL -- was not aware of it so thanks for bringing it to attention. first impression is that it was a technology way ahead of its time. in the late '80s (and some would argue even recently), the majority of apps could be run easily on monolithic/single-node SQL databases. they were not generating data at a volume that requires multiple nodes (in this case parallel computers), and user perception could indeed tolerate high latencies, manual failover and repair. we are now in a different era when it comes to data volume and user perception.
From a technical perspective, I'd guess that the reason was because it makes local transactions trickier to do. Spanner has a philosophy of avoiding work that isn't required, and knows how to optimize transactions that don't require 2PC-ing data on different servers.
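That optimization can be sketched in a couple of lines. This is a generic illustration of the idea, not Spanner's actual implementation: when every key a transaction touches lives on one participant, the coordinator can commit in a single consensus round and skip two-phase commit entirely.

```python
# Illustrative sketch: commit locally when all keys map to one shard,
# and fall back to two-phase commit only for multi-participant transactions.

def commit(keys, shard_of):
    participants = {shard_of(k) for k in keys}
    if len(participants) == 1:
        return "single-shard commit"  # one consensus round, no 2PC needed
    return f"2PC across {len(participants)} participants"
```

The win is that a well-partitioned workload mostly hits the cheap branch, so the 2PC machinery is paid for only when a transaction genuinely spans servers.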
He also did these awesome Tolkien-esque maps of the database engine ecosystem: https://martin.kleppmann.com/2017/03/15/map-distributed-data...
Anyway, I inject this sort of stuff directly into my veins, so thanks very much for the post!