This is just one of those solutions that is beautiful and simple and makes you ask why we didn't think of it. The goal of distributed consensus can be viewed as building a transitive order A < B < C < D out of individual ordering judgments: I might know A < B and C < D, someone else knows B < C, and eventually everybody can know the full sequence of items. Making this sequence contain metaïnformation about the entities that it contains has been a well-known trick for just about forever, usually done by using that metadata to elect a “leader” who handles the writes personally, for speed's sake.
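To make the A < B < C < D picture concrete, here is a minimal sketch in Python of merging partial ordering judgments into one sequence. It only illustrates the ordering idea, not how Aurora or any consensus protocol actually does it; the function name `merge_orderings` and the sample judgments are made up for the example.

```python
from graphlib import TopologicalSorter


def merge_orderings(*judgments):
    """Combine pairwise (earlier, later) judgments into one total order.

    Hypothetical helper for illustration only; a real system also has to
    handle conflicting or missing judgments, which this sketch ignores.
    """
    sorter = TopologicalSorter()
    for earlier, later in judgments:
        # TopologicalSorter.add(node, *predecessors): 'later' must come
        # after 'earlier' in the final order.
        sorter.add(later, earlier)
    return list(sorter.static_order())


mine = [("A", "B"), ("C", "D")]   # what I know
theirs = [("B", "C")]             # what someone else knows
full_order = merge_orderings(*mine, *theirs)
print(full_order)
```

With only my two judgments the order of B and C is undetermined; the third judgment from the other party is what pins the sequence down.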
But the idea of switching from a single leader to a quorum and then using the metadata to commit to overlapping quorum groups in order to preserve this transitive property between orderings was just really really slick.
About three years ago there was an early presentation about the backend, https://www.youtube.com/watch?v=-TbRxwcux3c, which made it seem like the 4/6 thing was just a normal replicated database with a fixed topology tied to their availability zones. I didn't realize that they were also dealing with some sort of consensus overhead to enable swapping nodes in or out.
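For anyone who wants to see why the 4/6 number matters, here is a small brute-force sketch in Python. It assumes six copies, a write quorum of 4 and a read quorum of 3 (the read quorum is my recollection of the Aurora design, not something stated in this thread), and checks that every pair of quorums shares at least one node. The function name `quorums_overlap` is hypothetical.

```python
from itertools import combinations


def quorums_overlap(n, write_q, read_q):
    """Check by enumeration that quorums of the given sizes always intersect.

    Assumption for illustration: nodes are interchangeable and any subset
    of the right size counts as a quorum.
    """
    nodes = range(n)
    write_sets = [set(c) for c in combinations(nodes, write_q)]
    read_sets = [set(c) for c in combinations(nodes, read_q)]
    # Any two write quorums share a node, so writes cannot diverge silently.
    writes_intersect = all(a & b for a in write_sets for b in write_sets)
    # Any read quorum shares a node with any write quorum, so a read
    # always reaches at least one copy that saw the latest write.
    reads_see_writes = all(r & w for r in read_sets for w in write_sets)
    return writes_intersect and reads_see_writes


ok_4_of_6 = quorums_overlap(6, 4, 3)
bad_3_of_6 = quorums_overlap(6, 3, 3)
print(ok_4_of_6, bad_3_of_6)
```

The second call fails because two write quorums of size 3 can land on disjoint halves of the six nodes, which is exactly the overlap that the 4/6 choice rules out.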
There was also a somewhat marketingy presentation about this technology in the deep dive at https://www.youtube.com/watch?v=U42mC_iKSBg which I confess kinda went over my head at the time, so these two blog articles are a wonderful condensing of the heart of the service for me. None of the companies I've worked for has made much use of Amazon but it's a really clever idea that certainly deserves to kick around in the back of folks' heads.
Yes, a single coordinator using other machines to do a lot of the work does make some otherwise very difficult problems seem more tractable.
An interesting unit for this kind of design outside the cloud would be a single rack of servers (or however much you can hook up to one switch), which would ensure pretty low latency to your remote RAM/disk.
Whenever I see someone tout the magical properties of their distributed systems, I always want to see a report from Jepsen, which takes their system apart like a rotisserie chicken.
If only that were true. Consistency is a design constraint, not a production problem. Making things work well in the real world, in production, is the hard part; this is what distributed systems are actually about. And I like that about Aurora: it focuses on the hard part.
I like how it has almost become the norm to post the link to Adrian Colyer's blog about any paper on distributed systems rather than a link to the paper itself. :)
Surprised there wasn't any mention of [1] in the paper's citations. This notion of articulating the internal sub-systems of a DB as first-class elements of a distributed realization has been emerging (from various quarters) for the past few years. I would also include Apache Pulsar in this category. Aurora itself seems very interesting from an architectural point of view and definitely worth studying.