I don't find it surprising that so many distributed databases fail at ensuring consistency; it's a very hard problem. What really gets me is that they deliberately, knowingly sacrifice consistency for performance, and then claim the opposite in their documentation/marketing.
Every time the topic of consistency in distributed systems arise I see this idea mentioned in a way or the other that the problem is how hard strong consistency is. To be honest is a problem solved a long time ago by consensus algorithms: there is a lot more work to do in the future in order to explore other useful consistency models, but if you want strong consistency you know you have "just" to use a consensus algorithm like Paxos or Raft or Viewstamped replication or whatever, and you end with a CP system. Of course it is non trivial even to implement those systems correctly, but the point is, most implementations tested by Aphyr don't attempt to do this at all most of the times.
There will be bugs but in the end if your goal is to write a consistent store, you can do it. Bugs in wanna-be CP systems and/or failed attempts at implementing CP because of home-made protocols with flaws are part of the issue and part of what Aphyr's research uncovers, but IMHO the central point is a different one: CP systems have performance limits so many real world systems don't even try to be CP, which is fine, it's up to the designer to pick the DB design and tradeoffs, but you need to document it properly. What happens during partitions? What consistency model the system employs if it does not feature strong consistency? Is the system Available during partitions? And so forth. As long as the documentation is honest, it's up to the user to understand, for its use case, if the system is a good fit or not.
What we are seeing often is an attempt to informally document what should be well specified. Sometimes there are also incorrect things stated in bold letters in the documentation, like "the system is consistent but sacrifices partition tolerance" which does not mean anything in the context of partitions actually happening in the real world.
Aphyr's effort and our collective effort as people working, implementing and using databases, in my opinion, is not to end with databases designed all the same way or all providing strong guarantees, but with clear understandable and honest documentation describing the system behavior.
> but IMHO the central point is a different one: CP systems have performance limits so many real world systems don't even try to be CP
They try on paper, so to speak. They might have a CP component, but it is not in the datapath -- it is there for maybe cluster configuration, or say to elect a leader. And then they can claim they have CP because everything goes through the leader, and the leader was elected by a CP component. However without realizing they just pushed all the error cases into corners -- when the membership changes (nodes die, die in unpredictable ways, become partitioned, new nodes added...). They never test that extensively, which is a hard thing to do, because there are so many cases and combinations.
The bottom line is even if they have a Paxos or Raft component in their system doesn't mean their data is stored consistently.
The only thing I'd take out of your comment is "all providing strong guarantees". I'd say that if you document that it does provide those guarantees that it damn well better do that (modulo bugs of course).
That's really been the most enlightening part of the series for me.
Reinforces that the hardest part in engineering is rarely the technical problem. Distributed databases are really f*ing hard, but infinitely harder if the people can't work together or don't open up to faults.
From experience I think some of the problem is that not everyone appreciates the importance of correctness in this sort of system. At the very least you should clearly documenting your expected failure modes, so that it's possible to build correct things on top of it.
In part it's hard to convince everyone of the importance of spending time on what looks like an unlikely corner case until after the outage - when it also becomes much harder to fix as you usually need to build this into the core design of your system.
>I don't find it surprising that so many distributed databases fail at ensuring consistency; it's a very hard problem
It's not hard — it's slow. Sending reads and writes through Paxos or Raft gives you sequential consistency. But not surprisingly, touching a quorum of nodes for every operation is too slow to be practical for many workloads.
And that's usually fine — most data aren't bank account balances.
Online transaction processing creates "pending" transactions, and the data is often inconsistent. Your charge may exist in the merchant's database but not post to your online banking for several hours. Or it may be a wildly different amount - i.e. gas stations will place a $100 hold on your debit card and it will stay that way for days until it's settled for the actual (lesser) amount. If you were accidentally double-charged, than rather than processing a separate refund, the merchant may simply not settle the duplicate charge, and it will drop off your pending transactions... eventually. The lag time may be several days or weeks. If it's a debit card, you can't spend the money during that time and you may be temporarily broke because of it.
If you make an ACH transfer, money will disappear from your account one night, spend a business day in the aether, and then post to the recipient's account on the third day. The system is in inconsistent state (i.e. money is missing) for at least a day, possibly a whole weekend.
The actual transaction is settled and goes into "posted" state with a lag of 2.5 * 10^8 ms - i.e. 3 business days. That's if you're lucky. Banks do need strong consistency, but not in anything approaching realtime.
Even ZooKeeper could probably handle the U.S.'s financial transactions faster than current infrastructure.
And what would shock you most is some of the methods used to send those transactions (eg ftping a text file, and if it gets corrupt, someone opens it in vi to fix it - I have a friend who used to do exactly that).
I think you touched on a key point a lot of people don't consider. Not all data is the same nor does it have to be treated the same. Many distributed systems will choke on a piece of data here and there and may not give the appropriate response. But you need to consider what you're working on and if that's OK from time to time given the advantages the system you're considering offers.
What might be considered an appropriate data store for one data set may not be for another. And the type of information a lot of distributed systems are handling are far from bank accounts, for example.
And if you're handling ultra important, sensitive data there are techniques that have been available for many years (two phase commit, for example) that can help.
I love this series but I'm blown away by how many engineers here automatically assume a system is a failure because it doesn't pass a certain type of test from time to time. I do agree with the series that the marketing materials shouldn't claim things, however. Companies should be honest with what their systems weaknesses currently are.
Though, it's not even remotely okay if your documentation and marketing material indicates that consistency isn't being sacrificed for performance. :)
I suspect that no experienced programmer expects to get data safety for free. We do expect that the documentation doesn't lie to us when describing the level of data safety provided.
Right, that's the real value here — holding databases accountable for their claims. Zookeeper clearly documented that writes are consistent but reads aren't, so it "passed" Jepsen cleanly. The Etcd guys claimed reads are linearizable, so Aphyr tested for linearizability and it failed.
I totally agree, just a note: write safety and consistency are not the same thing. An example is a CP store that breaks linearizability in order to serve reads without asking other nodes, but just providing the local value. Writes are as safe as a true CP store, but the consistency model is different and weaker. It is true that many times the replication model employed also has effects on the ability of writes to survive, I just mentioned the above example to show it is not always the case.
Thanks to aphyr for the testing. Certainly it will help focus efforts on improving things.
I am a MariaDB contributor and operate Percona Cluster in production, so I can talk a bit about Galera.
It's recommended that writes go to one master, rather than be distributed across the nodes. That will help with isolation issues.
Also, some commenters have complained about year-old releases. PXC has improved significantly in the past year regarding manageability, so you may want to try again. For example, the startup script has a bootstrap option now.
For most people, vanilla async MySQL replication works best, esp. 5.6. But Galera gives you another option when you need something else.
Having said that, it takes 5-10 years for a database or filesystem to mature, so anybody using Galera now is an early adopter.
I flirted with MariaDB Galera Cluster. This was on Amazon EC2. Frequently mariadb would just go down, not the server mind you. Frequently the whole cluster would go down as well. To bring it back up, I'd have to bootstrap the whole cluster.
Essentially, what a pain in the backside to use.
This did not fill me with confidence and thankfully did not go into production and later on went with Postgres/CitusDB.
I had similar issues, but this was a few years ago. It was not something I was comfortable deploying in any production environment because it fell apart so many times in dev. Luckily I caught these issues before it was used in production.
Everytime i read one of those post (or other about mongodb failures), i keep thinking about this talk http://youtu.be/4fFDFbi3toc
And why noone tried to repeat their strategies for building a robust db system : start by building an extremely robust failure simulation and testing facility. Then build your product.
Actually, i think what those guys at foundationdb did was so exceptional, that by buying the company and killing the product, Apple harmed the software industry for the next 10 years. The fact the foundationdb is mentionned in OP as the only distributed db system one could recommend makes me more confident making that statement.
This might be overdoing things a bit. FoundationDB was a rigorously tested KV store with strong consistency (uncommon for such a product). Yes, that talk from StrangeLoop was great, but there's two common misconceptions here:
(What follows is my opinion / guesswork as a VoltDB employee)
1. FoundationDB wasn't killed by Apple; it was rescued by Apple. The product couldn't compete on just being a KV store and wasn't doing well in the market. Apple saw a very bright and now experienced team and scooped them up for a song.
2. Before this happened, FoundationDB realized they needed a way to query their system to compete, so they bought Akiban (a failing SQL db company) to add SQL to their system. But they assumed they could do this without deep integration, which was wrong. They added a SQL "layer" on top of the KV store and it was way to slow to be practical. The benchmarks they published were embarrassing.
Actually, the thing that i find most impressive in their tech stack is the approach they took for building it starting with the simulator + c++ extension. Those are the technology that i think would benefit all the community, if they were ever open sourced.
As a voltDB engineer, how do you ensure your implementation doesn't compromise the theorical correctness of your system ?
VoltDB is actually a bit simpler with what it promises, full serializable ACID for all transactions. This is much easier to understand and verify than lesser isolation.
We think what we've done is pretty clever too. We've built a determinism checker into our replication engine, so that we can verify that each replica has the same state at each logical point in time, and each operation makes identical changes to that state.
Then we built test patterns that are designed to be as co-dependent as possible and run them against a replicated VoltDB cluster. That VoltDB cluster goes though one or more kinds of failure, including multiple simultaneous failure, and then a checker ensures no data is ever lost, corrupted or run in the wrong order.
It's different from the FDB thing. The simulation they did is certainly easier to run on a pure KV store, but keep in mind we also have to test SQL that queries millions of rows, along with upserts, materialized aggregations, etc...
We're working on some blog content on this in addition to my talk. Stay tuned.
Great, i'm looking forward to see that talk. To me, it seems that the only way you can make sure that an implementation is correct, is either via extensive testing ( and when talking about acid property over a distributed system, i think the kind of simulator fdb built is a requirement), or run a theorem proover ofer your source code, ala Coq. But i've never heard of any code analysis tool that is able to guarantee properties over a distributed system.
But, since i'm absolutely not working on the field, i'm really looking forward to see what professional people like you are finding to tackle those issues.
FoundationDB was hardly redefining the database industry. There is more to a database being successful in the market than how well it fairs in one particular guy's blog posts (as excellent as they are). Apple isn't harming anyone.
I know. Sorry, I should have been clearer; what I meant was Aphyr has had some fine results for SQL DBs, provided you avoided distribution. And he has had some fine results for NoSQL DBs, when distributed, provided you preferred availability to consistency (generally in certain scenarios, or with certain caveats). What he has not had, and so presumably was lamenting, was any distributed DB that offered transactions that passed any sort of muster, except for FoundationDB, which was made defunct. That, or any DBs he could recommend that met every claim their documentation made (as that's largely what he's testing for).
I meant that with this being the first distributed SQL store, he was saying there was nothing he could recommend that actually offered similar guarantees to what this was claiming. That is, ACID style (well, technically --ID I believe) DB transactions in a distributed environment. FoundationDB did (though it was NoSQL), hence his mentioning it. But that's different than there is nothing he can recommend, at large; he can't recommend similar solutions, but for a given problem he could likely recommend a compelling, different, solution.
I wish more companies supported people like Stripe does for Aphyr. Research is so valuable, but there's so little incentive to be in academia. There's a certain irony that undergrad left me with so much debt I couldn't really consider joining a research university...
> I wish more companies supported people like Stripe does for Aphyr. Research is so valuable, but there's so little incentive to be in academia.
If you're talking about incentives, don't forget that Kyle's research is some of the most commercially useful research out there at the moment, and Kyle is also skilled at implementing his ideas (a rare combination among researchers). I'm actually quite pleasantly surprised that he's putting so much effort into these blog posts rather than writing a string of journal papers.
Is there a point to writing journal papers if you're not playing the academic publication / citation game? Web sites are easier to publish and have wider distribution than academic journals.
But there has been damning evidence coming out recently that much of the historical "peer" reviewed studies are in fact wrong or have outright flaws. Further, I'd say that with the tools being provided for free and the blog posts Kyle's findings are most likely peer reviewed... And copany reviewed... And product team reviewed..
> But there has been damning evidence coming out recently that much of the historical "peer" reviewed studies are in fact wrong or have outright flaws.
Sure, I agree (well, I'd probably replace "much" with a less strong word, and skip the "damning", but that's just semantics). Plenty of studies haven't been demonstrated to be wrong, and plenty of others were wrong despite being correctly designed experiments (something both peer reviewers and the scientists performing the experiments could not have caught). Moreover, many flawed studies haven't been accepted by peer reviewers, which is in fact the purpose of peer review. The biggest problem with peer review is probably publication bias against negative results, without which I suspect most demonstrated scientific fraud wouldn't exist, but that doesn't mean the whole process needs to be thrown out.
> Further, I'd say that with the tools being provided for free and the blog posts Kyle's findings are most likely peer reviewed... And copany reviewed... And product team reviewed..
Yes, Kyle does (hell, he's been cited in academic papers). The average person publishing blog posts does not, though.
After spend an hour reading Retraction Watch it's absolutely clear to me that academic scientific publishing has nothing to do with verifying correctness and everything to do with politics. [1]
The idea that peer review has nothing to do with verifying correctness is extremely misguided (as would the idea be that politics don't enter into it at all). Or do you feel that the existence of retractions invalidates the entire scientific process?
There is no scientific process. If you spend some time looking at the posts you'll see that the reason for the retraction there has to do with skewing results to match the views of the those funding research and/or to sway opinion in a direction.
Retraction Watch itself has the title, "Tracking retractions as a window into the scientific process." So clearly they believe that there is such a thing as the scientific process. Retractions are an important part of it--science needs to be robust to mistakes.
I know zero about academia – could he also publish this work in a string of journal papers and reap the rewards (whatever they are) of both a successful and widely useful blog series and more formal academic recognition?
Absolutely. But I doubt he'd find as much value for his time by writing journal papers at the moment, given that Stripe will essentially pay him a decent salary to continue his open-access research. There's always time for a retrospective book :)
Is there any data store that has come out of these tests looking good? Every Aphyr post I've read had a pretty big failure at some point, even for datastores I considered solid, like Riak or Cassandra.
Huh, compared to what the other datastores got, that's high praise. Too bad zookeeper is more for configuration management rather than for storing generic data.
FoundationDB... which is probably one of the reasons that they were acquired. Aphyr mentions them at the end of the article:
"You might adopt a different database–though since Galera is the first distributed SQL system I’ve analyzed, and FoundationDB disapparated, I’m not sure what to recommend yet."
I don't think Aphyr ever tested FoundationDB, that I can find.
They ran some tests themselves using the Jepsen framework, but given that a large part of the testing is setting up an environment that exposes the system's weaknesses, that doesn't give me the same level of confidence.
'Looking good' is pretty subjective. While a lot of these datastores include some sort of "oh, and you can use it this other way", and their marketing documents make it sound like it is totally suitable, the point of these tests is to see what purpose you can put these datastores to and have them be a proper fit. While some are basically "you can't use this unless you really don't care about your data", the two you list are actually great examples of "provided you use them in their niche, they're great; otherwise, they're not". Cassandra, use when writing high volumes of AP immutable data. Riak, use with CRDTs for AP mutable data.
There is far more aspects to big data than just how much data you store.
But regardless you completely missed the point. PostgreSQL was tested as a single instance. Everything else clustered. It's apples and oranges. The clustering part is the hard part here.
My point was simply to remind people that over complicating your situation and going for a "big data" store is usually a mistake when there's something that will keep your data safe when it isn't as big as you assume it's going to be.
I think the takeaway, about a dozen posts back, was that your clustered database is shit and you shouldn't rely on it. Sharding at the client level may be the best approach. Sometimes the old ways are best.
Those are fundamentally different things. Sharding is a scaling strategy for reads, writes and data size. It doesn't address durability (which many distributed databases do). It doesn't address the discoverability problem of how you know which piece of data is where (easy if you have a well-distributed natural key, potentially difficult if you have a poorly distributed natural key, potentially obscenely difficult if you don't have a natural key). It doesn't address data locality (it can, but then you need to migrate keys, and end up with a second-order 'which key is where?' problem). Most crucially, it doesn't address what you need to do if you have to do transactions (of any kind) across shards. You can try invent something to do that. You may get it right (probably using consensus or 2PC), or you may get it fundamentally wrong. Either way, it's a tricky problem.
Sharding is good, for sure. It's not a replacement for this kind of technology.
Aphyr is doing great work keeping vendors honest, and making sure they live up to their marketing. I don't think any of this says that clustered databases are inherently shit.
That's fair. I may have been a little cavalier with the use of the word shit. Though I note auto sharding is a headline feature for many systems tested. And as for durability, many seem to fall short of rsync running out of cron.
I think SQL Azure that uses SQL Server underneath would pass his tests. But then again it's not hot valley stuff put out by engineers who discovered distributed systems 2 years ago.
Isn't the bigger problem that it is impossible to test any DBaaS effectively since we cannot purposely cause failures and network partitions? The general theme of Call Me Maybe has been that these projects overstate their actual capabilities, and the only way to know is to check.
It's no mystery that the kinds of systems with massive reach--Google search, Facebook, Twitter--are really not data-consistency-critical applications.
Who'll ever know if a search result isn't perfectly up to date or perfectly accurate?
Who'll ever know if you missed a Facebook feed entry because it "wasn't relevant" or simply wasn't seen due to DB vagaries?
And who'll ever know about a few tweets going astray here or there?
In all cases, they're all likely "eventually consistent" (or close to it), but it's no accident that it doesn't ultimately matter in those massive scale examples.
And maybe that's the secret to massive scale--it can't ultimately matter.
One of the startups I worked on was a classifieds aggregation engine pulling data from external feeds.
We basically queued all retrieved items for processing with no attempts at avoiding data loss whatsoever - including using in memory queues for lots of things.
Our reasoning was that if a machine crashed, worst case was that a few listings would take up to 24 hours to update, but generally much less (we adapted crawling rate for our sources based on change frequency; so large sources of listings would get re-indexed far more frequently, so if a feed didn't update or 24 hours it'd be because it wasn't a source of much data), and we could force refreshes of the data.
Some people were horrified at the approach because the idea of ensuring consistency and not losing data is so ingrained. But the reality is that you need to measure the cost of consistency up against the value it provides. And often it's not very valuable, especially when there is an authoritative source of the data to recover from and when the data will be outdated quickly anyway.
A lot of the time any notion of consistency is an illusion anyway - by the time the page has returned, the results are outdated - and what matters is maintaining the illusion (e.g. ensure that if a user makes an update it's reflected in the page that's returned).
The key is you need to know the tradeoffs and apply them consciously rather than get caught out by tradeoffs components you rely on makes without telling you..
Absolutely. I wrote a system to provide real time signature updates to Symantec's (then MessageLabs) global anti spam infrastructure. It used UDP to send them. It worked amazingly well and distributed signatures across thousands of servers worldwide in milliseconds and if a packet got dropped oh well. It's about choosing the right tool for the job.
People used to purely transactional systems where every piece of data they received were important who were not used to think about data as potentially disposable... Nothing they didn't pick up quickly enough, but a bit of a mental adjustment (for my part, my job right before that one was running a billing system; I was very happy not to have to worry about that any more...)
Can I point out that I'm utterly impressed and astounded by Kyle's ability to remain detached and professional in the faces of what appear to be outright falsehoods stated by some of these companies?
If I had gone to his lengths to critically evaluate the safety of a database system, and then it comes out that the marketing materials or the words of the developers were.. significantly misleading at best, my first response is likely to be a profanity laden rant, not a cool recounting of how and why they're wrong.
Despite the reliability, that product sounds nice -- is there anything for Postgres with similar ease of setup for replication etc. that someone can recommend? E.g. EnterpriseDB ?
PostgreSQL replication isn't hard these days, since 9.x and beyond its just a couple of lines of config, if you want active/active that's a different story for any ACID compliant databases, I'd look at CitusDB / Postgres-XL
AWS Aurora seems to go at the problem differently. They spread the disk out over three AZs in 10GB chunks. So if the instance powering the Aurora db engine crashes, a new one spins up quickly and begins pulling from the same disks.
> AWS allows for an ops team to completely have no concern for hardware. Lovely.
I split my time between managing a few racks worth of servers at one company, and an AWS setup at another client. The amount of ops work per server/instance is higher for the AWS client than it is for the company I manage physical racks for. That's despite the fact I do physical cabling and racking of equipment as necessary.
The issue is that there are so many aspects of the AWS infrastructure that takes extra effort because we don't have full control. E.g. we can stuff whatever disk subsystem we want in the servers and not have to work our way around the lack of any truly fast disk subsystems in EC2. So for every day I don't physically move serves, I spend 3-5 working around AWS limitations.
I agree with you with respect to what you want, but we're taking baby-steps today - AWS is way too expensive for most people to move to it, even before you factor in the ops complexities.
Same on Azure. Their networking stack is ... strange and undocumented. It's the worst IP network I've used in a while. Dealing with IP/TCP-level issues should not be something that's taking up any noticeable amount of time on a modern network. With Azure, it is.
I certainly think this will be the trend for startups but I believe it will take much longer for established companies to get out of the buying hardware business. Often times running locally can be drastically less costly than running on Amazon -- especially if you already have the staff dedicated to that sort of role.
I don't know of any reports on it, but from personal experience, the amount of time spent on average per physical server I configure and shift around is a tiny fraction of the time I spend dealing with AWS peculiarities you don't have to deal when you control your own hardware (I currently split my time 60/40 between two clients - one with a physical hosting environment set up as a private cloud split over two colo's, and one in AWS)
The company I manage physical servers for would go bankrupt if we were to pay AWS level rates. Last time we priced it out we were looking at ~3x hosting cost - including fully loaded cost for staff time spent on maintenance etc.
I wouldn't have guessed that from the title. What I got was something security related to MariaDB. And that's because I got used to the funny gifs and the `hacking is cool` mentality of the security industry...
It's a fairly long running and well known series. The first one was published a couple of years ago and was themed around the Carly Rae Jepsen song titled "Call Me Maybe" that was popular at the time.
From that a testing harness named Jepsen was also developed.
Sure, no problem with that. I just don't understand why we can't call e.g. a buffer overflow in software X a buffer overflow in software X instead of "Adventures in Memoryland!!! The bug, that 0wnzors your boxorz!!".
> I just don't understand why we can't call e.g. a buffer overflow in software X a buffer overflow in software X instead of "Adventures in Memoryland!!! The bug, that 0wnzors your boxorz!!".
Since you appear to be sincere, it's because sometimes a lot of people like to interact in human ways, which includes a bit of storytelling to relieve the monotony of just the facts.
Yes, that was my first impression guessing from the title. All I'm saying is that the title should be a rough approximation of the article so that the reader can have some vague idea of the contents.
I'd say anyone who puts that much time and work into it is allowed to call it whatever they want. Besides, as DannoHung pointed out, it's actually quite appropriate.
Last time one of these posts came up, someone made the interesting point that this series, and especially the Jepsen project, may well be around long after everybody has forgotten about their namesake.
“Call Me Maybe” was the #1 selling single of 2012, was nominated for 2 Grammy awards, has been covered and parodied by numerous other artists, and is still played all the time on pop radio stations. The Youtube video has been viewed 700 million times. I don’t think it’s going to be forgotten anytime soon.
I'm not nearly as sure about that. Software products, like pop stars, come and go. "Call Me Maybe" and "Call Me Maybe" have about equal chances of outlasting one another, as do "Jepsen" and "Jepsen".
See https://github.com/codership/galera/issues/336#issuecomment-... (linked from the article)