Strong Consistency Models (aphyr.com)
52 points by cjbprime on May 18, 2014 | 9 comments



Very good post! Well, all of Aphyr's "Call Me Maybe" series is great.

A lot of modern systems are distributed, and quite often it seems they are designed and built without much thought given to these kinds of issues. In the "Call Me Maybe" series, for example, he shows how a lot of popular NoSQL, webscale, etc. databases really fall on their faces when network partitions hit.

And there clearly is enough literature written on this, but it is scholarly articles or research papers that don't necessarily get read by the architects of most distributed systems. (Heck, I imagine a lot of distributed systems are created in an ad-hoc way. Something like, "Hey Jim, we need a hot-spare fail-over machine in the US-EAST zone"... and now they have a distributed system.)

Also, I am not sure how much this is taught in schools. When I was in school, distributed systems meant MPI, mesh topologies and so on. HPC-type things. Nobody said absolutely anything about network partitions.

Speaking of network partitions, the question I have is: can one make the chance of partitions happening low enough to be discounted? Or does it not even make sense to talk about that? For example, imagine the classic scenario of a single data center. Switches do fail sometimes. What if there is a redundant network, running on a separate physical interface (say eth1, not the default eth0)? Now partitions can be detected, and a simultaneous failure of two network switches now has to happen. This is not even a hypothetical scenario. There is a Japanese NoSQL database out there that nobody has probably heard of -- Hibari. It features such a "partition detector" application.

http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en....

And the code:

https://github.com/hibari/partition-detector

I saw that a while back and now I remembered it. So is that a practical solution, or are there fundamental theoretical issues with this approach?
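
Roughly, what I have in mind (just an illustrative Python sketch, not Hibari's actual Erlang implementation; the interface names and heartbeat ports are made up) is heartbeating from the peer over both interfaces and only suspecting a partition when both paths go silent:

    # Minimal sketch of the dual-network heartbeat idea (illustration only).
    # Assumed setup: a peer sends UDP heartbeats to us on two interfaces,
    # e.g. the addresses bound to eth0 and eth1.
    import socket
    import select
    import time

    HEARTBEAT_TIMEOUT = 2.0                # seconds of silence before a path is suspect
    PORTS = {"eth0": 9000, "eth1": 9001}   # hypothetical heartbeat ports per interface

    def listen(port):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("0.0.0.0", port))
        s.setblocking(False)
        return s

    def monitor():
        socks = {listen(port): name for name, port in PORTS.items()}
        last_seen = {name: time.time() for name in PORTS}

        while True:
            readable, _, _ = select.select(list(socks), [], [], 0.5)
            for s in readable:
                s.recvfrom(65536)                  # drain the heartbeat packet
                last_seen[socks[s]] = time.time()  # refresh that path's timestamp

            now = time.time()
            dead = [name for name, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
            if len(dead) == len(PORTS):
                print("both paths silent: suspected partition (or the peer is simply down)")
            elif dead:
                print("path %s degraded; peer still reachable via the other interface" % dead[0])

    if __name__ == "__main__":
        monitor()

Even then, of course, total silence can't distinguish "the peer crashed" from "we are partitioned", which I suppose is where the theory bites.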


On the partition side of things... not really. Usually you end up with coordinated failures of various types.

For example, let's say you run redundant networks. Are the physical paths diverse? What happens if someone trips on BOTH network cables? What happens if the redundant network is on the SAME switch hardware? What happens if you do use two network switches, but they are hooked up to the same UPS? Or both UPSs are on the same circuit breaker?

The classic reference on this is from Google: http://www.catonmat.net/blog/wp-content/uploads/2008/11/that...

There are a number of theoretical issues with these approaches, starting with the FLP result.

Also realistically, eventually all practical solutions fail, often due to user error.


OK, let's say mostly diverse physical paths, both tied up nicely into bundles, different switch hardware, different UPSs, different power grids (some nice data centers will even have that!).

At this point, other kinds of odd failures are more likely than a failure of both network paths at the same time: say, meteor strikes, or a disgruntled employee wiping out backups and data... Can one even begin then to consider an AC system? And if so, what would it look like?


Nice article; all of this was covered in my Concurrency and Multithreading course in university and we used this book: https://www.elsevier.com/books/the-art-of-multiprocessor-pro...

It's a really great book, I suggest people read it if they are interested in this type of stuff (it focuses on concurrent, not distributed systems, so it's not quite 100% the same topic).


This is a very interesting and useful post. I like how Aphyr started with linearizability, which despite the unfamiliar name is much easier to reason about (for most) than serializability and most other models. This is well worth reading for anybody interested in distributed databases.

A few nit-picks:

- The notion of linearizability he uses seems to be implicitly stronger than the original definition from Herlihy and Wing (http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). Compare Theorem 1 in that paper to the hints made about global consistency in the post. Still, this is very nit-picky, and the main points of the definition are captured.

- Serializability, as typically implemented in most of the RDBMSs in common use, is both weaker (as Aphyr points out) and stronger than the notion used in the literature. Weaker is covered well in Aphyr's other posts. Most databases are also stronger in some sense, and closer to strong serializability, because they allow clients to observe values directly in transaction scope, which constrains their ability to reorder transactions. There are many legal reorderings of transactions that databases could do under the definition of serializability that they don't do.
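
To make the linearizability definition concrete, here is a toy brute-force check for a single register (my own Python illustration, not anything from the post or the paper): a history is linearizable if some total order of its operations respects real-time precedence and has every read returning the latest preceding write.

    # Toy brute-force linearizability check for a single register.
    # Illustration only; real checkers are much smarter than trying every
    # permutation, which only works for tiny histories.
    # An operation is (start_time, end_time, kind, value).
    from itertools import permutations

    def is_linearizable(history, initial=None):
        def respects_real_time(order):
            # If op b finished before op a started, b must not be ordered after a.
            for i, a in enumerate(order):
                for b in order[i + 1:]:
                    if b[1] < a[0]:
                        return False
            return True

        def reads_see_latest_write(order):
            value = initial
            for (_, _, kind, v) in order:
                if kind == "write":
                    value = v
                elif v != value:   # a read must return the latest preceding write
                    return False
            return True

        return any(respects_real_time(o) and reads_see_latest_write(o)
                   for o in permutations(history))

    # Non-linearizable history: write(2) completes, yet a later read returns 1.
    history = [
        (0, 1, "write", 1),
        (2, 3, "read", 1),
        (4, 5, "write", 2),
        (6, 7, "read", 1),   # stale read that began after write(2) ended
    ]
    print(is_linearizable(history))  # False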


Great write up!

I wish there was a little bit more color around what "availability" really means. Yes some data stores are offering higher availability, but that doesn't mean they are immune to node failure. Even for a "CP" system, latencies can go up, and with enough node failures, partial unavailability and data loss can occur.

And few systems survive the loss of a rack without some issues! Let alone a partition that splits the network in half!

In theory, a strongly consistent system could have a failover window so small that it is, from a user's point of view, always available. It would have a very easy-to-understand programming model. But such a system doesn't really exist, open-source-wise, at this time. So we just won't know.


> I wish there was a little bit more color around what "availability" really means.

The definition of availability used by Gilbert and Lynch in their CAP proof requires availability for updates at all nodes. Other proofs use availability for updates (or consistent reads) at a majority of nodes.

> Yes some data stores are offering higher availability, but that doesn't mean they are immune to node failure.

All theoretical models like this are always susceptible to real-world implementation limitations. Still, it's very useful to think about these theoretical models, because they tell us what we could achieve with an ideal implementation. Knowing that allows us to choose implementation techniques and approaches suitable to the problem, and to not waste time trying to implement the impossible. Just because something is possible doesn't mean that it can be, or has been, practically implemented. If something is impossible, though, we know we shouldn't spend time trying to do it.

> Even for a "CP" system, latencies can go up, and with enough node failures, partial unavailability and data loss can occur.

Of course. These models are actually pretty silent on durability, which is frequently seen as a separate matter, or as a limiting case of availability, depending on the area of research. Durability is practically very important, but this isn't really the area of research that addresses it.

Partial unavailability because of theoretical limitations and partial unavailability because of implementation limitations are different things. We can improve on the latter, and insist our software vendors do the same, but the former we just need to work around.

> In theory, a strongly consistent system could have a failover window so small that it is, from a user's point of view, always available.

The definition of consistency and availability doesn't allow you to "fail over". Completely ignoring availability, there are two cases of failover: one where the system still appears to be consistent (using one of the strong consistency models described here), and one where the system becomes eventually consistent. The various proofs (such as Gilbert and Lynch's CAP proof) imply that you can't "fail over" and keep consistency in the case where some nodes are uncontactable. The definition of "some nodes" depends on the exact proof, but there is no way to fail over into a minority partition and keep consistency. It's not possible.

On the other hand, there are loads of practical and useful ways to fail over into a minority partition that still give useful eventual-consistency semantics. It all depends on what you need.


Yes, there is the notion of what CAP requires, but we cannot cling to the strict definitions of these papers. They must be translated into layperson's terms. If you don't, how else can you communicate with your stakeholders? How can you communicate with your users?

Personally, I reject the notion that this is outside the scope of our jobs. Think of the ability and power of a so-called 'renaissance person', someone who can do anything.

> The definition of consistency and availability doesn't allow you to "fail over". Completely ignoring availability, there are two cases of failover: one where the system still appears to be consistent (using one of the strong consistency models described here), and one where the system becomes eventually consistent. The various proofs (such as Gilbert and Lynch's CAP proof) imply that you can't "fail over" and keep consistency in the case where some nodes are uncontactable. The definition of "some nodes" depends on the exact proof, but there is no way to fail over into a minority partition and keep consistency. It's not possible.

Minority partition: maybe you could restate this paragraph in terms of discoveries such as Paxos? Does Paxos not allow progress and data retrieval during some failure scenarios? Yes, realistically, once enough nodes are lost, things grind to a halt. But this is how Dynamo works as well: once the number of nodes available declines below the R or W factor, the algorithm stops making progress.
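
As a rough sketch of where that threshold sits (my own Python illustration, not Dynamo's or any Paxos implementation's actual code): with N replicas, a write needs W acknowledgements and a read needs R replies, and choosing R + W > N is what makes read and write quorums overlap. Once fewer than W nodes (or, for Paxos, fewer than a majority) are reachable, progress stops.

    # Rough sketch of quorum-style progress rules (illustration only, not the
    # actual Dynamo or Paxos algorithms). N replicas, W write acks, R read replies.

    def can_write(alive, w):
        # A write only commits if at least W replicas acknowledge it.
        return alive >= w

    def can_read(alive, r):
        # A read only succeeds if at least R replicas respond.
        return alive >= r

    def quorums_overlap(n, r, w):
        # R + W > N forces every read quorum to intersect every write quorum,
        # so a read always contacts at least one replica holding the latest write.
        return r + w > n

    N, R, W = 5, 3, 3
    print(quorums_overlap(N, R, W))   # True: 3 + 3 > 5
    print(can_write(alive=3, w=W))    # True: just enough replicas reachable
    print(can_write(alive=2, w=W))    # False: below W, writes stop making progress
    # Majority-based consensus (Paxos, Raft) is analogous: a minority partition,
    # e.g. 2 nodes out of 5, cannot commit new values or elect a leader.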


> but we cannot cling to the strict definitions of these papers

This seems like an incredibly strange sentiment; I cannot grasp what you mean. In another field, it seems like you would be trying to say "we cannot cling to the strict definitions of gravity." At the end of the day, proofs are proofs.





