Forfeiting Partition Tolerance in Distributed Systems

jhugg · on July 6, 2015

I don't understand. When you run into a partition, you need to make a decision: risk being inconsistent, or shut down. Ignoring the problem is just choosing inconsistency by default. You haven't actually gone CA.

nkeywal · on July 6, 2015

You could also shut down definitively, for example if the system is corrupted and cannot restart after a partition. There is no inconsistent history in this case, just complete unavailability (you could also claim being CP).

You can also be partly available and partly inconsistent (2PC with heuristic resolution). Here you're neither AP nor CP.

Partition intolerance (CA) is a specification. It's saying that any network partition is a serious issue for the system, and that it may break one or more invariants (e.g. atomicity in a SQL database).

jhugg · on July 7, 2015

> You could also shut down definitively, for example if the system is corrupted and cannot restart after a partition. There is no inconsistent history in this case, just complete unavailability (you could also claim being CP).

This is precisely CP.

> You can also be partly available and partly inconsistent (2PC with heuristic resolution). Here you're neither AP nor CP.

This is basically what the AP systems do, with various ways to manage inconsistency. Dynamo-style eventual consistency (EC) is but one.

> Partition intolerance (CA) is a specification. It's saying that any network partition is a serious issue for the system, and that it may break one or more invariants (e.g. atomicity in a SQL database).

If you can break C in the face of a partition, then you're not CA, are you?

CA is not meaningful. CAP is about choosing between availability and consistency in the face of partitions, which are essentially unavoidable in any non-trivial multi-node system. There are maybe some interesting things to say about latency, but I suspect there isn't much that hasn't been said.

Side note: Mike Stonebraker posited in 2010 that partitions on a network are rare. I'm not going to outright call him wrong, but for the purposes of someone building a distributed system to be run for others on a non-appliance, you're going to run into plenty of partition events on a LAN. VoltDB changed the way our product behaved in version 3.0 to aggressively kill nodes if there is any potential for split brains. We originally intended this feature for the cloud only, but too many users with their own hardware were hitting partitions in surprising configurations.
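
To make the "kill nodes if there is any potential for split brains" idea concrete, here is a minimal sketch of one common way to decide which side of a partition survives: a node stays up only if it can still see a strict majority of the configured cluster. This is an illustration, not VoltDB's actual implementation; the function name `should_stay_up` and the node names are hypothetical.

```python
from __future__ import annotations


def should_stay_up(visible_nodes: set[str], cluster_nodes: set[str]) -> bool:
    """Return True only if this node sees a strict majority of the cluster.

    Assumption: cluster membership is fixed and known to every node.
    """
    # Ignore anything we can see that is not a configured cluster member.
    reachable = visible_nodes & cluster_nodes
    # Strict majority: at most one side of any partition can satisfy this,
    # so two sides can never both keep accepting writes (no split brain).
    return 2 * len(reachable) > len(cluster_nodes)


cluster = {"n1", "n2", "n3", "n4", "n5"}
majority_side = {"n1", "n2", "n3"}  # what n1 sees, including itself
minority_side = {"n4", "n5"}        # what n4 sees, including itself

if __name__ == "__main__":
    print(should_stay_up(majority_side, cluster))
    print(should_stay_up(minority_side, cluster))
```

With this rule, an even split (say 2 and 2 in a four-node cluster) shuts down both sides. That is the CP choice taken to its conclusion: availability is given up rather than risking divergent histories. Real systems usually add a tie-breaker for that case.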

nkeywal · on July 7, 2015

> This is precisely CP.

Hmm. I would prefer to reserve 'precisely CP' for a system that shuts down one side of the partition, not one that cannot restart after a partition, even if formally you can use both (i.e. CP/AP).

> VoltDB changed the way our product behaved in version 3.0 to aggressively kill nodes if there is any potential for split brains. We originally intended this feature for the cloud only,

Quite interesting.