See Lamport's "Cheap Paxos" and "Vertical Paxos" papers for a treatment of what seems like very similar ideas. Vertical Paxos describes using a 'configuration master' (or, possibly, a larger quorum on a larger jury) to make changes to a smaller jury. The details are a bit subtle, but the PlusCal code in the paper lays things out very clearly.
Also check out Peleg and Wool, "The Availability of Quorum Systems" (it's a bit old-fashioned, but still very interesting) for a thorough discussion of the availability argument (separate from Paxos, but in the context of quorum systems in general).
I'm the author of the blog. Thanks for the posts and interest shown.
This proposal is a generalization of the core Paxos algorithm. It's therefore orthogonal to all customizations and adaptations. I believe that they can all benefit from it, including Raft.
If you read the thread on raft-dev, you'll see that Heidi Howard has independently come up with the same idea, and is working on a proof for it.
I'm not a Paxos expert either, but I see some claims in the current post that are probably not correct, or at least not totally precise:
- That 7 or more nodes exhibit degraded performance, and that 5 is the sweet spot, seems like a hand-wavy statement. I'd like to see numbers, and I believe those will relate more to implementation details than to algorithmic limitations. Sure, with more nodes, communication overhead grows. But only a majority needs to agree, so there's no need to wait for the slowest node (see the latency sketch after this list). Sure, there's a sweet spot, but I wouldn't say it's exactly 5.
- Paxos does not perform leader election. There is, indeed, no concept of a leader; there are only proposers, acceptors and learners. There's a leader concept in Multi-Paxos, but that's a different story, and it's just a performance improvement, not a requirement.
- The original Paxos paper may be harder to understand, but "Paxos Made Simple" is not. It's probably easier to understand (IMVHO) than the Raft paper.
- The post assumes, as most Paxos implementations do, that the three Paxos roles (proposer, acceptor and learner) are collapsed into a single node. They could be on different nodes. This may challenge some of the assumptions made about the "regular" Paxos algorithm.
- What the post proposes could probably be simulated with varying numbers of proposers, acceptors and learners (for instance, having fewer acceptors than learners). This should yield the same benefits as the modified algorithm... but without modifying the algorithm. That matters, since the original algorithm is proven mathematically.
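On the latency point above, a quick sketch of why the slowest node doesn't gate commits under majority quorums (the latency numbers are made up for illustration):

    # With majority quorums, a round completes once the fastest majority
    # has responded, so commit latency is an order statistic of the
    # per-node latencies, not the maximum.
    def majority_commit_latency(latencies_ms):
        quorum = len(latencies_ms) // 2 + 1
        return sorted(latencies_ms)[quorum - 1]

    # One pathologically slow node (900 ms) barely matters:
    print(majority_commit_latency([5, 7, 9, 12, 900]))          # -> 9
    print(majority_commit_latency([5, 7, 9, 12, 15, 20, 900]))  # -> 12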
However, I see a clear issue with the proposed model.
Take the author's 11-node setup: 9 nodes are required for leader election, and 3 nodes are needed for proposals.
Assume that a network partition splits the nodes, with 5 nodes on side A and 6 nodes on side B. Assume also that the leader is on side B (the argument works for side A as well, without loss of generality).
At this point, no leader re-election can occur, since neither side has the 9 nodes required. So if the leader dies (or died as part of the partition), it's obvious that no new leader can be elected.
With the partition in place:
1.) Let 1 node in side A make the proposal "X=3"
2.) Let 2 other nodes in side A agree to the proposal that "X=3"
3.) Let 1 node in side B make the proposal "X=0"
4.) Let 2 other nodes in side B agree to the proposal that "X=0"
5.) Let the partition heal
6.) The cluster is now split-brained on whether "X=0" or "X=3": according to the rule provided, the leader must accept any successful proposal.
Note that this split brain does not even require a network partition. Assume no network partition, and nodes are labeled 0-10. The leader is currently node 10.
1.) Let node 1 propose that "X=4"
2.) Let node 2 propose that "X=5"
3.) Let nodes 3 and 9 agree to the proposal that "X=4"
4.) Let nodes 6 and 8 agree to the proposal that "X=5"
5.) Split brain occurs as the leader has to accept any successful proposal, which includes "X=4" from nodes 1, 3 and 9, and also "X=5" from nodes 2, 6 and 8.
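To make the failure mode concrete, here's a quick brute-force check (a sketch; it only tests quorum intersection, not the full protocol): with 11 nodes and 3-node accept quorums, disjoint quorums exist, which is exactly what permits the two conflicting proposals above if they don't all funnel through one leader.

    # Quorum-based safety needs any two quorums to intersect. Size-3
    # quorums out of 11 nodes can be disjoint; majorities (size 6) can't.
    from itertools import combinations

    def quorums_always_intersect(n, q):
        nodes = range(n)
        return all(set(a) & set(b)
                   for a in combinations(nodes, q)
                   for b in combinations(nodes, q))

    print(quorums_always_intersect(11, 3))  # False: e.g. {1,3,9} vs {2,6,8}
    print(quorums_always_intersect(11, 6))  # True: 6 + 6 > 11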
The split-brain scenarios you mention cannot occur, since all the proposals have to go through the leader before they are agreed to by any processes.
Moreover, the leaders are lease-based in the author's proposal, so I believe that what he describes correctly solves consensus. However, his proposal doesn't really give much of a resilience improvement over the original Paxos, since the system, in the worst case, tolerates only 2 failures (leader + one more process), the same as a 5-process Paxos. Nevertheless, it's an interesting idea, and one that I don't think I've seen elsewhere.
So I'm pretty sure a similar problem could exist. If accepting a proposal only needs 3 nodes, not a majority, and network latency can be manipulated in arbitrarily bad ways, a similar split brain could occur. Basically, you could commit the same transaction twice under different proposal numbers.
Consensus is not about transactions; it's about agreeing on a value (a single value, strictly speaking, unlike atomic broadcast, which is about agreeing on a sequence of values, although people often conflate the two). But anyway, your argument is not in any way concrete - try to come up with a counterexample. Having thought about it more, I am confident that the author's system is correct (in the sense that it always preserves agreement).
Once you're planning workarounds for the "assume half your servers fail" scenario, I think it's time to admit that eventually something is going to go down that is not automatically recoverable.
The problem is that half the servers failing is indistinguishable from a switch connecting two racks being flaky, or anything else which can lead to half the servers being temporarily disconnected from the others.
Classical crash-fault consensus needs a majority of nodes alive: 2f+1 nodes tolerate f failures (the 3f+1 figure is the bound for Byzantine faults). Paxos cannot survive losing half of your servers.
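The crash-fault arithmetic, for reference (a sketch):

    # Largest number of crash failures a majority-quorum system survives.
    def max_crash_failures(n):
        return (n - 1) // 2  # a majority (f + 1 of 2f + 1) must remain

    for n in (3, 5, 7, 11):
        print(n, "nodes tolerate", max_crash_failures(n), "crash failures")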
Why would I buy two switches when I can solve the problem in general with a consensus algorithm that handles hosts going away for any reason?
Especially since a flaky switch isn't the only issue. Power loss could bring down even redundant backbone switches, but leave communication within the rack going. You could accidentally push a bad routing config and blackhole traffic to a bunch of hosts. You could send out a bad push that intermittently breaks connectivity to some hosts. And so on.
There's no theoretical reason why multipaxos can't be implemented hierarchically in order to increase throughput or expand the key space beyond what a small number of machines can handle. But Paxos itself is complicated enough to deter almost all implementers, so who would add two additional levels of complexity? I can't say I'm surprised no one has tried.
Paxos: Agree on a value.
Multipaxos: Agree on a Leader, who will coordinate a set of values.
Hierarchical Multipaxos: Agree on a Leader, who will coordinate a set of Leaders, who will each coordinate a set of values.
If you squint hard enough, this looks almost like a B-Tree.
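For reference, the base "agree on a value" layer is genuinely small; what deters implementers is everything around it. A minimal single-decree acceptor sketch (proposer side, learners and message plumbing all elided):

    # Single-decree Paxos acceptor: the invariants that every layer above
    # (Multipaxos, and a hypothetical hierarchy of them) relies on.
    class Acceptor:
        def __init__(self):
            self.promised = -1          # highest ballot promised
            self.accepted_ballot = -1   # ballot of last accepted value
            self.accepted_value = None

        def on_prepare(self, ballot):
            # Phase 1b: promise to ignore lower ballots, and report any
            # previously accepted value so the proposer must adopt it.
            if ballot > self.promised:
                self.promised = ballot
                return ("promise", self.accepted_ballot, self.accepted_value)
            return ("nack", self.promised, None)

        def on_accept(self, ballot, value):
            # Phase 2b: accept unless a higher ballot has been promised.
            if ballot >= self.promised:
                self.promised = ballot
                self.accepted_ballot = ballot
                self.accepted_value = value
                return ("accepted", ballot)
            return ("nack", self.promised)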
riak_ensemble is a hierarchical multi-paxos-shaped system. It's what Riak Strong Consistency buckets are built on, but it's also a stand-alone, consumable framework in its own right.
> I feel like many of the new distributed systems I read about are Raft based consensus.
Sure, and Raft has probably been gaining popularity even faster than Paxos has. With more and more distributed systems, and more attention on guarantees for such systems, it's possible for both Paxos and Raft (and perhaps other techniques) to be getting more popular.
I think there is more talk on HN about Raft. It ended up in some visible new open-source projects: etcd, RethinkDB. Are any major companies the size of Amazon or Google running on Raft?
As for Paxos, I think Amazon, Google (Spanner), and Microsoft(?) have all implemented Paxos-based systems that have been in production for a while. Basho has made it part of their Riak database, etc.
I mean, sure, if "popular" just means the number of times it appeared on HN recently, then yes, Raft wins hands down. But otherwise, I think Paxos still seems like the workhorse of distributed consensus.