How you could have come up with Paxos yourself (yshui.dev)
227 points by todsacerdoti on Oct 27, 2020 | 53 comments


I wish more explanations would go like this: iteratively. It's much easier for me to see where I suddenly start to struggle. The author also used language that feels more familiar to me, which helped a lot.

Unfortunately, a lot of writers just dump the information and use terms from different fields. I get why, but it makes it slower for me to understand.


I think it's interesting that Lamport chose this style for Paxos Made Simple[0], which iteratively proposes a solution to the problem (proposal 0: accept the first value the acceptor comes across), reveals the problem with that solution, and then amends the proposal to fix it.

[0] https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
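
A rough, hypothetical sketch (not from either paper) of the failure mode that first proposal runs into: if acceptors lock onto the first value they hear and never change their minds, an unlucky message ordering can leave no value with a majority, forever.

    # Hypothetical illustration: three proposers, three acceptors, "accept the
    # first value you see". One possible delivery order gives each acceptor a
    # different first value, so no value ever reaches a majority, and since
    # acceptors never change their minds, consensus is never reached.
    proposals = {"p1": "x", "p2": "y", "p3": "z"}
    acceptors = {0: "x", 1: "y", 2: "z"}   # acceptor id -> first value it heard

    def chosen(value):
        # A value is "chosen" once a majority of acceptors have accepted it.
        return sum(v == value for v in acceptors.values()) > len(acceptors) // 2

    print(any(chosen(v) for v in proposals.values()))   # False: stuck forever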


This Kerberos explanation does something similar at a high level through dialogue. Always did like it.

https://web.mit.edu/kerberos/www/dialogue.html


At least for Paxos this is more or less the standard method of explaining it, see: https://youtu.be/JEpsBg0AO6o

In Paxos's case the logic is very, very simple - it's just that the reasons for the logic are difficult to sort out. So teaching it iteratively works very well, because you show how simpler versions fail.


> So teaching it iteratively works very well, because you show how simpler versions fail.

This is, in general, a very good approach for teaching (although time consuming).


The key thing about it being iterative, I think, is that it builds up the justification for seeing that this is an effective and reasonable solution, one step at a time.


Yeah, it's sort of a microcosm of the general path of human learning: build a better model, shoot some holes through it, build a better model. I don't think it's any accident that the Socratic method has stuck with us so long.


The ability to write an explanation like that is a sign of a deep understanding of the topic. If you can write something like that, you've achieved such understanding.


I like these kinds of explanations. As Feynman said: “What I cannot create, I do not understand”.


I do envy the founding fathers who got into their disciplines on the ground floor. I feel like there is so much I do not understand precisely because I cannot create it.


Not sure how many people already know this, but Leslie Lamport himself also put out Paxos Made Simple as a 14-page explainer:

https://lamport.azurewebsites.net/pubs/paxos-simple.pdf


Inventing Paxos wasn't the hard part -- the hard part was proving it correct. There's a reason TLA+ exists and was invented by the same researcher!


Paxos shouldn't be the first protocol you learn -- I think it gives consensus this mistaken reputation for being "difficult to understand".

Consensus protocols (and the related problems of state machine replication and permissioned blockchains) have really come a long way over the years. They've become more performant and much simpler (in my opinion).

(I'll take the opportunity to plug Streamlet: it's simpler, has only two message types, "propose" and "vote", and it tolerates malicious faults. https://decentralizedthoughts.github.io/2020-05-14-streamlet...)
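
A minimal, self-contained sketch of the two rules the linked post describes (the names are mine, not the paper's): a block is notarized once at least 2/3 of the n players vote for it, and three notarized blocks with consecutive epoch numbers finalize everything up to the middle one.

    # Hypothetical illustration of Streamlet's notarization and finalization
    # rules as described in the linked post.
    def is_notarized(votes_for_block, n):
        # Notarized once >= 2/3 of the n players have voted for the block.
        return 3 * votes_for_block >= 2 * n

    def finalized_prefix(notarized_epochs):
        # Three notarized blocks with consecutive epoch numbers finalize the
        # chain up to and including the middle of the three.
        finalized = []
        for i in range(len(notarized_epochs) - 2):
            a, b, c = notarized_epochs[i:i + 3]
            if b == a + 1 and c == b + 1:
                finalized = notarized_epochs[:i + 2]
        return finalized

    print(is_notarized(7, 10))             # True: 7 of 10 votes clears 2/3
    print(finalized_prefix([1, 2, 3, 5]))  # [1, 2]: epochs 1-2-3 are consecutive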


All the supposed "simpler" consensus protocols I've seen just hide all the complexity somewhere (like raft with its leader election). Where does streamlet hide its complexity?


It hides the complexity in the first bullet point: "The protocol proceeds in synchronized epochs."

That is, we're now solving consensus for a synchronous system.

So, for example, the Paxos assumption:

>Messages are sent asynchronously and may take arbitrarily long to deliver.

no longer holds, AFAIK.

Edit: Good article, they even say this:

>Partially Synchronous Network: The network can at times be unreliable, or adversarial; the protocol must not lose consistency in this case. However, when network conditions are good, i.e., when honest players can communicate with each other within 1 second, the protocol must make progress.


No complexity is hidden here. Streamlet is (essentially) in the same setting as paxos. Just like in paxos, messages can take arbitrarily long to deliver.

The only assumption is that clocks are synchronized, which is much, much weaker. (We can actually remove even this assumption. But it's easier to think about this way.)

I.e. I think it's Monday at the same time you think it's Monday, or I think it's epoch 1 at the same time you think it's epoch 1, or my quartz crystal oscillates at the same rate as your quartz crystal. Importantly, a message sent during epoch 1 might arrive in epoch 3, or never at all.

Paxos is also in the partially synchronous setting - it requires some network synchrony in order to terminate, just like Streamlet. (For truly asynchronous protocols, there's a whole, complicated literature.)


>Paxos is also in the partially synchronous setting - it requires some network synchrony in order to terminate, just like streamlet

As far as I've understood it, Paxos is fully asynchronous, but in such systems you can guarantee either correctness or termination, never both. Paxos decided on correctness. Am I wrong about this?


Right, when run in a fully asynchronous setting, where the network adversary can always deliver messages in the order of its choice, the FLP impossibility result says that Paxos cannot be guaranteed to terminate.

Thus, for Paxos to make progress, the network needs to eventually "heal" and deliver messages in a timely manner -- that's what we mean by partial synchrony. Streamlet, likewise, tolerates arbitrary network delays (in the sense that no two players will agree on different things even if Comcast maliciously scheduled every message), but also requires partial synchrony to make progress (i.e. once in a while Comcast will deliver messages in timely fashion and in order, like you expect).

(Partial synchrony is arguably the most practical of the theoretical synchrony models for consensus. There are other models, i.e. "periods of synchrony", but they all end up capturing the same idea).

FLP, however, is a fairly weak statement. It only rules out termination for deterministic protocols. We can in fact design randomized asynchronous protocols that make progress (with good probability), regardless of the ordering in which messages are delivered. I.e. as long as Comcast eventually delivers every message -- and letting it choose delays arbitrarily -- a randomized protocol could make progress. These are the "true" asynchronous protocols. (Paxos is not one of them)

I guess the original asynchronous consensus protocol is Ben-Or's protocol (which takes 2^n time in expectation IIRC, where n is the number of players). We can reduce the running time using some (fairly sophisticated) cryptography -- the most famous is the Canetti-Rabin protocol, and Cachin and many others have worked on it since. Most of the complexity (for polynomial time protocols) is in building a "global common coin" -- i.e. a global coin that everyone agrees on the result of; once you have that we can use Mostefaoui's protocol for instance. Building this coin is incredibly complicated (much more so than Paxos) unless we assume some strong cryptographic trusted setup (i.e. a nice threshold signature scheme). How practical any of these schemes are is an open question.


(For more, Ittai Abraham has a series of nice posts: https://decentralizedthoughts.github.io/2019-06-01-2019-5-31...)


I don't think it hides any complexity, but I guess that's for the reader to decide.

Raft on the other hand... I've implemented versions of both paxos and raft, and by golly was the latter more difficult to get right.


> Raft on the other hand... I've implemented versions of both paxos and raft, and by golly was the latter more difficult to get right.

Could you clarify what you mean? "Paxos" as defined in the literature (and as presented in this article) is a single-shot consensus algorithm. Raft is a leader-based replicated state machine. They are not comparable at all and it's not surprising if you find Raft more difficult.

However, in practice, when people say "Paxos" they usually mean some variant of MultiPaxos and "Raft vs Paxos" is typically interpreted as "Raft vs MultiPaxos". It's also well-established that MultiPaxos and Raft have roughly the same complexity.


(In the context of embedding paxos in some sort of proof-of-concept replication system. Something like Multi-paxos.)

You're right, I guess we generally talk about state machine replication instead of one-shot systems.


Looks like streamlet requires synchronized time. That can be hard to achieve.

From the link: "The protocol runs sequentially in synchronized ‘epochs’ that are 2 seconds long. Every player starts in epoch 0 at the same time...."


Yeah, but synchronizing time gets easier the bigger the accuracy window is, and approaches impossible as the window shrinks. 2 seconds is well within NTP's wheelhouse on all but the most jittery and unusable networks.
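
As a rough sketch (hypothetical names, not from the Streamlet post): with a roughly synchronized wall clock, each node can just compute its current epoch locally, and a few milliseconds of NTP skew only matter right at an epoch boundary.

    import time

    EPOCH_SECONDS = 2
    PROTOCOL_START = 1_600_000_000   # agreed-upon start time (hypothetical)

    def current_epoch(now=None):
        # Which 2-second epoch are we in, according to my local clock?
        now = time.time() if now is None else now
        return int((now - PROTOCOL_START) // EPOCH_SECONDS)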


We could actually run a generic clock synchronization protocol to synchronize clocks over a partially synchronous network. I guess paxos kind of builds that in, which adds to its complexity.

Safety in particular doesn't require any timing assumption.

(Note that messages can still be delayed arbitrarily in both protocols)


I really wish I could remember whom to attribute this quote to, but I heard it from Randy Shoup - paraphrased:

Consensus protocols are either Paxos, Paxos with unnecessary additional cruft, or broken.


I think it was Mike Burrows, but I can't remember where I heard or read that.


One of the Raft introduction videos claims that there are approximately 12 people on the planet who fully understand Paxos.

This is contrasted with a Leslie Lamport quote insisting that it is so simple everyone could understand it.

I suspect this means that Mr Lamport would make a terrible professor.


I think Lamport's Paxos paper was, well, kind of a bad way to teach people about Paxos, but a good thing to have as part of Paxos teaching.

IMO, Paxos is much easier to "fully" understand than Raft. The question really becomes, define "fully." What set of performance scenarios, with member connects/disconnects and other shenanigans, do we need to consider in order to "fully" understand Paxos (or basic Multi-Paxos)?

What I really like about Paxos is that once you've grokked the general principles of how it works, it's pretty easy to re-invent the exact algorithm, or come up with your own Multi-Paxos definition, even after you've forgotten exactly what the original was.


I was at the talk where that assertion was made. It ticked me off. This sort of thing is learned helplessness, no different from people saying "they aren't very good with computers".


As someone who was taught to drive stick shift, I am well aware that there are things a person can do which may actually improve their quality of life for having done them. Within that group of things is a large subset which we shouldn't force people to have to think about. Particularly if those people are tasked with also keeping track of other important things.

Yes, there are plenty of people who can't be bothered to learn something because they are lazy, intimidated, or stupid. But it's not even most, and that presumption, of all of the egotistical self-centered bullshit we pull, ticks me off most of all.

We only have time in our lives to become experts in a handful of things (some people think that number is one, or two, which is a shame), and lots of people have other shit they'd rather be doing, possibly (hopefully!) including things that you can't be bothered to learn about. If we all worry about the same things then we don't get where we're going.


There's a difference between being able to drive stick shift, and "fully understand"-ing how to drive stick shift.

I would absolutely say that I can drive stick shift because I've proven it to examiners in multiple countries (not every country permits a license exchange, sadly), but to "fully understand" something is (well) something of a higher bar, so I would ask "how high?" to make sure I understand what it is they expect me to understand. There might be a nuance that you (or I!) hadn't considered, for example driving stick shift with an unsynchronised manual transmission, or shifting without clutching, or something else entirely, and the point of the "fully understand" qualifier might be to call attention to something like this.

For example: Say I can implement Paxos in my sleep, debug someone else's implementation, and even design and implement a new consensus algorithm based on Paxos where I satisfy the requirements of parts in slightly different, extremely domain-specific ways that may be beneficial from a performance/time/space/cost perspective. This is what someone might mean by "fully understand"-ing Paxos, but making some changes to Paxos may take time for me to convince myself it is correct (or incorrect!), or there might be parts I think are essential/important that aren't (in some situations), and so I would still be interested in learning something new. Who wouldn't?

However, I don't have good experiences with people who say "there are only X people who fully understand Y so you should use Z" -- it usually means they don't understand Y as well as they understand Z, and that's not what I want from an expert.


Nothing wrong with having different learning priorities. Just be honest about it instead of putting yourself down & normalizing learned helplessness.


How do clients typically connect to (multi-shot) Paxos clusters, and how can they be sure their request is processed exactly once if the Paxos replica they connect to fails? How does this work for 'leaderless' versions of Paxos?


How long do real-world clients wait before retrying? Is there any general guidance here?


1. Look at the latency of historical successful commits and your success_percentage. Take the latency at the success_percentage-th percentile and call it max_success_latency. This should be bounded closely from below by the longest round-trip time between nodes. If it's not, it's worth fixing.

2. Look at your external SLO and get target_latency and target_success_percentage from your thresholds.

3. Retry on failure. Retry on timeout as late as target_latency - max_success_latency and optimistically as early as max_success_latency (see the sketch after this list). The wiggle room gives you a helpful idea of how close you are to breaking SLO. The earlier you retry, the more likely you are to overload the backend if it slows down due to load. The later you retry, the more likely you are to break SLO under load. Use a rate-limiting back-off strategy in clients to avoid overloading the backend completely. Probabilistic rate limiting to the observed success rate (plus a little) on each client works pretty well.

4. Provision your Raft/Paxos for (1 + (1 - success_ratio))^max_retries times the maximum expected traffic to account for the load from retries.

Note that if (max_success_latency * 2) > target_latency AND success_percentage < target_success_percentage then you will need optimistic retries which can put quite a lot of load on the backend and even that may not keep you within SLO; it mostly depends on whether failures/timeouts are independent or data-dependent.
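
A rough sketch of how those numbers could fit together in a client, assuming send() is a placeholder for your commit RPC and the latency/SLO values are the ones measured above; none of this is from a real library.

    import random
    import time

    # Hypothetical client-side retry policy following the guidance above.
    def commit_with_retries(send, request, max_success_latency, target_latency,
                            observed_success_rate, max_retries=3):
        deadline = time.monotonic() + target_latency
        # Retry somewhere between the optimistic point (max_success_latency) and
        # the latest point that still leaves room to finish within the SLO.
        latest = max(max_success_latency, target_latency - max_success_latency)
        attempt_timeout = random.uniform(max_success_latency, latest)
        for attempt in range(max_retries + 1):
            # Probabilistic rate limiting: retry roughly as often as the backend
            # has recently been succeeding, plus a little slack.
            if attempt > 0 and random.random() > min(1.0, observed_success_rate + 0.1):
                break
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            if send(request, timeout=min(attempt_timeout, remaining)):
                return True
        return False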


Fantastic answer. Thank you


I'm sorry to be difficult, but my late step-father would respond to questions like this with "how long is a piece of string?"

There is no useful direct answer to a question like this.

Having said that, what are your normal communication latencies between nodes, P50/95/99? What are those numbers when the system is under heavy load?

You can model the impact that different retry intervals will have on your system.


I started with JINI back in the day and added dice-throwing and split-brain detection based on the higher-level primitives that JINI provided. I had no clue what I was doing and it usually ended up not working ;)


Something I've wondered about Paxos: does the voting not rely on knowing how many voters there are, to see if there is a majority? That makes it harder to survive a network partition, right?


Cluster membership is specified before initialization. It's possible to change membership, but this requires achieving consensus on the membership set by running Paxos.

As to your question, it really depends what you mean by "survive". Paxos will not make progress unless it can achieve quorum, usually set as a majority of nodes. However, it won't corrupt data either (i.e. it won't let multiple different values be chosen during consensus). It is thus a good fit for non-adversarial environments where you own and control all the nodes in the cluster and want to gracefully handle common faults (nodes losing power or crashing, network messages getting lost).
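
A tiny illustration of that quorum rule (hypothetical helper names): progress needs a majority of the fixed membership, so a partition that can only reach a minority stalls rather than diverging.

    # Hypothetical illustration of the majority-quorum rule described above.
    def quorum_size(cluster_size):
        return cluster_size // 2 + 1

    def can_make_progress(reachable_nodes, cluster_size):
        return reachable_nodes >= quorum_size(cluster_size)

    # Example: 5 nodes split 3/2 across two datacenters that lose connectivity.
    assert can_make_progress(3, 5)       # the majority side keeps committing
    assert not can_make_progress(2, 5)   # the minority side stalls, but stays safe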


My question was more about hard splits: what to do when you have 2 DCs and they lose connectivity between them?

The larger pool can still have a majority, but the smaller pool cannot. Does the smaller pool change membership (because it can't see the larger pool) and continue on its own? What if the smaller pool has a cold start and has no idea about the larger pool?

My point is that there needs to be some consensus on pool size to decide what constitutes a majority vote.

Implementations can make reasonable assumptions and rules for this, but afaict it's not covered by the protocol itself.


If you lose connectivity then the non-majority pool will just continually try to reconnect to the larger pool, which will happily continue on as it has a majority. The non-majority pool will not reconfigure membership down to just itself: membership changes require consensus among the original cluster participants, which the non-majority pool cannot achieve. If connectivity is re-established, the protocol gracefully handles bringing the nodes up to speed on any transactions they missed. If the majority pool changes membership to kick out the non-majority nodes, they'll just be ignored when they reconnect.


A 2-datacenter configuration doesn't give you much if you want to survive a temporary outage of said datacenters. Observe that if you only wanted to survive the loss of the DC with the minority of nodes, you could just place all nodes in the other DC. This is one of the reasons AWS tries to have three availability zones per region. In practice things are messier than this because DCs don't always fail as a unit.


Related, here is the documentation for Raft (another distributed consensus protocol): https://raft.github.io/. The interactive visualization helps with understanding how it works.


Not to call you out specifically, but there's a weird Pavlovian response on this site where people have to say the word "raft" when they see the word "paxos". It happens in every comment section without fail. The consensus world has moved on from the Paxos-vs-Raft battle; the user learning study happened in 2013. Seven years ago! It's time to update our comment sections to reflect that. Here's an excerpt from a 2015 parody blog post:

What’s etcd?

-It’s an implementation of RAFT.

OK, so what’s Raft?

-It’s like Paxos.

Christ, how deep down this fucking rabbit hole are we going? I just want to launch an app. Sigh. Fuck, OK, deep breaths. Jesus. OK, what’s Paxos?

-Paxos is like this really old distributed consensus protocol from the 70s that no-one understands or uses.

Great, thanks for telling me about it then. And Raft is what?

-Since no-one understands Paxos, this guy Diego…

Oh, you know him?

-No, he works at CoreOS. Anyway, Diego built Raft for his PhD thesis cause Paxos was too hard. Wicked smart dude. And then he wrote etcd as an implementation, and Aphyr said it wasn’t shit.

https://circleci.com/blog/its-the-future/

Anyway, Paxos was published in 1998.


I posted the link as another example for understanding a consensus implementation due to its interactive visualization. I was not intending on starting a consensus war, apologies if it came across that way.


In Raft, what would happen in the case of the following:

1. The master has sent a request to followers to update their state

2. Half of the followers successfully update their state and the other half fail for some reason

3. Now the master crashes and a new master gets picked from the half that never updated their state

Wouldn’t we have inconsistent state now?


A commit has to be replicated to a majority of the cluster, and a leader is elected by approval of a majority of the cluster.

More than that, each of the nodes that approved the commit would reject a candidate whose log is inconsistent with their own history. Therefore a node with an inconsistent log could not be elected as leader.

In the Raft paper this is shown in section 5.4.3 (the safety argument).


If half the nodes fail, the cluster has failed. Raft and Paxos need at least a majority of nodes. It's actually not beneficial (except for load balancing) to run an even number of nodes.


> 3. Now, master crashes and new master gets picked from the other half which never updated their states

> Wouldn’t we have inconsistent state now?

No. The followers that received the entry will all have a more up-to-date log (Raft compares the last entries' terms and indexes), so one of them will become the master, because nodes don't vote for a candidate that knows less than they do (and neither does anyone else!)
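
A minimal sketch of the vote-granting rule being described (hypothetical names; the rule itself is section 5.4.1 of the Raft paper): a voter only grants its vote to a candidate whose log is at least as up-to-date as its own, comparing the last entries' terms first and log lengths second.

    # Hypothetical sketch of Raft's "at least as up-to-date" check from RequestVote.
    def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                    my_last_term, my_last_index):
        if cand_last_term != my_last_term:
            return cand_last_term > my_last_term
        return cand_last_index >= my_last_index

    # A follower that appended the new entry (index 6) refuses to vote for a
    # candidate that never received it (index 5), so a stale node can't win.
    assert not candidate_log_is_up_to_date(1, 5, 1, 6)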


Why doesn't each client pick a random number and then send it to each other client? When you receive a message add it to your total. Once everyone gets all the messages they all have the same number.


Not sure exactly what you're looking to solve with that proposal, but I can see one obvious question - how do you know whether you've heard from a node or not? Until I hear from it I don't know its number. With a quorum-based election, I only need to hear from n/2+1 nodes. I don't have to know anything about the nodes themselves, only the number needed to form a quorum. With random numbers, I'd need to hear from every member to know whether I've reached consensus or not. That, or I'd need a perfectly partitioned network (i.e., A and B can talk, but neither can talk to C. If A can talk to B, and B can talk to C, but C cannot talk to A, then, at least when trying to make a decision, no decision can be reached).
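
A small sketch of the objection (hypothetical values): with the sum-of-random-numbers scheme, a node can only compute the agreed value after hearing from every other node, so a single lost message blocks the decision, whereas a quorum scheme only needs n/2+1 responses.

    # Hypothetical illustration: summing random contributions requires hearing
    # from everyone; one missing message and no node can decide.
    numbers = {"a": 17, "b": 4, "c": 23}        # each node's random pick

    def decide(received_from):
        if set(received_from) != set(numbers):  # must hear from *every* node
            return None
        return sum(numbers[n] for n in received_from)

    print(decide({"a", "b", "c"}))   # 44
    print(decide({"a", "b"}))        # None: node c's message never arrived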



