This is missing what to me is the most important part of the algorithm: a quorum of acceptors must propagate writes to the learners. With just what's shown here, you're not tolerant to network partitions that cause a subset of the "accept" messages to be lost.
That process can of course be optimized in a number of ways that drastically cut down on the network overhead as compared to the naive MxN write pattern, but what's written here is not safe on its own.
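To make that concrete, here's a minimal sketch of the missing piece (my own names and message shapes, not the article's): each acceptor broadcasts an "accepted" notification to the learners, and a learner only treats a value as chosen once a majority of acceptors have reported accepting the same proposal.

    package main

    import "fmt"

    // Accepted is the message an acceptor would send to every learner after
    // it accepts proposal n with value v (hypothetical message shape).
    type Accepted struct {
        AcceptorID int
        N          int    // proposal number
        V          string // accepted value
    }

    // Learner counts acceptances and treats a value as chosen only once a
    // majority of acceptors have accepted the same proposal number.
    type Learner struct {
        numAcceptors int
        accepted     map[int]map[int]string // proposal n -> acceptorID -> value
    }

    func NewLearner(numAcceptors int) *Learner {
        return &Learner{numAcceptors: numAcceptors, accepted: map[int]map[int]string{}}
    }

    // OnAccepted records one Accepted message and reports whether the value
    // is now known to be chosen.
    func (l *Learner) OnAccepted(m Accepted) (string, bool) {
        if l.accepted[m.N] == nil {
            l.accepted[m.N] = map[int]string{}
        }
        l.accepted[m.N][m.AcceptorID] = m.V
        if len(l.accepted[m.N]) > l.numAcceptors/2 { // majority quorum reached
            return m.V, true
        }
        return "", false
    }

    func main() {
        l := NewLearner(3)
        l.OnAccepted(Accepted{AcceptorID: 1, N: 7, V: "x"})
        v, decided := l.OnAccepted(Accepted{AcceptorID: 2, N: 7, V: "x"})
        fmt.Println(v, decided) // "x true": 2 of 3 acceptors accepted proposal 7
    }

Losing some "accepted" messages then just delays the learner; it never lets two learners decide different values.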
Mostly unrelated, but a fun fact about quorums that I enjoy noting whenever I can because it still seems under-explored: a quorum != a majority. Currently most (all?) production implementations I've seen of Raft and the various Paxos variants use "majority" as the quorum algorithm, so the two mostly get conflated.
In my layman understanding: given a set, a quorum is a subset chosen by some rule such that any two such subsets will always have at least one member in common.
Majority is one quorum algorithm - given a set [A,B,C], the majorities are: [A,B,C], [A,B], [A,C] and [B,C]. Any two of those sets will have at least one member overlapping.
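If it helps to see the property in code, here's a toy check (purely illustrative) that any two of those majority quorums intersect:

    package main

    import "fmt"

    // intersects reports whether two quorums share at least one member.
    func intersects(a, b []string) bool {
        seen := map[string]bool{}
        for _, m := range a {
            seen[m] = true
        }
        for _, m := range b {
            if seen[m] {
                return true
            }
        }
        return false
    }

    func main() {
        // The majority quorums of the set [A, B, C].
        quorums := [][]string{{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}}
        for _, q1 := range quorums {
            for _, q2 := range quorums {
                fmt.Println(q1, q2, intersects(q1, q2)) // prints true for every pair
            }
        }
    }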
However, majority is somewhat wasteful, because the latency of these quorum-based algorithms is almost always bounded by the slowest member of the quorum - the more machines you need to wait for, the more likely it is that one of them will be outlier-slow.
You'd potentially be better off choosing a quorum algorithm that requires less than a majority - because that'd mean, in the best case, fewer responses to wait for, lowering the probability that one of those members will be very slow. There are drawbacks to this - it makes fault tolerance and provisioning harder to calculate - but it's got some cool potential benefits.
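For a contrived illustration that smaller-than-majority quorum systems exist (this says nothing about whether latency actually improves in practice, just that the overlap property doesn't require a majority): take five nodes and let every quorum be a two-element set containing one fixed node, A.

    package main

    import "fmt"

    func main() {
        nodes := []string{"A", "B", "C", "D", "E"}
        // A contrived non-majority quorum system: every quorum is {A, x}.
        // Any two quorums overlap at A, yet each has size 2 < 3 (majority of 5).
        var quorums [][]string
        for _, n := range nodes[1:] {
            quorums = append(quorums, []string{"A", n})
        }
        fmt.Println(quorums) // [[A B] [A C] [A D] [A E]]
        // The drawback: if A is down or slow, no quorum can be formed at all.
    }

That last comment is exactly the "fault tolerance is harder to calculate" trade-off: you wait for fewer responses, but one node becomes critical.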
I would argue that by the time you've chosen Paxos or some other majority quorum commit protocol, you're already well aware that you're building a CP system, and that availability and latency aren't your main concern. A majority quorum is basically the most obvious (and somewhat brute force) way of providing serializable consistency in the system.
The one non-majority quorum commit protocol that most people are probably already familiar with is the "sloppy quorum" replication in Dynamo systems[1] (e.g. Cassandra, Riak, Voldemort, etc.). Basically, since the quorum is configurable on a per-cluster basis instead of being inherent to the protocol, and usually isn't a majority of the cluster, the system can still make progress when half of the nodes are unreachable. (But of course, as the paper notes, this means that you need to resolve conflicts some other way, which adds a whole bunch of complexity.)
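A rough sketch of the configurable-quorum idea in generic form (not any particular system's API): with N replicas, a read quorum of size R is only guaranteed to overlap a write quorum of size W when R + W > N. Dynamo-style sloppy quorums additionally let writes land on fallback nodes during failures, so even that guarantee is relaxed and conflicts have to be reconciled afterwards.

    package main

    import "fmt"

    // strictOverlap reports whether every read quorum of size r is guaranteed
    // to intersect every write quorum of size w among n replicas.
    func strictOverlap(n, r, w int) bool {
        return r+w > n
    }

    func main() {
        fmt.Println(strictOverlap(3, 2, 2)) // true: a read always touches a replica
        //                                     that saw the latest committed write
        fmt.Println(strictOverlap(3, 1, 1)) // false: faster and more available, but
        //                                     conflicts must be resolved some other way
    }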
> you're already well aware that you're building a CP system, and that availability and latency aren't your main concern
Assuming you've chosen correctly between CP and AP approaches, this tells us that availability and latency aren't as important as consistency. But there's nothing that says they aren't arbitrarily close...
Yeah, definitely -- I agree that the decision doesn't mean you should just blindly throw away availability optimizations once you've decided that consistency is important.
Actually, invoking CAP probably didn't add to my message. What I meant to say is that people don't talk about non-majority quorum commits that much because the interesting part is that the serializability comes with majority/overlapping quorums.
As I read it, the comment you were replying to was still restricting its discussion to overlapping quorums, and merely pointing out that that's not actually synonymous with majority.
Oh man, that is a really cool paper, thanks a bunch for sharing that. I've got next week off and lots of itch to try Go for network code.. might try this out!
That's pretty interesting. Is there a concrete example of a quorum definition where the probability of an outlier is improved vs. a majority quorum? I'm struggling to come up with one. I've always assumed majority is optimal, since you can tolerate outliers in (n-1)/2 voters without seeing an outlier for the commit overall. E.g. for 3-node Raft, you'd need both followers to have an outlier before a client notices a slow commit.
This simplified algorithm doesn't distinguish between learners (readers) and proposers (writers) to the value. I'd say this conveys the core ideas of Paxos, and it makes sense to treat learners as a (performance-critical) extension/optimization, just like the many optimized Paxos variants in the literature. Another benefit of treating read/write as a single operation is that it serializes reads and writes (e.g. in a distributed log).
I realize this is pseudocode, but I still feel like the bigger challenge is not implementing a theoretically correct Paxos, but a production-ready one. The Chubby[1] team's experiences dealing with unexpected complexity from using Paxos in production are probably pretty well known.
A choice quote: "While Paxos can be described with a page of pseudo-code, our complete implementation contains several thousand lines of C++ code."
When I hear about these algorithms taking many thousands of lines of code in a "low-level" language like C or C++, I wonder how much of that could be simplified away if you didn't need to manually manage memory. Performance aside, how much of those "several thousand lines" would be unnecessary in a higher-level language?
I implemented Raft in a couple hundred lines of succinct JavaScript a few years ago. I can only imagine someone smarter than me could write a production-ready Paxos implementation in less than a thousand well-commented lines of JavaScript or Python.
> I implemented Raft in a couple hundred lines of succinct JavaScript
But is it production-ready? :)
None of the extra complications described in the paper were inherent to C/C++. It covered things like leader leases, log compaction, handling disk corruption, and group membership changes -- optimizations that weren't intrinsic to Paxos itself, but still crucial for running it in production.
Another choice quote from the paper: "There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions."
Also, a random data point: etcd's Raft implementation stands at about 4000 lines of Go right now, not including tests.
I also didn't mention Go in my post because--despite having managed memory--it's syntactically very long. Not a complaint, but all of the Go code I've seen and written tends to be "taller and skinnier" (less dense?) than the code I've seen and written in other languages like Scala or Python.
The paper linked anticipates your question in the sentences after grandparent's quote:
> The blow-up is not due simply to the fact that we used C++ instead of pseudo notation, nor because our code style may have been verbose. Converting the algorithm into a practical, production-ready system involved implementing many features and optimizations – some published in the literature and some not.
    1  proposer(v):
    2    while not decided:
    3      choose n, unique and higher than any n seen so far
    ...

26 lines.
It's pseudocode, so it's not really only 26 lines: it needs some more supporting functions to "choose n, unique and..." and other machinery to make the state updates atomic.
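For example, one common way to implement "choose n, unique and higher than any n seen so far" (my own sketch, not the course's solution) is to make the proposal number a (round, serverID) pair: proposers can never collide, and any proposer can always generate something larger than whatever it has seen.

    package main

    import "fmt"

    // ProposalNum is a proposal number built from a locally incremented round
    // plus the proposer's ID as a tie-breaker, so numbers are totally ordered
    // and never collide across proposers.
    type ProposalNum struct {
        Round    int
        ServerID int
    }

    // GreaterThan compares proposal numbers lexicographically.
    func (a ProposalNum) GreaterThan(b ProposalNum) bool {
        if a.Round != b.Round {
            return a.Round > b.Round
        }
        return a.ServerID > b.ServerID
    }

    // NextProposal returns a number strictly higher than the highest seen so far.
    func NextProposal(highestSeen ProposalNum, myID int) ProposalNum {
        return ProposalNum{Round: highestSeen.Round + 1, ServerID: myID}
    }

    func main() {
        seen := ProposalNum{Round: 4, ServerID: 2}
        n := NextProposal(seen, 1)
        fmt.Println(n, n.GreaterThan(seen)) // {5 1} true
    }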
Number of lines is the most ridiculous metric anyway. Most languages have no line-length limit; just replace all newlines with semicolons and you have a one-line program!
The ANSI C standard does have a line-length limit, though: compilers are only required to handle source lines up to the minimum length given in the standard, so there's no guarantee they'll function properly with longer ones.
For me this shows the difference between a theoretical setting and what you would want to do in practice. I have been following 6.824 (where this is sourced from) to learn something about distributed systems programming, and it was great fun to shed a lot of figurative sweat converting those 26 lines (yes, actually 26) into working "production" code - hundreds of lines of code, because in real life we have packet loss, network partitions, etc. The pseudo-code in the link itself is correct, but it doesn't tell the whole story.
Finally - I wholeheartedly recommend the 6.824 course to anyone interested in distributed systems. Even if you don't like strong consistency, you'll learn a lot about testing and debugging distributed systems - knowledge you can reuse later in your career.
This is what I mean when I tend to say that all scientific papers should have a minimal reproducible working sample with instructions attached.
Let's say I'm interested in dam building with turbines and all its glory: one would assume this is really complex, cross-cutting tech, but I still firmly believe that if you can't show me how to build a tiny sample dam that powers my mobile phone or my computer, you haven't done your part to make your theory sufficiently reproducible.
Very succinctly put, regardless of whether it is a function call or a built-in statement :)
I have used something similar to defuse endless arguments about which language is more expressive, or better, and turn it into a more productive discourse.
I simply make a tentative assertion that there is a perfect language for every problem, one where only one line of code is needed to solve the problem. It reads as follows: doit
Then I follow up with stating that the language is probably rather useless for anything else.
I don't know why it usually works to open up the discussion; it seems to me such a trivial and obvious observation, but apparently it's a perspective many people rarely come to without prompting.
I'm well aware that 'doit' can't really be considered a language, except in a very limited sense; it could also simply be a function call. Maybe that helps bring into focus the intersection between languages, libraries, and their relative applicability to the task that needs solving and the environment it must be solved in.
Trivial and obvious, but somehow deeply at the heart of writing the correct code to solve a particular problem, because everything is a tradeoff somewhere between extremes.