This is missing what to me is the most important part of the algorithm: a quorum of acceptors must propagate writes to the learners. With just what's shown here, you're not tolerant to network partitions that cause a subset of the "accept" messages to be lost.
That process can of course be optimized in a number of ways that drastically cut down on the network overhead as compared to the naive MxN write pattern, but what's written here is not safe on its own.
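To make that concrete, here's a minimal sketch of the missing piece (my own names and message shapes, not the article's): each acceptor broadcasts an "accepted" notification to the learners, and a learner only treats a value as chosen once a majority of acceptors have reported accepting the same proposal.

    package main

    import "fmt"

    // Accepted is the message an acceptor would send to every learner after
    // it accepts proposal n with value v (hypothetical message shape).
    type Accepted struct {
        AcceptorID int
        N          int    // proposal number
        V          string // accepted value
    }

    // Learner counts acceptances and treats a value as chosen only once a
    // majority of acceptors have accepted the same proposal number.
    type Learner struct {
        numAcceptors int
        accepted     map[int]map[int]string // proposal n -> acceptorID -> value
    }

    func NewLearner(numAcceptors int) *Learner {
        return &Learner{numAcceptors: numAcceptors, accepted: map[int]map[int]string{}}
    }

    // OnAccepted records one Accepted message and reports whether the value
    // is now known to be chosen.
    func (l *Learner) OnAccepted(m Accepted) (string, bool) {
        if l.accepted[m.N] == nil {
            l.accepted[m.N] = map[int]string{}
        }
        l.accepted[m.N][m.AcceptorID] = m.V
        if len(l.accepted[m.N]) > l.numAcceptors/2 { // majority quorum reached
            return m.V, true
        }
        return "", false
    }

    func main() {
        l := NewLearner(3)
        l.OnAccepted(Accepted{AcceptorID: 1, N: 7, V: "x"})
        v, decided := l.OnAccepted(Accepted{AcceptorID: 2, N: 7, V: "x"})
        fmt.Println(v, decided) // "x true": 2 of 3 acceptors accepted proposal 7
    }

Losing some "accepted" messages then just delays the learner; it never lets two learners decide different values.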
Mostly unrelated, but a fun fact about quorums that I enjoy noting whenever I can because it still seems under-explored: a quorum != a majority. Currently most (all?) production implementations I've seen of Raft and the various Paxos variants use "majority" as the quorum algorithm, so the two mostly get conflated.
In my layman understanding: given a set, a quorum is a subset chosen by some rule such that any two such subsets will always have at least one member in common.
Majority is one quorum algorithm - given a set [A,B,C], the majorities are: [A,B,C], [A,B], [A,C] and [B,C]. Any two of those sets will have at least one member overlapping.
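If it helps to see the property in code, here's a toy check (purely illustrative) that any two of those majority quorums intersect:

    package main

    import "fmt"

    // intersects reports whether two quorums share at least one member.
    func intersects(a, b []string) bool {
        seen := map[string]bool{}
        for _, m := range a {
            seen[m] = true
        }
        for _, m := range b {
            if seen[m] {
                return true
            }
        }
        return false
    }

    func main() {
        // The majority quorums of the set [A, B, C].
        quorums := [][]string{{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}}
        for _, q1 := range quorums {
            for _, q2 := range quorums {
                fmt.Println(q1, q2, intersects(q1, q2)) // prints true for every pair
            }
        }
    }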
However, majority is somewhat wasteful, because the latency of these quorum-based algorithms is almost always bounded by the slowest member of the quorum - the more machines you need to wait for, the more likely it is that one of them will be outlier-slow.
You'd potentially be better off choosing a quorum algorithm that requires less than a majority - because that'd mean, in the best case, fewer responses to wait for, lowering the probability that one of those members will be very slow. There are drawbacks to this - it makes fault tolerance and provisioning harder to calculate - but it's got some cool potential benefits.
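For a contrived illustration that smaller-than-majority quorum systems exist (this says nothing about whether latency actually improves in practice, just that the overlap property doesn't require a majority): take five nodes and let every quorum be a two-element set containing one fixed node, A.

    package main

    import "fmt"

    func main() {
        nodes := []string{"A", "B", "C", "D", "E"}
        // A contrived non-majority quorum system: every quorum is {A, x}.
        // Any two quorums overlap at A, yet each has size 2 < 3 (majority of 5).
        var quorums [][]string
        for _, n := range nodes[1:] {
            quorums = append(quorums, []string{"A", n})
        }
        fmt.Println(quorums) // [[A B] [A C] [A D] [A E]]
        // The drawback: if A is down or slow, no quorum can be formed at all.
    }

That last comment is exactly the "fault tolerance is harder to calculate" trade-off: you wait for fewer responses, but one node becomes critical.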
I would argue that by the time you've chosen Paxos or some other majority quorum commit protocol, you're already well aware that you're building a CP system, and that availability and latency aren't your main concern. A majority quorum is basically the most obvious (and somewhat brute force) way of providing serializable consistency in the system.
The one non-majority quorum commit protocol that most people are probably already familiar with is the "sloppy quorum" replication in Dynamo systems[1] (e.g. Cassandra, Riak, Voldemort, etc.). Basically, since the quorum is configurable on a per-cluster basis instead of being inherent to the protocol, and usually isn't a majority of the cluster, the system can still make progress when half of the nodes are unreachable. (But of course, as the paper notes, this means that you need to resolve conflicts some other way, which adds a whole bunch of complexity.)
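A rough sketch of the configurable-quorum idea in generic form (not any particular system's API): with N replicas, a read quorum of size R is only guaranteed to overlap a write quorum of size W when R + W > N. Dynamo-style sloppy quorums additionally let writes land on fallback nodes during failures, so even that guarantee is relaxed and conflicts have to be reconciled afterwards.

    package main

    import "fmt"

    // strictOverlap reports whether every read quorum of size r is guaranteed
    // to intersect every write quorum of size w among n replicas.
    func strictOverlap(n, r, w int) bool {
        return r+w > n
    }

    func main() {
        fmt.Println(strictOverlap(3, 2, 2)) // true: a read always touches a replica
        //                                     that saw the latest committed write
        fmt.Println(strictOverlap(3, 1, 1)) // false: faster and more available, but
        //                                     conflicts must be resolved some other way
    }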
> you're already well aware that you're building a CP system, and that availability and latency aren't your main concern
Assuming you've chosen correctly between CP and AP approaches, this tells us that availability and latency aren't as important as consistency. But there's nothing that says they aren't arbitrarily close...
Yeah, definitely -- I agree that the decision doesn't mean you should just blindly throw away availability optimizations once you've decided that consistency is important.
Actually, invoking CAP probably didn't add to my message. What I meant to say is that people don't talk about non-majority quorum commits that much because the interesting part is that the serializability comes with majority/overlapping quorums.
As I read it, the comment you were replying to was still restricting its discussion to overlapping quorums, and merely pointing out that that's not actually synonymous with majority.
Oh man, that is a really cool paper, thanks a bunch for sharing that. I've got next week off and lots of itch to try Go for network code.. might try this out!
That's pretty interesting. Is there a concrete example of a quorum definition where the probability of an outlier is improved vs. a majority quorum? I'm struggling to come up with one. I've always assumed majority is optimal, since you can tolerate outliers in (n-1)/2 voters without seeing an outlier for the commit overall. E.g. for 3-node Raft, you'd need both followers to have an outlier before a client notices a slow commit.
This simplified algorithm doesn't distinguish between learners (readers) and proposers (writers) to the value. I'd say this conveys the core ideas of Paxos, and it makes sense to treat learners as a (performance-critical) extension/optimization, just like the many optimized Paxos variants in the literature. Another benefit of treating read/write as a single operation is that it serializes reads and writes (e.g. in a distributed log).
I realize this is pseudocode, but I still feel like the bigger challenge is not implementing a theoretically correct Paxos, but a production-ready one. The Chubby[1] team's experiences dealing with unexpected complexity from using Paxos in production are probably pretty well known.
A choice quote: "While Paxos can be described with a page of pseudo-code, our complete implementation contains several thousand lines of C++ code."
When I hear about these algorithms taking many thousands of lines of code in a "low-level" language like C or C++, I wonder how much of that could be simplified away if you didn't need to manually manage memory. Performance aside, how much of those "several thousand lines" would be unnecessary in a higher-level language?
I implemented Raft in a couple hundred lines of succinct JavaScript a few years ago. I can only imagine someone smarter than me could write a production-ready Paxos implementation in less than a thousand well-commented lines of JavaScript or Python.
> I implemented Raft in a couple hundred lines of succinct JavaScript
But is it production-ready? :)
None of the extra complications described in the paper were inherent to C/C++. It covered things like leader leases, log compaction, handling disk corruption, and group membership changes -- optimizations that weren't intrinsic to Paxos itself, but still crucial for running it in production.
Another choice quote from the paper: "There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions."
Also, a random data point: etcd's Raft implementation stands at about 4000 lines of Go right now, not including tests.
I also didn't mention Go in my post because--despite having managed memory--it's syntactically very long. Not a complaint, but all of the Go code I've seen and written tends to be "taller and skinnier" (less dense?) than the code I've seen and written in other languages like Scala or Python.
The paper linked anticipates your question in the sentences after grandparent's quote:
> The blow-up is not due simply to the fact that we used C++ instead of pseudo notation, nor because our code style may have been verbose. Converting the algorithm into a practical, production-ready system involved implementing many features and optimizations – some published in the literature and some not.
    1  proposer(v):
    2    while not decided:
    3      choose n, unique and higher than any n seen so far
    ...

26 lines.
It's pseudocode, so it's not really only 26 lines: it needs some more supporting functions to "choose n, unique and..." and other machinery to make the state updates atomic.
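For example, one common way to implement "choose n, unique and higher than any n seen so far" (my own sketch, not the course's solution) is to make the proposal number a (round, serverID) pair: proposers can never collide, and any proposer can always generate something larger than whatever it has seen.

    package main

    import "fmt"

    // ProposalNum is a proposal number built from a locally incremented round
    // plus the proposer's ID as a tie-breaker, so numbers are totally ordered
    // and never collide across proposers.
    type ProposalNum struct {
        Round    int
        ServerID int
    }

    // GreaterThan compares proposal numbers lexicographically.
    func (a ProposalNum) GreaterThan(b ProposalNum) bool {
        if a.Round != b.Round {
            return a.Round > b.Round
        }
        return a.ServerID > b.ServerID
    }

    // NextProposal returns a number strictly higher than the highest seen so far.
    func NextProposal(highestSeen ProposalNum, myID int) ProposalNum {
        return ProposalNum{Round: highestSeen.Round + 1, ServerID: myID}
    }

    func main() {
        seen := ProposalNum{Round: 4, ServerID: 2}
        n := NextProposal(seen, 1)
        fmt.Println(n, n.GreaterThan(seen)) // {5 1} true
    }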
Number of lines is the most ridiculous metric anyway. Most languages have no line-length limit; just replace all newlines with semicolons and you have a one-line program!
The ANSI C standard does have a line-length limit, though: compilers are only required to handle source lines up to the minimum length given in the standard, so there's no guarantee they'll function properly with longer ones.
For me this shows the difference between a theoretical setting and what you would want to do in practice. I have been following 6.824 (where this is sourced from) to learn something about distributed systems programming, and it was great fun to shed a lot of figurative sweat converting those 26 lines (yes, actually 26) into working "production" code - hundreds of lines of code, because in real life we have packet loss, network partitions, etc. The pseudo-code in the link itself is correct, but it doesn't tell the whole story.
Finally - I wholeheartedly recommend the 6.824 course to anyone interested in distributed systems. Even if you don't like strong consistency, you'll learn a lot about testing and debugging distributed systems - knowledge you can reuse later in your career.
This is what I mean when I tend to say that all scientific papers should have a minimal reproducible working sample with instructions attached.
Let's say I'm interested in dam building with turbines and all its glory: one would assume this is really complex, cross-cutting tech, but I still firmly believe that if you can't show me how to build a tiny sample dam that powers my mobile phone or my computer, you haven't done your part to make your theory sufficiently reproducible.
Very succinctly put, regardless of whether it is a function call or a built-in statement :)
I have used something similar to defuse endless arguments about which language is more expressive, or better, and turn it into a more productive discourse.
I simply make a tentative assertion that there is a perfect language for every problem, one where only one line of code is needed to solve the problem. It reads as follows: doit
Then I follow up with stating that the language is probably rather useless for anything else.
I don't know why it usually works to open up the discussion; it seems to me such a trivial and obvious observation, but apparently it's a perspective many people rarely come to without prompting.
I'm well aware that 'doit' can't really be considered a language, except in a very limited sense; it could also simply be a function call. Maybe that helps bring into focus the intersection between languages, libraries, and their relative applicability to the task that needs solving and the environment it must be solved in.
Trivial and obvious, but somehow deeply at the heart of writing the correct code to solve a particular problem, because everything is a tradeoff somewhere between extremes.