Could you go into greater detail about how the only known-good implementation of...

felixgallo · on Oct 5, 2015

First, a correction -- the only known-good implementation of Raft is actually in Coq (https://github.com/uwplse/verdi), which predates Idris (which intends to be an easier-to-use Coq).

So the problem with distributed consensus algorithms is that they are hard to understand. It didn't help that Lamport wrote his original Paxos paper playfully, using a complex and unfamiliar metaphor. As a result, many implementations of the relevant algorithms miss subtle edge cases. Even famous ones that many large companies rely on have either had serious bugs or have been misunderstood and misused by downstream applications.

There are a couple of ways to try to fix this. The common way is to write a bunch of unit tests. This doesn't work. Unit tests check only the cases you thought to write down, and you will probably not think of all of the edge cases.

The next most common way is to use something like QuickCheck, which automatically generates millions of random cases and spends as long as you want (hours, days, weeks) hammering your code. This is much better, but still nondeterministic: it samples the input space rather than covering it, so a clean run is evidence, not proof.
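To make the difference from hand-written unit tests concrete, here is a toy sketch in plain Python. It is not QuickCheck and not taken from any real consensus implementation; `has_quorum`, `quorums_overlap`, and `find_counterexample` are hypothetical names made up for illustration, and the bug is planted on purpose. The property being checked is that any two quorums in a cluster must overlap.

```python
import random

def has_quorum(votes, cluster_size):
    # Deliberately buggy (hypothetical example): correct for odd cluster
    # sizes, but accepts exactly half the votes on even-sized clusters.
    return votes >= (cluster_size + 1) // 2

def quorums_overlap(a, b, cluster_size):
    # Safety property: two quorums must share at least one node, which
    # requires their sizes to sum to more than the cluster size.
    if has_quorum(a, cluster_size) and has_quorum(b, cluster_size):
        return a + b > cluster_size
    return True

def find_counterexample(trials=10_000, seed=0):
    # Randomly sample cluster sizes and vote counts, QuickCheck-style,
    # and return the first input that violates the property.
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 9)
        a = rng.randint(0, n)
        b = rng.randint(0, n)
        if not quorums_overlap(a, b, n):
            return (a, b, n)
    return None

# Hand-picked unit tests on odd-sized clusters pass...
assert has_quorum(2, 3) and not has_quorum(1, 3)
assert has_quorum(3, 5) and not has_quorum(2, 5)

# ...while random generation stumbles onto an even-sized cluster.
counterexample = find_counterexample()
print(counterexample)
```

The unit tests only cover the cluster sizes someone happened to think of. The random search finds the even-sized case without being told to look for it, but only because it got lucky within its trial budget, which is exactly the nondeterminism mentioned above.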

The better way is to go fully deterministic, and prove out that your algorithm works, either by exhaustively checking all possible interleavings (code paths) with a model checker, or by mathematical proof.
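As a rough illustration of what "exhaustively checking all possible interleavings" means, here is a minimal Python sketch. It is a toy, not a real model checker like Spin; the two-process non-atomic counter and the names `run`, `all_schedules`, and `violations` are assumptions chosen for the example. Each process reads a shared value and then writes back that value plus one, and the checker tries every possible ordering of those steps.

```python
from itertools import permutations

def run(schedule):
    # Execute one interleaving. Each process performs a non-atomic
    # increment in two steps: read the shared value, then write value + 1.
    shared = 0
    local = {}
    step = {0: 0, 1: 0}  # how many steps each process has taken so far
    for p in schedule:
        if step[p] == 0:
            local[p] = shared       # step 1: read
        else:
            shared = local[p] + 1   # step 2: write
        step[p] += 1
    return shared

def all_schedules():
    # Every distinct ordering of two steps from process 0 and two steps
    # from process 1. Program order inside each process is preserved
    # because run() tracks each process's own step counter.
    return sorted(set(permutations([0, 0, 1, 1])))

# Invariant: after both increments, the counter should equal 2.
violations = [s for s in all_schedules() if run(s) != 2]

for s in violations:
    print("lost update under schedule", s)
```

With two processes and two steps each there are only six schedules, so the search finishes instantly. The number of interleavings grows combinatorially with more processes and steps, which is why real model checkers rely on state-space reduction and can still run out of time or memory.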

Historically, the model checkers (e.g. Spin or Groove) have used a pseudolanguage that you describe your algorithm in, which is then exhaustively run to completion. This approach can prove that you are on the right track, but since you cannot run those pseudolanguages in production, they are not the complete solution, because you must then translate the pseudolanguage into source code of your chosen language. This is nontrivial and very frequently there are subtle transcription errors.

An alternative approach is to use a model checker that works on your native language directly; e.g., Erlang has the brilliant Concuerror program, which instruments your actual code. This is great because if it can verify that your code works properly, then you are done; no transcription is needed. Nevertheless, I don't believe there are yet any public distributed consensus implementations verified with Concuerror, even in Erlang. I would love to be mistaken on this point.

The last approach is to take a proof assistant language like Coq or Idris, and formally prove out the properties of the algorithm using mathematical proof techniques. This is probably optimal, because the exhaustive model checkers can, with complex/slow enough algorithms, run forever or run out of memory trying to test all the cases. However, Coq and Idris are not exactly popular languages and at this time it's not easy to implement line-of-business applications with them. So although there is a proven, correct implementation of Raft in Coq that guarantees linearizability, good luck accessing it. If you don't use Coq, then you're forced to transcribe it to your chosen language, which, as before, is error-prone and does not result in a proven implementation.

It would be possible to mechanically transcribe Coq/Idris code into a more common language while maintaining its provability, but to my knowledge that hasn't happened yet. More likely is that Idris and its successors inherit mainstream language features and start making inroads.

Note also that maybe you don't care about being correct. For example, in the spirit of moving fast and breaking things quickly or whatever, at least one notable VC-funded project in the news has taken the approach of just increasing timeouts in order to mask bugs in their implementation. And 99.999% of the time that will probably work fine, just as centralizing your database into one king-hell instance and just dealing with downtime every blue moon will probably be fine too. Own your own reliability alpha.

liveoneggs · on Oct 3, 2015

ants (the bug) is an analogy for erlang workers