Formal Methods in Building Robust Distributed Systems (mvdirona.com)
109 points by ctdean on July 3, 2014 | 9 comments



The linked paper is quite interesting [0]. It does indeed sound like TLA+, the formal methods tool set they used, has worked out very well. One quote: "[W]e have found that software engineers more readily grasp the concept and practical value of TLA+ if we dub it: Exhaustively-testable pseudo-code. We initially avoid the words ‘formal’, ‘verification’, and ‘proof’, due to the widespread view that formal methods are impractical."

[0] http://research.microsoft.com/en-us/um/people/lamport/tla/fo...


Here's an interesting blurb about how they were using TLA+. TL;DR is that, unsurprisingly, model checking beats proving in ROI:

We have found formal specification and model-checking to yield high return on investment. In addition to model checkers, many formal specification methods also support formal machine-checked proof. TLA+ has such a system. The TLA+ proof system has several compelling benefits; for example, it supports hierarchical proof. After doing only a small number of such proofs, author C.N. has found that hierarchical structure is an effective way to handle the otherwise overwhelming amount of detail that arises in formal proofs of even small systems. Another benefit of the TLA+ proof system is that it takes as input the same specification text as the model-checker. This allows users to find most of the errors quickly using the model-checker, and switch to the proof system if even more confidence is required. Even though formal machine-checked proof is the only way to achieve the highest levels of confidence in a design, use of formal proof is rarely justified. The issue is cost; formal proof still takes vastly more human effort than model-checking. We are continuing to experiment with the TLA+ proof system, but currently model-checking remains the sweet spot in return on investment for our problem domain. We suspect that proof will only be a worthwhile return on investment for one or two of the most critical core algorithms.


r/programming discussion of this paper here: http://www.reddit.com/r/programming/comments/277fbh/use_of_f...


The Envisage Research Project - http://envisage-project.eu - is developing formal methods for software engineering for the cloud, ref: http://envisage-project.eu/wp-content/uploads/2013/10/Envisa... "ENVISAGE will create a development framework based on formal methods to include resources and resource management into the design phase in software engineering for the cloud. This will improve the competitiveness of SMEs and profoundly influence business ICT strategies in virtualized computing."


I have used Computational Tree Logic before (https://www.firebase.com/blog/2014-02-04-firesafe-complex-se...)

I wonder if anybody knows what the main differences between CTL and TLA are. Maybe I should switch camps?

EDIT: oooh, you can read the book for free http://research.microsoft.com/en-us/um/people/lamport/tla/bo...

EDIT2: Ahh... TLA has sets for one thing


Can someone give an introductory rundown on model checking vs theorem proving, and why someone wouldn't just want to write these systems/algorithms in a dependently typed language?


Here's my understanding:

In model-checking, you describe an abstracted version of your system and you can automatically check some properties. Your model has to be small enough (in terms of number of states) to be processed by the tool, and the class of formulas you can express may be limited. Typically, you can check safety properties (such as assert statements in your code).

On the other hand, using a theorem prover (I think "proof assistant" is a better term), you can work on more accurate representations of your system and express any property, but most proofs have to be done manually, which requires time and expertise.

As for dependently typed languages, I think most of them are not expressive enough to state complex properties such as "the system will eventually reach a consensus if no more than half of the processes fail". And those that are expressive enough are more akin to theorem provers.
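
For a flavour of what such a property looks like written down, here's one way it might be phrased in temporal-logic notation (illustrative only; □ (\Box) reads "always", ◇ (\Diamond) reads "eventually", and Failed / ConsensusReached are made-up predicate names for the failure set and the decision condition):

    \Box\,(|\mathit{Failed}| \le N/2) \;\Rightarrow\; \Diamond\,\mathit{ConsensusReached}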


Since others have described the high-level difference between checking and proving, I will point out that you might not want to use a dependently typed language even if you plan on doing some proving.

A large cost of formal verification is designing and implementing specifications. For this reason, even if you plan to write proofs for some components, it still makes sense to use a tool which supports both model checking and theorem proving from a single specification, especially if your interests are non-academic.

Type-theoretic proof assistants typically don't support model checking (or, it's at least fair to say it's not in the culture even where supported). So to use model checking, you have to re-implement the specification.


I'll give it a try, briefly. For both, you first need to give a mathematical meaning to your system, i.e. represent the system as some kind of well-defined mathematical object.

A typical choice (especially for concurrent/distributed systems) is a transition system. To create a transition system, you first need to describe the state of your (entire) system. This could, for instance, be the current values of all variables used in all processes of your distributed system and the set of all messages currently in the network. Next you need to model transitions: ways for the state to evolve. These would look something like "process p_i increments its variable x from 0 to 1, and sends a message to process p_j".
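
To make that concrete, here's a minimal Python sketch of such a transition system (the toy protocol and all names are made up for illustration): the state is the value of x at each of two processes plus the set of in-flight messages, and each transition is one atomic step.

    # Illustrative sketch only. A state is (x value per process, frozenset of
    # in-flight messages); frozensets keep states hashable for later exploration.
    INITIAL = ((0, 0), frozenset())

    def transitions(state):
        """Yield every state reachable from `state` in one atomic step."""
        xs, msgs = state
        for i in range(len(xs)):
            if xs[i] < 2:  # bound x so the model stays finite
                # Process p_i increments its variable x ...
                new_xs = xs[:i] + (xs[i] + 1,) + xs[i + 1:]
                # ... and sends a message with the new value to the other process.
                j = 1 - i
                yield (new_xs, msgs | {("val", i, j, xs[i] + 1)})
        for m in msgs:
            # Delivering (here: simply consuming) a message is also a transition.
            yield (xs, msgs - {m})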

Given such a transition system you can now ask different, well-defined questions about it. The simplest are "safety" properties, where you ask whether your system can reach a state which is "bad" by some criterion. In a distributed consensus algorithm, for instance, you could ask "can it be that two replicas decide on different values?", or "can it be that my processes are all in the waiting state (deadlock)?". You need to phrase this question in some kind of formal language the tool can understand.

Model checking and theorem proving then go about answering the question in different ways. Model checking, roughly, tries to use brute force to answer the question and requires no human interaction in doing so. You could imagine it feeding every possible input to every process, choosing every possible interleaving of messages and, for every state reachable in such a manner, checking whether the bad thing happens.
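
As a rough sketch (not how any particular tool is implemented), you can picture that brute-force exploration as a breadth-first search over all reachable states that flags the first state satisfying the "bad" predicate. It could be pointed at the toy transition system above; the tiny counter at the bottom is only there to make the snippet self-contained.

    from collections import deque

    def check_safety(initial, transitions, is_bad):
        """Explore every reachable state; return a bad one if it exists, else None.

        Toy explicit-state model checking: real checkers add counterexample traces,
        symmetry reduction, symbolic state representations, and much more.
        """
        seen = {initial}
        frontier = deque([initial])
        while frontier:
            state = frontier.popleft()
            if is_bad(state):
                return state              # a reachable "bad" state: safety is violated
            for succ in transitions(state):
                if succ not in seen:
                    seen.add(succ)
                    frontier.append(succ)
        return None                       # no reachable state is bad

    # Hypothetical usage: a counter that may be bumped while below 5; the safety
    # question is "can it ever exceed 4?" (it can, so a violating state is found).
    def bump(state):
        if state < 5:
            yield state + 1

    print(check_safety(0, bump, lambda x: x > 4))  # -> 5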

In theorem proving, you try to provide the rationale for why things can't go wrong in the form of theorems. This often follows the informal reasoning. E.g. suppose we're in the "earliest" bad state s; then previously X must've happened, which must've been caused by Y previously happening, which must've happened in some state s' which was also bad but earlier than s, which is a contradiction. However, you also have to convince the theorem prover that your reasoning is sound. So first you need to understand precisely what methods of reasoning you are using (e.g. my example essentially relied on induction, which might not be clear to most people), and you also need to understand somewhat how the prover "ticks" and what kinds of reasoning steps it can perform automatically.
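
To give a flavour of "convincing the prover", here is a tiny Lean 4 sketch (purely illustrative, not tied to TLA+ or any particular system): even the obvious-looking fact 0 + n = n has to be argued by induction, with the induction hypothesis invoked explicitly.

    -- Lean 4, illustrative only: 0 + n = n needs induction because Nat.add
    -- recurses on its second argument, so the prover cannot just compute it.
    theorem zero_add' (n : Nat) : 0 + n = n := by
      induction n with
      | zero      => rfl                    -- base case: 0 + 0 reduces to 0
      | succ n ih => rw [Nat.add_succ, ih]  -- step: rewrite using the hypothesis ih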

So theorem proving requires more training and effort than model checking. On the flip side, model checking typically works only for finite systems. So your model (transition system) could only have, say, 3 or 4 processes, and the range of the variables would be restricted (say my variable x could only range between 0 and 4), as would be the number of messages in the network. Obviously, this doesn't give you such strong guarantees - maybe your system works fine for 4 processes, but does something wrong when you have 5 of them.

Depending (hah!) on the type system used in your dependently typed language of choice, those languages are actually a kind of theorem prover! This includes Coq, but also Idris and Agda. They exploit the so-called Curry-Howard correspondence, which equates mathematical propositions to (dependent) types, and proofs of those propositions to terms (programs) of the given type.
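
A one-liner in Lean 4 (again, just an illustration) shows the correspondence: the type of this definition is the proposition "if p implies q and p holds, then q holds", and the definition itself is its proof.

    -- Curry–Howard in one line: the type is a proposition, the term is its proof.
    def modusPonens {p q : Prop} (hpq : p → q) (hp : p) : q := hpq hp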



