He's quite correct that you can write programs for which there's no faster way to find out if they halt than running them. His example is one that tests Goldbach's conjecture by brute force. A more practical example is trying to break a crypto key by brute force trials. Or a Bitcoin miner.
His analysis applies to the worst case. Most programs aren't like that. If your program is that close to undecidability, you probably need some kind of a loop counter to keep it under control.
Microsoft has a Static Driver Verifier for Windows kernel drivers. It's a formal proof of correctness system using symbolic execution. It tests whether a driver will halt (return, really), call any of the driver APIs incorrectly, or violate the execution model by subscripting out of range, using a bad pointer, or other crash-type error. It doesn't try to verify that the driver will run the device correctly, just that it won't take the entire kernel down. This is the main reason that driver-caused crashes stopped being a major source of Windows problems in Windows 7 and later. Before the Static Driver Verifier, more than half of Windows crashes were caused by driver bugs.
Microsoft's experience is that 5% to 10% of the time, the Driver Verifier can't decide if a program will get to a bad state. The Driver Verifier has explored too many cases and gives up. This is usually an indication of a flaky driver. If the Verifier can't tell if the driver will hang, the programmer who wrote it probably can't either. So the driver needs to be fixed. This does not turn out to be a serious problem in practice.
(I've worked on proof of correctness systems. There are lots of hard problems, but this just isn't one of them.)
I'm the speaker. First of all let me say that much of what I do these days formal specification and verification, so I am aware of what's possible in practice and what isn't, and the blog post does go into some detail. Also, I am well aware of Microsoft's great accomplishments with driver verification -- SLAM was a landmark project -- but those results don't seem to translate well to other domains. As I say in the beginning, the main purpose of the talk is to explain why the problem is so hard, not to list the special cases that are feasible in practice (although I do quickly go over some approaches).
Nevertheless, it is not true that all the complexity theory results I present are worst case. Of particular interest is the result by Schnoebelen from 2005 (25:16 in the talk, but it's better to look it up in the post) that showed that verification isn't FPT (fixed-parameter tractable), which is very strong evidence that we cannot find a good method way to restrict programs or structure them in a certain way -- even in practice -- so that verification becomes tractable. Another is the result Schnoebelen quotes of symbolic model checking (28:20, but better in the post), i.e. the ability to use syntactic constructs and abstractions to aid in verification), which, as he says, has been shown to hold in practice and is now considered an empirical fact. Of course, the fact that feasible verification of nonlocal properties is unsolved and that virtually all efforts attempt to tackle special cases (or don't care about feasibility), shows that the problem is very hard in practice, and I talk about that towards the end of the talk/post.
Also, the Goldbach example has little to do with brute force. It is about the difficulty proving a seemingly simple reachability problem on a 10-line program. That the program itself uses brute force is irrelevant, and only used to make it simple and short.
There are lots of things you can verify if you don't choose to bang your head against unsolved problems in number theory.
The first stage is simply verifying that the program stays within the language - no integer overflow or underflow, no subscripts out of range, no bad pointers. If you can't do that for a program, the program is just broken. It should be possible to do this for much "unsafe" Rust code, such as collection classes. To deal with collection classes, you need to be able to talk about an array for which only some of the elements are valid. Rust lacks syntax for that, but many verification systems have ways of talking about such things. Simple inductive proofs can handle situations such as growing or compacting a collection.
The next step is verifying validity semantics for data structures. Is a tree valid before and after an insertion - that sort of thing. There are good ways to do that. One way is to add information to the tree, such as back pointers and depth counts, and verify that the updating operations maintain the invariant that back and forward pointers are consistent, and that depth counts decrease towards the root. That info doesn't have to be carried along into the executable code if it can't affect the results. We did this in the Pascal-F verifier, and it was easy to explain this to programmers as a way of inserting checking code that had no overhead.
Specifying the problem formally remains tough. There are formal specification languages. I used one of the first, SRI's SPECIAL. They're useful for some kinds of problems. Industrial control often has well-defined constraints on the overall system. Databases do something which can be simply defined; consider a database as a tuple store with linear search. An efficient implementation of a database is quite complex, but the desired end result is defined by a simple model.
Security properties can often be addressed formally, and DoD has put a lot of work into that. Anyway, enough for tonight.
From the paper: "I hope it’s clear by now that it’s unlikely we’ll find a general useful category of programs for which verification is affordable, as verifying even the most simple computational model, the FSM, is in general also infeasible."
The basic approach of this paper is to generalize the problem until it becomes unsolvable. Others less inclined to claim the problem is impossible have succeeded in verifying some useful systems.
Useful reading:
"Formal Verification of floating point RTL at AMD."[1] After the famous Pentium floating point bug, AMD decided to use formal proofs of the correctness of their FPU, starting from the Verilog representation used to create the chip. They found several bugs, which were corrected before making the chip.
L4 microkernel verification: "The binary code of the seL4 microkernel correctly implements the behaviour described in its abstract specification and nothing more. Furthermore, the specification and the seL4 binary satisfy the classic security properties called integrity and confidentiality."[2]
I remember that old ACL2 paper, they didn't perform verification on just any RTL module written in Verilog, they restricted it to a subset of Verilog that most matched their ACL2 language, it was a lisp like functional language, and their ACL2 system was descendant of the Boyer-Moore thereom prover Nqthm. Anyway they translated that subset of Verilog into ACL2, and then their prover would try do verification by using term rewriting. To use their prover you had to basically find a functional version of whatever program you wanted to verify. On top of that, there was no guarantee that it would perform verification successfully, you would either get a result or ACL2 would just hang/or fail! Its mode of use is very similar to Microsoft's SLAM you mentioned earlier, if it took too long then you really had no choice but to rewrite. Verifying any Verilog module was way outside its capabilities and by today's standards it wasn't a very sophisticated mechanical prover.
Anyway, the way verification is done these days is too reduce the state space as much as possible and check the FSM, that's not "generalizing the problem until it becomes unsolvable". There are techniques post 1990's to do just that for example abstractions and minimization, partial order reduction, and more famously symbolic model checking. But user pron is right, most useful programs in production are just not feasible to formally verify because even after using many of the state of the art techniques to reduce the state space of the resulting FSM, the FSM is still too large. The option you have today, if verification is important to you, is to structure your program so that its verification by mechanical provers is tractable.
I apologize for any typos/grammar errors, I'm on my phone. Also for anyone interested in the history of model checking check out Turing Laureate Edmunde Clarke's paper "Birth of Model Checking", if your interested his textbook "Model Checking", and a more recent text is "Principles of Model Checking" by Baier&Katoen.
BTW, one of the coolest modern model checking techniques is "counterexample guided abstraction refinement" (CEGAR) which automatically soundly reduces (abstract) the state space, tries to prove the property, and if it fails it refines the abstraction (enlarging the state space) and tries again.
> The basic approach of this paper is to generalize the problem until it becomes unsolvable
This is simply not true. The talk doesn't say it's unsolvable. It explains why it is hard. It is based on results of researchers in the software verification community. Unfortunately, I don't know how to teach a complex subject without starting with the general and then moving on to the specifics -- which is exactly how the talk is structured: from decidability to complexity, to total programs, to FSMs, and finally to verification methods. How can I explain the idea of reducing the state space by abstract interpretation and type systems, before explaining why the size of the state space is what matters most? If you have a better way of introducing a theory, I'd love to hear it. Honestly, I have no idea how you'd understand verification tools if you don't know that, say, verifying an FSM language is PSPACE-hard. That's at the very core of verification and the design of the tools.
The last two (out of four) chapters of the post explore what is solvable and how. This is how chapter 3 beings:
Now we know why writing correct programs is hard: because it has to be. But while we cannot verify all programs all the time – regardless of how they’re written – there’s nothing stopping us from verifying some programs some of the time for some properties. As we’ll see, none of the software verification methods manage to overcome the limitations we have explored, and it is precisely those limitations that affect the design and tradeoffs of every verification approach.
The idea is that if you understand the problem, namely that the size of the state space is what determines the difficulty of verification, you'd understand how those methods work when they do and why they fail. Very broadly, there are two approaches: reduce the state-space in a sound way (this is called abstraction) and hope that the property you wish to verify is captured by the abstract state space (this is how SLAM works), or not, and work really hard.
If you claim that verification is easy, then you're wrong both in theory and in practice. If you claim that some useful properties are easy and some may be verifiable with a lot of effort, then you're in complete agreement with experience and what I said (which is no more than a summary of known results in verification).
> Others less inclined to claim the problem is impossible have succeeded in verifying some useful systems.
In my job, I formally specify and verify large, complex distributed systems. My success comes from an understanding of the difficulty, which helps guide the way towards what's possible, rather than from wishful thinking. Just as you can't succeed in programming if you don't understand the difficulty (and therefore the importance of readability and structure, for example), if you don't understand the difficulty in verification, you simply will not succeed. You'll either bang your head against the wall or fail to understand how to help the tools.
> L4 microkernel verification
The effort that went into writing the seL4 -- a mere 9KLOC C program -- was, IIRC, over 20 person-years. Someone on that team told me they could have done it in 5 if they had had all the tooling in advance, but we're still talking about significant cost (5x instead of 20x), carried out by a team of (rare!) experts trained in the task. Plus, seL4 was "dumbed down" significantly so that verification would be viable at all, even at that effort. They didn't verify a general micro-kernel, but wrote a much simplified microkernel so that it would be amenable to verification.
For seL4, as I understand it their trials and tribulations also included false starts, like trying CompCert before moving to Magnus Myreen's generate logic from binary code and then meet in the middle approach.
And they've released all this, GPLed, so hopefully others (including maybe myself when I get the time) can continue doing this more efficiently.
Absolutely, but if you ask them, they will tell you that they don't think their process could be done efficiently enough for general purpose software. More importantly: it shouldn't be. End-to-end verification is absolutely unnecessary for all but the most critical pieces of software. No one believes this can be done easily.
When doing software verification we must keep our eyes open, know that it is hard and why:
* There is no general solution
* The difficulty is not a result of code structure, organization or syntactic abstractions.
* The difficulty cannot be avoided with languages -- through totality or abstractions (although some aspects regarding local properties can certainly be aided by language design; see below).
* The difficulty cannot be modularized away.
* The main issue is the size of the state space.
So the main idea is reduction of the state space (and heuristic approaches -- like SAT solvers -- that allow us to tackle larger state spaces). For reduction of the state space to be useful we need to find properties that survive the abstraction and are still important (memory safety is a classic example). The more local the property, the greater the chance that it can survive reductions. Which local properties are important is an empirical question which must be studies.
As for the rest, it's up to careful design, use of simplified algorithms, a lot of grunt work, and a lot of experience, which often involves a manual reduction of the state space.
>> [Quoting Turing] We shall need a number of efficient librarian types to keep us in order.
Uh... I think "librarian types" means "people who think like librarians". It's a common colloquialism in some places: you have your entrepreneur types, your manager types, your artist types, and your librarian types.
Of course he meant people, but it may surprise some to learn (although it shouldn't; he wrote about a lot of topics) that Turing actually did write about type theory: https://projecteuclid.org/euclid.jsl/1183410407
Oops, sorry! I thought for a moment that it might be a joke, but I couldn't see any other indication of that. Probably it came across better on the video than the transcript.
A driver is a pretty small program, and as you say, the "verifier" ("checker" would have been a better term) doesn't even try to verify that the driver will run the device correctly.
Actually proving correctness of large programs like compilers and operating systems is not considered an easy problem. You surely know that; I don't know what you're thinking when you dismiss the problem.
What proof-of-correctness systems have you worked on?
Proof-of-correctness is an irrelevent concept. You can't specify the correct solutions to all edge cases for non trivial problems. You can however add constraints and test for those especially when you are willing to deal with a bunch of edge cases manually.
First time I did static analysis I knew there was a lot of edge cases I could detect, but not deal with. But, in practice my code dealt with 99.97% percent of cases which was easily enough be be extremely useful.
He's quite correct that you can write programs for which there's no faster way to find out if they halt than running them. His example is one that tests Goldbach's conjecture by brute force. A more practical example is trying to break a crypto key by brute force trials. Or a Bitcoin miner.
His analysis applies to the worst case. Most programs aren't like that. If your program is that close to undecidability, you probably need some kind of a loop counter to keep it under control.
Microsoft has a Static Driver Verifier for Windows kernel drivers. It's a formal proof of correctness system using symbolic execution. It tests whether a driver will halt (return, really), call any of the driver APIs incorrectly, or violate the execution model by subscripting out of range, using a bad pointer, or other crash-type error. It doesn't try to verify that the driver will run the device correctly, just that it won't take the entire kernel down. This is the main reason that driver-caused crashes stopped being a major source of Windows problems in Windows 7 and later. Before the Static Driver Verifier, more than half of Windows crashes were caused by driver bugs.
Microsoft's experience is that 5% to 10% of the time, the Driver Verifier can't decide if a program will get to a bad state. The Driver Verifier has explored too many cases and gives up. This is usually an indication of a flaky driver. If the Verifier can't tell if the driver will hang, the programmer who wrote it probably can't either. So the driver needs to be fixed. This does not turn out to be a serious problem in practice.
(I've worked on proof of correctness systems. There are lots of hard problems, but this just isn't one of them.)