I actually find this (as a big C and C++ fan) to be quite a damning result for C.
It's easy to say "just program well" when bugs occur in badly written projects, but SQLite is extraordinarily well tested, and generally considered well written.
As others have said, parsers for complex formats (be it text-based or binary) are exceptionally hard. There are some classes of C/C++ software where the choice of a language doesn't have such a striking effect. For parsers, the effect is hard to ignore.
But parsers are also the kind of stuff you almost always end up writing in C/C++, and there are semi-compelling reasons for doing so - chiefly, performance and flexibility. You can disagree and make your pitch for Ocaml or JavaScript or whatever, but really, if we had clearly superior choices, we wouldn't be dealing with this problem today (or it would be a much more limited phenomenon). There are some interesting contenders, but the revolution won't happen tomorrow, no matter how much we talk about it on HN.
Perhaps a more fitting conclusion is that if you are parsing untrusted documents, our brains are too puny to get it right, and the parser really needs to live in a low-overhead sandbox. Mechanisms such as seccomp-bpf offer a really convenient and high-performance way to pull it off.
I agree with sandboxing, but a problem is that by using something like seccomp-bpf, you're turning highly portable C code into a highly platform-specific and even architecture-specific system. (Well, it won't be architecture-specific if you use the old plain seccomp, which is probably all you need; see the sketch below.)
You also need to deal with serialization and deserialization of the parse tree, which isn't super hard, but people still get it wrong.
Writing parsers is already hard, but writing secure parsers is at least 2x harder than that, or maybe 4x if you have to port it to 3 platforms. There really needs to be some kind of "libsandbox" for parsers.
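In case it helps anyone: the "old plain seccomp" route mentioned above really is tiny. A minimal sketch (Linux-only; parse_untrusted() is a hypothetical stand-in for your parser, and everything it needs must be set up before entering strict mode, since even malloc's brk/mmap calls are off-limits afterwards):

    /* Strict-mode seccomp: after the prctl() the kernel only allows
     * read(), write(), _exit() and sigreturn(); anything else is SIGKILL. */
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical parser entry point; stubbed here so the sketch links.
     * It just echoes input to output. */
    static int parse_untrusted(int in_fd, int out_fd)
    {
        char buf[4096];
        ssize_t n;
        while ((n = read(in_fd, buf, sizeof buf)) > 0)
            write(out_fd, buf, (size_t)n);
        return n < 0;
    }

    int main(void)
    {
        /* Open files, preallocate buffers, etc. BEFORE this point. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
            perror("prctl(PR_SET_SECCOMP)");
            return 1;
        }

        int rc = parse_untrusted(0, 1);  /* read stdin, write result to stdout */

        /* glibc's exit() uses exit_group(), which strict mode forbids,
         * so leave via the raw exit syscall. */
        syscall(SYS_exit, rc);
        return 0;                        /* not reached */
    }

The seccomp-bpf variant is not much longer, but you do have to write (and get right) the BPF allow-list yourself.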
I think that Haskell's parser combinator libraries, like Parsec, are clearly superior for most parsing tasks.
Parsec is about as concise as a formal grammar, but it is real code rather than a code generator, so you don't have the extra complexity of a parser-generator. Parsec parsers are type-safe, so there's no way that you'd get something back that wasn't a valid AST (or a parse error). Error handling is also quite good.
Parsec in particular is not especially fast - there are other libraries like attoparsec and cereal (for binary serialization) which trade off some of the flexibility of parsec for improved performance.
I think Haskell programmers mostly use these libraries because they are clearly superior to the alternatives (for instance, there's a lot less interest in regexes in Haskell-land, because when you have a really good parsing library there's much less reason to reach for them). C programs like SQLite aren't using this approach because monads can't be expressed in the C type system and there is no syntactic sugar for them in C, so Parsec-style code would end up much less pleasant and much less safe than in Haskell.
Many of the horribly vulnerable parsers are generated with Bison / Flex, so it's not exactly a robust solution. Plus, especially for binary formats (images, videos, etc), it's hand-written or bust.
I agree, but lexers and parsers are especially stateful pieces of code.
I'm not a static typing person (I like Python, bash, and C), but I have resolved to write all future parsers in OCaml or a similar language.
For other types of code, if you don't have a code path explosion, you can do OK in C. For example, numerical simulation code.
Unfortunately async code in C like Nginx tends to also produce this code path explosion. And kernel code is also littered with state machines.
Another interesting thing is that SQLite's parser is generated with Lemon. It's not even a hand-written parser in C (like Lua), which would be much worse. I guess all the crashes were in the semantic actions of the parser?
I've written parsers with ANTLR, lex/bison, and other tools, and I've also written parsers by hand.
With the generators, it's very hard to get several features of a high-quality parser right: error messages and error recovery in particular, but also handling of more esoteric grammars (ones that are not entirely unambiguous, not LALR(1), and so on). Using a parser generator also complicates your build chain, complicates refactorings, usually means you get poor source tooling, and often slows down the edit/test cycle quite a bit.
At the same time, if you're doing it right, writing a simple LL(1) parser by hand really isn't that hard, and the resulting code doesn't have to be much longer than the spec in your generator's language. Even in languages that are not that DSL-friendly (e.g. Java), you can end up with very compact, readable code. Plus you get all your usual tooling to do so (e.g. IDE support).
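To make that concrete, here's a minimal sketch of the shape such a hand-written LL(1) parser takes, for a toy grammar expr ::= NUM ('+' NUM)*. Names and error handling are illustrative only; a real parser would track positions and check for overflow:

    #include <ctype.h>
    #include <stdio.h>

    struct parser {
        const char *p;   /* current position in the input */
        int failed;      /* sticky flag set on the first syntax error */
    };

    static void fail(struct parser *ps, const char *msg)
    {
        if (!ps->failed)
            fprintf(stderr, "parse error near '%c': %s\n",
                    *ps->p ? *ps->p : '$', msg);
        ps->failed = 1;
    }

    /* NUM ::= [0-9]+ */
    static long parse_num(struct parser *ps)
    {
        if (!isdigit((unsigned char)*ps->p)) {
            fail(ps, "expected a number");
            return 0;
        }
        long v = 0;
        while (isdigit((unsigned char)*ps->p))
            v = v * 10 + (*ps->p++ - '0');   /* sketch: no overflow check */
        return v;
    }

    /* expr ::= NUM ('+' NUM)*  -- one function per nonterminal, one token
     * of lookahead (*ps->p) per decision. */
    static long parse_expr(struct parser *ps)
    {
        long v = parse_num(ps);
        while (!ps->failed && *ps->p == '+') {
            ps->p++;                 /* consume '+' */
            v += parse_num(ps);
        }
        return v;
    }

    int main(void)
    {
        struct parser ps = { "12+34+5", 0 };
        long v = parse_expr(&ps);
        if (!ps.failed && *ps.p == '\0')
            printf("value = %ld\n", v);
        return ps.failed;
    }

The grammar stays visible in the code, and every decision is one token of lookahead away, which is what keeps it auditable.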
Error messages and error recovery are to some extent red herrings. You really do not want a production system to accept wrong inputs (and especially not try to 'recover' them into a usable parse tree). This will definitely make your parser ambiguous and can have very real security implications (say, Microsoft's anti-XSS transformations introducing XSS).
As to LL(1): many real-world protocols can't be parsed efficiently as LL(1).
Error messages really are a core feature of parsers. Would you want SQLite to just "return false" whenever there was a syntax error somewhere in your query?
By error recovery in the context of a parser I mean the ability to continue parsing with predictable results after encountering an error. This is not about automagically correcting errors; it's about being able to report more than one error to the user. Returning just the first error encountered sucks, as does returning a slew of non-errors caused by your parser getting messed up.
> As to LL(1): many real-world protocols can't be parsed efficiently as LL(1).
Not sure what you mean by "efficiently": a language either can or cannot be parsed as LL(1), depending on whether or not it's in that class. But in any case, it's still very straightforward to make LL decisions with a longer lookahead in hand-written code, and the decision code is often more efficient than what a generator would create.
The way I write parsers nowadays is to write a pseudo-code description of the grammar, then treat that pseudo-code as data that must be transformed into a valid parser.
The end result is a fast, memory-safe parser in C or C++, and a whole lot of ugly python code :)
I used to use hand-written recursive-descent parsers exclusively, until I discovered packrat-style parsers. They are crazy fast and so easy to write. Also, the parser generator I use (esrap) lets you fall back on recursive descent for parsing rules.
Do you have a link/citation for that? I always thought they were asymptotically linear due to the memoization, so comparable to LL, LALR etc, but in practice much slower due to the overhead compared to those more traditional approaches.
That's interesting -- do you have examples somewhere?
Are you saying you write a purpose-built parser generator in Python for every problem, with the goal of producing readable C or C++? (As opposed to the sometimes unreadable output of existing parser generators.)
I think it'd actually be better to use a hand-written recursive-descent parser. With RD the state is implicit in the code position and the call stack, so there's no need to manage it explicitly. AFAIK the SQL grammar isn't unusually difficult to parse.
I've used Flex/Bison before, and it makes it easy to get started, but with more complex parsers, threading the state around quickly gets irritating, and I could see how bugs could arise from it.
It's basically advocating the ML-style core shared by OCaml, Haskell, and now Rust. Those languages all feature algebraic data types and pattern matching, which were pioneered in ML. Rust is basically a cross between ML and C++, with C-ish syntax.
(None of C, C++, Java, JavaScript, Python, etc. have those features, and thus they are all suboptimal for writing parsers.)
A good clue is the name -- ML stands for "meta-language". It's literally a language for describing languages.
ADTs and pattern matching are basically two sides of the same coin: data vs. code. ADTs are how you describe variants in the data. Specifically, "sum types" are essentially type-safe tagged unions (if you are a C person).
And pattern matching is how you describe variants in the code. The compiler checks that you have considered all cases.
This helps you tame code with nested conditionals. Nested conditionals give rise to the state space and code path explosion that I mentioned.
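For the C people, here is roughly what the two sides of that coin look like with a tag-plus-union AST node (purely illustrative, not from SQLite). The switch is the spot where an ML-family compiler would check exhaustiveness and stop you from reading the wrong variant:

    #include <stdio.h>

    /* The C approximation of a sum type: a tag plus a union. */
    enum expr_kind { EXPR_NUM, EXPR_ADD, EXPR_NEG };

    struct expr {
        enum expr_kind kind;
        union {
            long num;                                /* EXPR_NUM */
            struct { struct expr *lhs, *rhs; } add;  /* EXPR_ADD */
            struct expr *neg;                        /* EXPR_NEG */
        } u;
    };

    /* The C approximation of pattern matching: a switch on the tag. Unlike
     * an ML-style match, nothing in the type system stops you from reading
     * e->u.add when the tag is EXPR_NUM, and a missing case is at best a
     * -Wswitch warning rather than an error. */
    long eval(const struct expr *e)
    {
        switch (e->kind) {
        case EXPR_NUM: return e->u.num;
        case EXPR_ADD: return eval(e->u.add.lhs) + eval(e->u.add.rhs);
        case EXPR_NEG: return -eval(e->u.neg);
        }
        return 0;   /* unreachable if every tag is handled */
    }

    int main(void)
    {
        struct expr two   = { EXPR_NUM, { .num = 2 } };
        struct expr three = { EXPR_NUM, { .num = 3 } };
        struct expr sum   = { EXPR_ADD, { .add = { &two, &three } } };
        printf("%ld\n", eval(&sum));   /* prints 5 */
        return 0;
    }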
Put me in the "most boring interests ever" category, but I'd be interested in a discussion of the actuarial modeling DSL mentioned in the link, if such a write-up exists.
I imagine actuaries are about the most technical "nontechnical" users there are so a DSL seems a particularly good fit.
Nonetheless, I agree. Eventually one cannot continue putting blame solely on the proximal cause.
It's too bad that, given the number of extensions x86 has accumulated, it still doesn't have anything like native array bounds checking. Even something as simple as "any write inside the stack but outside my stack frame, and any read inside the stack beyond my stack frame, should assert" (opt-in until the next return) would do wonders toward preventing some of these sorts of errors, although not all.
Because the equivalent sequence of individual instructions is faster. It's especially slow because it's microcoded and is kept around just for backwards compatibility, not to be fast. Modern CPUs are already good at reordering, branch prediction, etc. for simple instructions, so no extra silicon needs to be wasted duplicating that.
And encoding space is limited as well.
If individual instructions are faster, that means the instruction decoder isn't as good as it should be. BOUND wasn't fast because it didn't get any use, and it didn't get any use because people thought it was slow... but now that the usefulness of this instruction is apparent, it seems to me that they should make the decoder issue the necessary uops and make it at least as fast as the (bigger, more cache-wasting) sequence of simpler instructions.
Instructions like MOVS and STOS used to be slower than the equivalent series of simple instructions, but now they're much faster. Even the more obscure and far less useful BCD arithmetic instructions have been made on-par with a longer series of equivalent instructions[1] so not doing the same for the much more useful BOUND is... unusual. The amount of silicon needed to implement MPX (which includes several new instructions) seems far more than it'd take to make a BOUND decode faster.
Because the instruction raises a software exception if the test fails. This is a rather heavy-handed reaction, although I guess it kind of makes sense from the instruction's point of view.
Of course a language runtime based on using this instruction needs to capture the exception and handle it, core-dumping need not be the only option at that point.
Divide-by-zero has always been interrupt 0, and the page fault is at #14, and that hasn't been problematic...
BOUND does raise an exception if it fails but I think that's a good thing - if code is doing array bounds-checking, OOB accesses are very likely to be a bug (and possible security vulnerability) so aborting execution by default is the right thing to do.
I feel like this suggestion is a little misplaced. To use a scheme like this you must ultimately tell the hardware what the bounds are, and that layer is going to be error prone in the same ways as what it replaces.
From this vantage point I can say the abstraction that already exists for this is the page table. The MMU knows what regions are mapped and it will generate a fault for an out of bounds access. But within mapped regions the array bounds are all fiction. As they should be in the eyes of the hardware. That is for higher level layers to worry about.
I feel like the C-haters don't get that some layer of the system must work this way. By all means build your higher-level stuff on top. But things like virtual memory and allocators are well understood, and frankly, at that level there is little reason to give a damn that a[] has an element count of 5. At some point something must chop larger blocks into smaller ones, and create new bounds from nothing.
This would require using zillions of VM areas for all your language level objects, and aligning them on page boundaries. It's really not going to work.
Have a look at SoftBound and HardBound for some of the current thinking in this area that scales to object level granularity.
I think you misunderstand me. I am advocating the status quo, not for more granular pages.
Anyway, from what I have seen, some high-level languages actually do use the MMU for some things. For example, a null check at every access would be kind of lame, so runtimes just let it fault on the zero page and handle it by raising an exception. (I am 100% sure the CLR does this, and I believe the JVM does too.)
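For the curious, the mechanism is easy to demo in plain C on Linux: install a SIGSEGV handler and treat faults that land in the zero page as a recoverable "null pointer exception". A toy sketch only; real VMs do far more bookkeeping before resuming:

    #define _POSIX_C_SOURCE 200809L
    #include <setjmp.h>
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static sigjmp_buf recover_point;

    static void on_segv(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        if ((uintptr_t)info->si_addr < 4096)   /* fault in the zero page? */
            siglongjmp(recover_point, 1);      /* treat as a null-pointer "exception" */
        _exit(1);                              /* anything else is a genuine crash */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        volatile int *p = NULL;
        if (sigsetjmp(recover_point, 1) == 0)
            printf("%d\n", *p);                /* no explicit null check, just dereference */
        else
            puts("caught null dereference");   /* the runtime's "throw NullPointerException" */
        return 0;
    }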
On 32-bit x86, you could use segment registers to get bounds checking with overhead only at setup and not at use, effectively making bounds checks free in hot loops (http://www.blackhat.com/presentations/bh-usa-05/BH_US_05-Chi...). I believe that SpiderMonkey uses this trick. Doesn't work on x86-64 though.
Hmm... according to that page, the false positives and performance issues seem to largely derive from using the cache structure to store bounds externally. But if you manage the bounds explicitly as part of your data structures, and use BNDMOV/BNDMK rather than BND{ST,LD}X, it should be reasonably fast. What else could Intel do?
Have a read of the linked page again. A large chunk of the performance issues are because in many cases you can't just "use BNDMOV/BNDMK rather than BND{ST,LD}X".
There are a number of data structures where managing bounds explicitly has... issues. Namely, you end up with a lot of overhead. And the entire purpose of hardware support is to prevent the overhead.
Something like, as I said, an opt-in assert when you read past the end of your stack frame, or write before or after your stack frame, doesn't have the overhead. It doesn't prevent a lot of things - about the only thing it prevents is stack smashing - but it's better than nothing and doesn't have the overhead.
That will only work for direct writes to array variables on the stack, or pointers which can be proved to live on the current stack frame, but those are quite rare; e.g. in the vulnerability mentioned in the post, it's sprintf doing the overflowing write into its parent stack frame, but sprintf could be called with any pointer. If, say, you narrow the trapping region to some unused region of the stack frame to try to avoid this, you can get slightly better protection than a stack canary, but not much better. And most overflow vulnerabilities in modern software seem to be on the heap anyway.*
(Which is not to say primitives based on fine-grained tagged memory couldn't do some interesting things; OpenRISC has some form of this, but I haven't looked into it in detail.)
I don't see content on the linked page corresponding to data structures where managing bounds explicitly is difficult, other than the bnd_variable_size bit, which is described as rare.
*corrected to state that I'm only considering overflows, not, e.g., use-after-frees, which aren't related to bounds checks
Note that the SQLite authors did that intentionally. It runs on many, many systems, including ones with trivial amounts of RAM and ancient compilers. I've waged a years-long campaign to get them to at least use size_t and similar "modern" constructs, but no dice.
They also only use techniques that are at least 20 years old, to avoid patent issues.
Not really. There is this list [0] of security considerations for ports, which is a good starting point.
Everybody writing in the same style [1] doesn't hurt either.
Thought the same, because if you look at what he found..
>ranging from NULL pointer dereferences, to memory fenceposts visible only under ASAN or Valgrind, to pretty straightforward uses of uninitialized pointers (link), bogus calls to free() (link), heap buffer overflows (link), and even stack-based ones (link).
.. it seems that all those bugs are not even possible in Rust.
... in safe Rust, yes. But Rust isn't perfect; we _will_ have these kinds of issues crop up, especially with regard to unsafe code. Rust will significantly reduce these kinds of issues, but not completely eliminate them. There is no silver bullet.
Let it be said that the Rust community believes in defense in depth. A low-level language with safe defaults is only one piece of the equation; it's a complement to, not a replacement for, testing, auditing, fuzzing, sandboxing, and so on.
The only reason the parser was using unsafe code is that it is an extremely old part of the codebase, and likely required the unsafe blocks to hack around some deficiency in the compiler back in Rust 0.3 or thereabouts. Recent efforts to overhaul the parser have reduced this dramatically: there are two legitimate remaining uses of `unsafe` (dealing with libc stuff), and four uses of `unsafe` that all have associated FIXMEs.
Kibwen is correct with regard to the parser, but I'd also like to mention that in Rust you generally don't call malloc yourself. It's actually not even possible in stable Rust, though it's a high-priority API we'd like to stabilize.
I think "damning" might be a bit much -- it's got bugs. Like everything on earth, it's not perfect.
I was going to say it's full of meta-programming too, which should allow simple fixes to have far-reaching implications (indeed, bugs would have the same effect), but looking at the repo and build, I'm not seeing signs of meta-programming in effect. Fossil[0] (also initiated by Richard Hipp, and created explicitly to manage the SQLite codebase) is a meta-programming example, however. Point is: C is not untenable, still offers a lot, and has great tooling built up around it. Don't give up on it, but know its strengths and weaknesses.
Weakness: the best programmers in the world are empirically unable to write secure code in this language. The errors that cause the insecurity are trivial to prevent via language design, but very difficult for humans to reliably avoid when not protected by the compiler.
I know you meant that as a joke, but I think there's some truth there. It's so easy to become personally invested in the tools we use and instinctively defend them (even to the point of blaming ourselves for our tools' shortcomings) rather than try to switch.
Who has the best automated fuzzing team... how fast can they grab this stuff (seconds from a major open-source repository commit to automatic generation of a functional exploit)? There are sure to be teams out there in the seconds-to-minutes range for a broad range of cases already.
> I realized that the developers of SQLite maintained a remarkably well-structured and comprehensive suite of hand-written test cases in their repository
Does anyone have a good writeup on getting started with afl-fuzz? I'm a Python developer and looked into it the other day for fuzzing Python programs, but gave up after a bit because of the documentation (it seems geared towards lower-level languages).
Fuzzing Python programs with AFL will be hard. AFL leverages a wrapper around GCC to instrument (add additional code to) the resulting binaries. Because Python is not compiled to native code, this will be difficult. I'm sure there is something you could do to make it work.
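For contrast, the compiled-language workflow the docs assume is pretty small. A sketch for a C target (parse_buffer() is a deliberately silly stand-in for the code under test; the afl-gcc / afl-fuzz invocations are the standard ones):

    /* Build with AFL's compiler wrapper so the binary gets instrumented:
     *   afl-gcc -O2 -o harness harness.c
     * then fuzz it, with @@ standing for the path of each generated input:
     *   afl-fuzz -i testcases -o findings ./harness @@
     */
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the real code under test; replace with your parser.
     * It crashes on inputs starting with "BUG" so you can watch AFL find it. */
    int parse_buffer(const unsigned char *buf, size_t len)
    {
        if (len >= 3 && buf[0] == 'B' && buf[1] == 'U' && buf[2] == 'G')
            abort();
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2)
            return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f)
            return 1;

        /* Slurp the whole fuzz input into memory and hand it to the parser;
         * AFL notices crashes and hangs inside parse_buffer(). */
        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        rewind(f);
        if (len < 0) { fclose(f); return 1; }

        unsigned char *buf = malloc(len ? (size_t)len : 1);
        if (!buf) { fclose(f); return 1; }
        size_t got = fread(buf, 1, (size_t)len, f);
        fclose(f);

        parse_buffer(buf, got);
        free(buf);
        return 0;
    }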
There's a wrapper for CPython to get it to implement the tracing code, afl-python[1]. Alex Gaynor did a basic intro to it a few days ago which is a better write-up than the project itself has! Of course, this doesn't work so well with mixed C/Python code.
I did read the docs, and that's the conclusion I arrived at as well. I assumed there would be a layer that generated the inputs and checked the outputs, and that I could just use that, rather than use the full code checks, but it looks like that's not the case.