"What makes compilers hard to write, for someone who has never done it before, is the scope of the problem you're trying to solve. Writing a compiler, to spec, for a non-trivial language takes a lot of WORK."
I think it is useful to distinguish work from difficulty. IMO, on current hardware, writing a compiler is easy and, for most languages, not too much work.
However, if the language spec states what a compiler must do when given invalid programs, this can get much harder.
Writing a runtime also increases the amount of work and, depending on the runtime, can make things much harder.
Finally, if you want your compiler to produce decent code and/or decent error messages, things get harder, too.
Summary: IMO, writing a compiler is easy; writing a good one is hard and a lot of work.
> I think it is useful to distinguish work from difficulty. IMO, on current hardware, writing a compiler is easy and, for most languages, not too much work.
I disagree: writing a compiler is a lot of work; it's just easy work for people with experience. An expert climber might climb a modest mountain in 6 hours, where a novice might take all day and, assuming they don't get lost, still only make it halfway up before turning back.
Put yourself at the keyboard of someone who has never written a lexer. This programmer's idea of text processing is to use find-and-replace, shell globs, or maybe an ad hoc regex with grep or perl. They are used to programming sorting algorithms or writing polymorphic triangles and squares that inherit from the shape class. Their idea of a hard programming assignment is to solve the 0/1 knapsack problem with dynamic programming, where, as hard as it may be, the answer is still just a couple of functions.
They have to write code to accept the entire character set. Every single letter, number, special character and whitespace character that is valid in the language must be handled, or else the lexer isn't going to work properly.
They have to write code to recognize every single keyword, distinguish identifiers from literals and different kinds of literals from each other, and drop comments. Generally it's a good idea to tokenize every single operator as well. Even small languages can easily have dozens of keywords and operators, plus a number of different kinds of literals, many of which the programmer may have never used before (Bit shift? Octal literals?). This means writing the lexer will involve frequent references to a language specification they have no experience reading.
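For concreteness, here's a rough sketch (in Python, since the implementation language doesn't matter much) of what that job looks like. The keyword list, operator set and literal forms are invented for a made-up toy language, not taken from any real spec, and a real lexer would also track line and column numbers for error reporting:

    import re

    # A rough sketch of the kind of lexer being described, for a made-up toy
    # language. The keyword list, operator set and literal forms are invented
    # for illustration; they are not taken from any real spec.

    KEYWORDS = {"if", "else", "while", "return", "int"}

    TOKEN_SPEC = [
        ("COMMENT", r"//[^\n]*"),           # comments are recognized, then dropped
        ("FLOAT",   r"\d+\.\d+"),           # different kinds of literals must be
        ("HEX",     r"0[xX][0-9a-fA-F]+"),  # told apart, including ones you may
        ("INT",     r"\d+"),                # never have used yourself
        ("STRING",  r'"(?:\\.|[^"\\])*"'),
        ("IDENT",   r"[A-Za-z_]\w*"),       # identifiers, some of which are keywords
        ("OP",      r"<<|>>|<=|>=|==|!=|[+\-*/%<>=(){};,]"),  # longest operators first
        ("SKIP",    r"[ \t\r\n]+"),         # whitespace separates tokens, then vanishes
        ("ERROR",   r"."),                  # anything else is a lexical error
    ]

    MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def lex(src):
        for m in MASTER.finditer(src):
            kind, text = m.lastgroup, m.group()
            if kind in ("COMMENT", "SKIP"):
                continue                    # dropped, but still had to be handled
            if kind == "ERROR":
                raise SyntaxError(f"unexpected character {text!r} at offset {m.start()}")
            if kind == "IDENT" and text in KEYWORDS:
                kind = "KEYWORD"
            yield (kind, text)

    if __name__ == "__main__":
        for token in lex('if (x >= 0x1F) { return x << 2; }  // shift it'):
            print(token)

Even this toy version has to get small things right, like listing the two-character operators before the single-character ones so that ">=" isn't split into ">" and "=".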
And that's just the lexer, easily the most straightforward part of the initial phases.
This seems like nothing to someone who has done it before, but I assure you it is not for a novice. While none of it is supremely difficult, and indeed many algorithms are conceptually much harder than any one piece of a lexer, there are a lot of little details that must be addressed, and it's a lot of grunt work if you have no practice doing it.
If you read the Crenshaw tutorial, you'll see that he chooses a language that does not require much (any) work to lex. The language you learn to compile has no whitespace and uses single-character keywords and identifiers. This lets him delay lexing until chapter 7, but as you can see for yourself, that chapter is pretty long: about 2300 lines, and a lot of it is code.
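To see how much that choice buys him, here is roughly what "lexing" amounts to when every token is a single character and there is no whitespace to skip (again, just an illustration, not Crenshaw's actual Pascal code):

    def lex_single_char(src):
        # With single-character tokens, "lexing" is just classifying each character.
        for ch in src:
            if ch.isdigit():
                yield ("NUM", ch)
            elif ch.isalpha():
                yield ("IDENT", ch)
            else:
                yield ("OP", ch)

    print(list(lex_single_char("x=a+3*b")))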