Want to Write a Compiler? Just Read These Two Papers.

Jabbles · on Aug 26, 2011

"Just"? I make the first one 94,000 words. Has anyone actually read the papers to make sure that the OP has recommended something worthwhile? "Reading" and then coding alongside would probably take a full week's worth of time (I'd be interested to know if different).

It's a genuine question. People are recommended to read SICP all the time, by many influential people, but when a proper discussion of whether it's actually worthwhile comes up, we found a considerable range of opinions.

http://news.ycombinator.com/item?id=2846799

silentbicycle · on Aug 26, 2011

Yes. I've read them both, they're worth your time. The thing with most compiler texts is that they bog you down with a lot of stuff upfront about automata, parsing algorithms, etc., and by the time they get to the good stuff, it's hard to see how it all fits together. Crenshaw dives right in, and you quickly get a simple compiler running. You can look into all the alternative parsing algorithms, etc. later.

Also, run-don't-walk to Ghuloum's "An Incremental Approach to Compiler Construction" (http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf). I'd suggest that before the Nanopass paper, really.

After those, I recommend Niklaus Wirth's _Compiler Construction_ (http://www.inf.ethz.ch/personal/wirth/books/CompilerConstruc...) and Andrew Appel's _Modern Compiler Implementation in ML_ for follow-up, in that order. Wirth's book is a quick read, about on par with Crenshaw's. Appel's is more in-depth.

Another comment, linking several HN compiler threads: http://news.ycombinator.com/item?id=1922002

chollida1 · on Aug 26, 2011

> The thing with most compiler texts is that they bog you down with a lot of stuff upfront about automata, parsing algorithms, etc., and by the time they get to the good stuff, it's hard to see how it all fits together. Crenshaw dives right in, and you quickly get a simple compiler running. You can look into all the alternative parsing algorithms, etc. later.

I find this to be very true.

I've worked on commercial compilers and written a two of my own and I've come to believe that compiler writing is best done in stages.

- Learn a parser toolkit

- build your first AST

- learn why and how to transform the AST

- write an interpreter backend for your AST

- cross compile to another language

- write a backend for your compiler.

- optionally learn about intermediate forms and optimization, though this is seldom done on it's own.

silentbicycle · on Aug 26, 2011

If you learn compilers using Lisp / ML / etc., you can skip learning about parsing upfront and just use Lisp expressions. That's one of the reasons why the Lispers and MLers have so much cool compiler literature. If you're writing a compiler in C, you're stuck doing parsing and tree manipulation manually.

Nowadays, you could also use JSON, Python dicts, etc. for data literals. Either way - worry about syntax later.

sb · on Aug 26, 2011

While I agree with your point about starting, I think that parsing is a rather beautiful part of compiler construction by itself, with a nice theory behind it all, too.

Furthermore, if you are using Wirth style compilers, syntax-directed compilation comes rather naturally (at least to me). So I heartily second (and in fact have done so a number of times on HN) your initial recommendation for Wirth's "Compiler Construction", which is IMHO the canonical text to get somebody started. Instead of Appel's book, I find Cooper and Torczon's "Engineering a Compiler" much more comprehensive and illustrative (particularly the instruction selection and instruction scheduling parts.) Other interesting texts in the area are: Michael Scott's excellent "Programming Language Pragmatics" and Grune, Bal, Jacobs and Langendoen's "Modern Compiler Design" (both of which have a nice treatment of functional and logic programming languages [the latter one being more comprehensive])

silentbicycle · on Aug 27, 2011

I agree with you (and I've been researching Earley parsing and Pratt top-down operator precedence parsers lately), but: one thing at a time. While learning, it may be more helpful to experiment with AST transformations in Lisp and get immediate results, and wait on parsing until they get to the syntax design.

Also, too many toy interpreters/compilers mess around with syntax but have really conventional semantics. I suspect if people don't sink so much time into parsing etc. first, they may be more likely to focus on semantics instead.

true_religion · on Aug 27, 2011

I'm not so sure about that. People chance the sytax because that's an unfront difference between their language and others. Changing the semantics requires a leap for what can be perceive as little benefit (e.g. people will say "isn't this just C++ but with garbage collection?").

silentbicycle · on Aug 28, 2011

Eh. As far as I'm concerned, if the only difference is the syntax, what's the point? But I'm comfortable with the Erlang, K, Lisp, C, and Forth syntax, so "I grafted Ruby syntax on X" is pretty underwhelming.

aroberge · on Aug 26, 2011

Nice summary. Would you have any recommendation for good, easy to read, online resources on "learn why and how to transform the AST" and "write an interpreter backend for your AST"?

silentbicycle · on Aug 26, 2011

In particular, I suggest _Essentials of Programming_ Languages (http://eopl3.com/, AKA "EoPL"). Also, check out the _first_ edition - there's a really cool final chapter about compiling via continuation passing style (CPS, e.g. http://matt.might.net/articles/cps-conversion/) that got cut in later editions.

More generally - learn Lisp and/or ML, and go through any book that has you writing simple interpreters. EoPL, SICP, Appel's _Modern Compiler Implementation in ML_, "The Art of the Interpreter" (http://library.readscheme.org/page1.html), etc.

Locke1689 · on Aug 26, 2011

The most important book to avoid is the Dragon book. I know it's a classic, but it's outdated and incomplete. Appel's book is as close to a standard text as you're going to get.

AnthonBerg · on Aug 27, 2011

Not to mention badly written and laid out. Makes for inefficient reading.

onan_barbarian · on Aug 26, 2011

Good point; this topic has been discussed previously and a number of alternatives have been plausibly mentioned - Appel's book has a lot of fans.

The original article is very vague as to what the alternatives to reading a 'just' reading 94K words + a contentious paper on pass structure might be. One suspects heavyweight contenders like the Dragon Book, Muchnick, and Cooper and Torczon are being set up as strawmen.

There are a lot of compiler books which aren't suitable for the task; there is way too much in those last three books for someone who is just looking to 'bang out a compiler as quick as possible'.

Building a compiler by yourself is a noble goal, but the definition of success is pretty fluid.

Person A might build a heck of a compiler for a 'little language' by bolting an ANTLR frontend to an LLVM backend and learn a lot. Person B might hand-build a recursive descent parser and spit out a simple stack-based IR which she runs in an hand-crafted interpreter. Both would have learned a lot, in radically different directions.

timtadh · on Aug 26, 2011

I am not sure anyone would think you could learn how to write a compiler from scratch from Muchnick's book. It is a fantastic book, easily one of the best CS books I have ever read. However, the author makes it pretty clear from the get go that it isn't a book to learn to write a compiler. You should already know how to do that. It is a book to learn to write a really good production optimizing compiler. He expects a lot from his readers, and if you don't have the background you will quickly give up (as I did the first time I tried to read it).

I learned to write a compiler from the Dragon Book. I taught myself how to do it. It took me several years to become good at it. However, I consider that experience invaluable. While, the original article advocated just diving in to the compiler without a lot of background on lexing and parsing. I personally wasn't happy until I had written my own parser generator and fast regex engine (which I learn how to do from Russ Cox's excellent articles on the subject).

iskander · on Aug 26, 2011

I'm doing compiler research and I think reading both papers was a good use of my time. Compilation is a field full of poorly organized information, the authors are rare gems of clear thought.

perlgeek · on Aug 26, 2011

I have read Crenshaw's series (the first one that is recommended), and found it eye-opening how easy it is to write simple parsers and compilers.

That said he does use some obscure assembly of which I hadn't heard before, so I'd really love to see an update to a more mainstream platform.

narag · on Aug 26, 2011

IIRC there was a version of the code for x86.

telemachos · on Aug 26, 2011

A bit of Googling found this blog post[1] which lists a number of Crenshaw-based variants.

Two other quick finds: a more readable pdf version of the Crenshaw tutorial itself[2], and "an x86 compiler written in Ruby"[3] made by following along in the Crenshaw tutorial.

[1] http://blog.spacesocket.com/2011/06/24/lets-build-a-compiler...

[2] http://www.stack.nl/~marcov/compiler.pdf

[3] https://github.com/samsonjs/compiler

Goladus · on Aug 26, 2011

> Not surprisingly, the opaqueness of these books has led to the myth that compilers are hard to write.

Opaqueness of the books is not what makes everyone think compilers are hard to write. What makes compilers hard to write, for someone who has never done it before, is the scope of the problem you're trying to solve. Writing a compiler, to spec, for a non-trivial language takes a lot of WORK.

Regexes and grammars can be tricky to grok and walking abstract syntax trees can be hard as well. A hundred-pass compiler may make that part easier, but it almost certainly doesn't reduce the overall amount of work required to go from scratch to a working compiler, and that's where the "compilers are hard" reputation comes from.

(Incidentally, that doesn't mean the two sources mentioned aren't worth reading. Scanning both of them, they look excellent.)

Someone · on Aug 27, 2011

"What makes compilers hard to write, for someone who has never done it before, is the scope of the problem you're trying to solve. Writing a compiler, to spec, for a non-trivial language takes a lot of WORK."

I think it is useful to distinguish work from difficulty. IMO, on current hardware, writing a compiler is easy and, for most languages, not too much work.

However, if the language spec states what a compiler must do when given invalid programs, this can get much harder.

Writing a runtime increases both the amount of work, too, and, depending on the runtime, can make it much harder.

Finally, if you want your compiler to produce decent code and/or decent error messages, things get harder, too.

Summary: IMO, writing a compiler is easy, writing a good one is hard and a lot of work.

Goladus · on Aug 27, 2011

> I think it is useful to distinguish work from difficulty. IMO, on current hardware, writing a compiler is easy and, for most languages, not too much work.

I disagree, writing a compiler is a lot of work, it's just easy work for people with experience. An expert climber might climb a modest mountain in 6 hours where a novice might take all day and, assuming they don't get lost, still only make it halfway up before turning back.

Put yourself at the keyboard of someone who has never written a lexer. This programmer's idea of text processing is to use find-and-replace, shell globs, or maybe an ad hoc regex with grep or perl. They are used to programming sorting algorithms or writing polymorphic triangles and squares that inherit from the shape class. Their idea of a hard programming assignment is to solve the 0/1 knapsack problem with dynamic programming, where, as hard as it may be, the answer is still just a couple of functions.

They have to write code to accept the entire character set. Every single one of the letters, numbers, special characters and whitespace that is valid in the language must be handled, or else the lexer isn't going to work properly.

You have to write code to recognize every single keyword, distinguish identifiers from literals and different types of literals from each other; and you must drop comments. Generally it's a good idea to also tokenize every single operator. Even small languages can easily have dozens of keywords and operators, plus a number of different types of literals, many of which the programmer may have never used before (Bit shift? Octal literals?). This means writing the lexer will involve frequent references to a language specification they have no experience reading.

And that's just the lexer, easily the most straightforward part of the initial phases.

This seems like nothing to someone who has done it before but I assure you it is not for a novice. While none of it is supremely difficult, indeed many algorithms are much more difficult conceptually than any one piece of a lexer, there are a lot of little details that must be addressed and it's a lot of grunt work if you have no practice doing it.

If you read the Crenshaw tutorial, you'll see that he chooses a language that does not require much (any) work to lex. The language you learn to compile has no whitespace and uses single-character keywords and identifiers. This lets him delay lexing to chapter 7, but as you can see for yourself-- that chapter is pretty long-- 2300 lines and a lot of it is code.

asg · on Aug 26, 2011

I've not had a formal CS education. So the first time I had to write a mini compiler, I decided to do it by hand, for its educational value, rather than use a parser generator. Google got me Crenshaw's paper (Lets build a compiler), and I remember it was very simple to follow his code to write my own. So yes, much recommended.

kenkeenan · on Aug 26, 2011

Agreed. I don't have a CS background either and despite reading a few textbooks, I could never grok how one would write a parser for a language. After reading Crenshaw's paper, I was able to put together a quick & dirty scripting language for a text processing program I was writing.

Crenshaw's approach does have downsides; as noted in the article, there's no internal representation of the program so optimisations like short-circuit evaluation of logical operations are difficult, if not impossible...

praptak · on Aug 26, 2011

"I decided to do it by hand, for its educational value, rather than use a parser generator."

My experience was quite different. I am glad for having started with a parser generator - this got the tedious part out of the way so that I could think of how to generate code.

I later learned the theory behind various kinds of parsers, but I think that a beginner in compilers can safely leave it out.

perlgeek · on Aug 26, 2011

One of the very interesting lessons from Crenshaw's "Let's build a compiler" is that writing a simple, recursive descending parser isn't hard at all.

silentbicycle · on Aug 26, 2011

Niklaus Wirth has been saying that for quite a while. :)

His _Compiler Construction_ book is short, sweet, and has a free PDF online (http://www.inf.ethz.ch/personal/wirth/books/CompilerConstruc...). Uses Oberon.

WalterBright · on Aug 26, 2011

I find it easier to write a parser by hand than use a parser generator. For most languages (C++ is a notable exception) writing a lexer & parser is a weekend job, and is far and away the easiest part.

pja · on Aug 26, 2011

C++ looks like it ought to be OK to parse, but turns out to be completely evil. (flamebait mode on) Bit like the rest of the language really. (flamebait apologia: I spend most of my time coding C++; sometimes I even enjoy it...)

WalterBright · on Aug 26, 2011

Just lexing C++ is a major project, because of the 'phases of translation', the preprocessor, the preprocessor tokens, etc., and then trying to make it fast.

sb · on Aug 26, 2011

I agree and there used to be (maybe it's still the case) problems with parser generators if you wanted to have good error recovery and reporting to the user. It's also very telling that sometimes--contrary to what people might expect--parsers pose a substantial problem in production systems: http://cacm.acm.org/magazines/2010/2/69354-a-few-billion-lin...:

Law: You can't check code you can't parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

buff-a · on Aug 26, 2011

Fair point. But I have been in the situation where I've got a RDP that outputs what I originally wanted, but now I want, e.g., to output an intermediate form, and I'm fucked.

So writing it isn't all that hard. But changing it after you've written it is a recipe for disaster. I did that a couple of times and looked for A Better Way.

A tool like ANTLR, OTOH, makes re-purposing (and initial development and testing) easy, but at the expense of runtime performance.

Perhaps the lesson is, "write a few RDPs to get the hang of it and then make life easy on yourself"?

kenjackson · on Aug 26, 2011

The parser is the easy part about writing a compiler, unless you design a language that is difficult to compile. Optimizing the codegen is the hard part. That's where you have to think really carefully about throughput versus code quality.

Locke1689 · on Aug 26, 2011

I wouldn't call writing a parser easy, but I would call it relatively straight-forward. Parsing code can be quite messy. My master's advisor likes to say that writing a parser is a character-building exercise and nothing but.

By the way, if you decide to write a parser, learn about Packrat parsing[1] and ignore everything else (especially stuff from the Pascal days). Back when memory was scarce it was important to build multistage compilers that optimized for low memory usage (because you had to keep the structures in memory). Nowadays that's not even a concern and packrat parsing is simpler and will be faster than almost every cute technique from 1970.

[1] http://en.wikipedia.org/wiki/Packrat_parsing

timtadh · on Aug 26, 2011

Hate to disagree with you. But having written many parsers (and even a few parser generators) it really isn't that difficult if you understand the theory. People encounter problems usually because they don't understand the theory and don't know how to write a grammar.

I have never used a PEG based parsing system so I can't speak to it. The wikipedia page seems to be riddled with inaccuracies. For instance it strongly indicates that context free grammars often end up with unresolvable ambiguities eg.

> "The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar, this construct yields the classic dangling else ambiguity.)"

Nothing could be further from the truth. It is relatively easy to resolve an ambiguity arising from this structure. Indeed, Aho et all use it as an example for how to eliminate ambiguity from a language[pages 211-212 in the second edition. section 4.3.2 "Eliminating Ambiguity"].

Perhaps you could explain why you think PEG and Packrat are superior to CFG and standard parsing tools (eg. recursive descent or LALR and GLR parser generators).

Locke1689 · on Aug 26, 2011

Hate to disagree with you. But having written many parsers (and even a few parser generators) it really isn't that difficult if you understand the theory. People encounter problems usually because they don't understand the theory and don't know how to write a grammar.

From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.

Nothing could be further from the truth. It is relatively easy to resolve an ambiguity arising from this structure. Indeed, Aho et all use it as an example for how to eliminate ambiguity from a language[pages 211-212 in the second edition. section 4.3.2 "Eliminating Ambiguity"].

I agree that this is a bad example but you should remember from the literature that CFG is undecideably ambiguous. It's of course resolvable by GLR, but that's not the point.

For more information I'd recommend you read these papers, starting with Ford's master's thesis. [1] Also remember that I wasn't talking about using packrat vs. a standard GLR parser -- I was talking about what you should do if you want to learn how to implement a parser. Packrat is easier to get right than the corner cases in GLR.

[1] http://pdos.csail.mit.edu/~baford/packrat/

silentbicycle · on Aug 26, 2011

The LPEG paper (http://www.inf.puc-rio.br/~roberto/docs/peg.pdf) is also pretty good. (LPEG is a PEG parsing library for Lua.)

timtadh · on Aug 26, 2011

> For more information I'd recommend you read these papers, starting with Ford's master's thesis. [1] Also remember that I wasn't talking about using packrat vs. a standard GLR parser -- I was talking about what you should do if you want to learn how to implement a parser. Packrat is easier to get right than the corner cases in GLR.

Thanks for the link. I will definitely check it out.

> From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.

It is not clear what you think is annoying (eg. learning the theory or writing a parser). I take it from later on in the post you believe writing the parser is annoying. I don't think writing parsers is in general annoying, but there are some languages which are annoying to write a parser for.

I can't agree with your advisor's statement about it being only a "character building" exercise. There are times when you need to write a parser. However, I have to admit if I can get around it I will. For instance last year instead of writing a Java parser I hooked the OpenJDK parser and serialized out an AST. Much easier and a lot less code than a parser. I would always recommend doing something like this for an existing language. It is too easy to mess up a corner case in the grammar.

tl;dr : I think we mostly agree, I just haven't read up on PEG's and Packrat.

kragen · on Aug 26, 2011

My limited experience with PEG parsing is that, in its simplest incarnation, it's not all that fast. Packrat parsing supposedly actually makes it slower in the normal case (where you don't backtrack much).

I still highly recommend it, though. It's very simple indeed (my self-compiling PEG parser generator is 66 lines of code: https://github.com/kragen/peg-bootstrap/blob/master/peg.md) and very expressive.

Locke1689 · on Aug 26, 2011

The production ANTLR and GLR parsers are very fast, but I think packrat is just easier to get right if you're just starting out.

joeyespo · on Aug 26, 2011

Yes, you're correct.

I think the point of this though is to first enable people to write compilers without the overwhelming complexity. And only dive in deeper after they outgrow their toy language and its inefficiencies.

Goladus · on Aug 26, 2011

Just a note for anyone reading the Crenshaw book, in chapter 9 he notes:

    The C language is quite another matter, as you'll see.   Texts on
    C  rarely  include  a BNF definition of  the  language.  Probably
    that's because the language is quite hard to write BNF for.

There is now an excellent reference that includes a full BNF Grammar for C(C99): http://www.amazon.com/Reference-Manual-Samuel-P-Harbison/dp/...

(This book didn't exist when the tutorial was written)

russell · on Aug 26, 2011

IIRC C was not context free, a requirement for BNF. The problem was in variable declarations, probably structures, but I dont remember the details. (It may have been fixed by C99.) It is often possible to transform a grammar from context sensitive to context free by trickery in the lexer. Because of this I dont recommend C as a first compiler language, although it makes an excellent target language.

kragen · on Aug 26, 2011

The issue, as I understand it, is that you need to know whether an identifier in certain contexts is a type name or not. C99 increases the number of contexts. Consider:

    x * y;

If x is a type, this declares y as a pointer to elements of that type, if you're at the beginning of a block or in C99. If x is a variable, this is a multiplication in void context, unless you're at the top level of the translation unit. A somewhat more interesting example is

    x *f();

which may declare f or call it, depending on what x is. Of course, lacking the ability to overload the multiplication operator, there's no reason you'd ever do a multiplication in void context, but it is legal.

If your lexer can distinguish type names from other identifiers, the problem is solved, and you can use BNF from there.

6ren · on Aug 26, 2011

> The authors promote using dozens or hundreds of compiler passes, each being as simple as possible.

This kind of approach can seem wrong, because it's breathtakingly, disturbingly inefficient, but it's an excellent way to break down a problem, so you can see it, play with it and understand it. It's much easier to write an efficient version once you know what the hell you're doing.

onan_barbarian · on Aug 26, 2011

The issue of doing tiny passes is not so much efficiency or the lack thereof; it's that combining passes can allow for better decisions to be made. This can result in better code quality than could be achieved independently. The classic example is the combination of register allocation and scheduling - scheduling can benefit from knowing what the actual operations are (e.g. will this thing be a load or will it be in a register) while register allocation can benefit from not having scheduling make overly aggressive movements of code that generate more live ranges than can be successfully allocated to registers.

Lots of papers on this sort of topic - the one I remember was (probably outdated): http://portal.acm.org/citation.cfm?id=106986

Integrating register allocation and instruction scheduling for RISCs (1991) by D G Bradlee, S J Eggers, R R Henry

That being said, many quite successful compilers manage to get by with simpler passes, sometimes using heuristics or multiple passes or just plain trickery. It's nice to be able to keep things simple. Especially nice for pedagogy, as above.

Compilers are lousy with 'phase ordering problems' where there's no clear-cut superior order in which to make certain decisions...

angrycoder · on Aug 26, 2011

Regarding compiler passes in LLVM: Depending on the optimization level, different passes are run: for example at -O0 (no optimization) the Clang compiler runs no passes, at -O3 it runs a series of 67 passes in its optimizer (as of LLVM 2.8).

http://www.aosabook.org/en/llvm.html

Locke1689 · on Aug 26, 2011

That's slightly different. First, that's post generation of the IR -- some of the steps that you want to combine are pre-IR. Second, the idea behind LLVM's optimizer structure is that each optimization provides a single task so they can be combined modularly. This may not produce the fastest code or fastest compile time but the idea is that the benefit in code organization would provide benefits that wouldn't be seen in a more tightly knit organizational structure.

thesz · on Aug 26, 2011

>The authors promote using dozens or hundreds of compiler passes, each being as simple as possible. Don't combine transformations; keep them separate.

Actually, this way you will get an inferior compiler (optimizer).

http://lambda-the-ultimate.org/node/2387

The thesis referenced above argues (and provides examples) that you cannot overcome important problems by separating transformations. You need to combine them.

silentbicycle · on Aug 26, 2011

The nanopass paper uses so many passes for pedagogy, not efficiency. Of course it's inefficient, that's not the point. They're still easier to teach in isolation.

jeffdavis · on Aug 26, 2011

Are you talking about efficiency of the compiler or efficiency of the resulting code?

If it's just an inefficiency of the compiler, you can dismiss it. But if it's an inefficiency in the resulting code, then perhaps the many-pass design isn't adequate for teaching optimization.

thesz · on Aug 26, 2011

The thesis I was talking about is about generated code.

Author of thesis is one of the authors of Blitz++: http://www.oonumerics.org/blitz/

The thesis came from his frustration when seemingly innocent changes provoke significant degradation in performance.

skew · on Aug 26, 2011

> The thesis referenced above argues (and provides examples) that you cannot overcome important problems by separating transformations. You need to combine them.

The transformations need to be applied together, but they can be specified together. One of the many impressive things in that thesis is the framework which lets optimizations be independently specified, but applied together. Unfortunately, the code does not seem to be available.

For a more recent system that echoes that particular aspect and does come with code there is Hoopl http://www.cs.tufts.edu/~nr/pubs/dfopt-abstract.html

It's on hackage http://hackage.haskell.org/package/hoopl

joeyespo · on Aug 26, 2011

I get a 403 error when trying to follow Lambda's link to the paper. Any mirrors?

thesz · on Aug 26, 2011

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.39... - thesis in PDF format.

gruseom · on Aug 26, 2011

Bob Barton taught the guys who wrote Burroughs Fortran to write a compiler in one day. Check out the following story. (Lengthy, sorry. But worth it. Also, I edited it a bit.)

Phillips programmers still had a soft spot in their hearts for the Burroughs 205. So when it came time for them to buy another machine they said that they would buy a Burroughs 205 computer if the following conditions were met:

A. It had to have a compiler that would compile and execute existing IBM 650 Fortransit programs with no modifications whatsoever.

B. The compiler had to compile faster than IBM's Fortransit.

C. The time for loading the compiler and object programs had to be faster than IBM's.

D. The object programs had to run faster.

A call was placed to Bob Barton... Bob said that he could not spend any more effort on the 205. All of his budget was allocated for 220 projects. However, if John Hale would designate three people for the project, he would fly to Dallas for one day and teach them how to write a compiler.

When I heard that someone was flying in from Pasadena to show us how to write a compiler, I was very skeptical. I had seen many other so-called experts from Pasadena and I was invariably disappointed.

The day that Bob spent in Dallas was one of the most amazing days of my life. I am sure that I never learned so much in an eight hour period. We stopped briefly for coffee in the morning and went out for a quick lunch. We did not take a break in the afternoon. The day was intense to say the least. I took a cab to the airport to catch his plane. He talked all the way to the airport and was still shouting information to me as he walked up the steps to the plane and disappeared into it. He said that IBM had spent 25 man-years on Fortransit, but that the three of us could do the job in six months.

They ended up being two guys (not three) and doing it in nine months (not six). Of course, compilers were simpler back then. But they were also far less well understood. These guys hit every one of those crazy requirements and invented virtual memory in the process.

Edit: here is the part about virtual memory. They had to do it to meet requirement D.

The goal of executing object programs faster than the IBM 650 sounded like real trouble to Bob. Both systems had a drum memory. The drum on the 650 rotated at 12500 rpm compared to 3570 rpm on the 205. However, the 205 drum was more sophisticated. It was divided into two areas. The primary storage was 80 words of 0.85 millisecond average access memory. The secondary storage was 4000 words of 8.5 millisecond average access memory.

Bob said that it seemed to him that our only chance of meeting the object speed goal was to figure out an "automatic blocking algorithm". I did not hear the term "virtual memory" until several years later. By an automatic blocking algorithm, he meant moving segments of code from the secondary storage into the primary storage and executing it from there. Since the first goal was to compile existing programs without modification, I would have to do it without the programmer adding any flow directing statements to the programs.

Bob said that a lot of people in Pasadena had tried without success to implement automatic blocking, but I should not let that stop me from trying. I would be the first person to try it with a high-level language. The success of the project seemed to hinge on that algorithm.

During the course of the next two months I did discover the algorithm. The next time that I saw Bob was in the Pasadena Plant in April, 1960. He was in the process of cleaning out his desk... I described the algorithm to him and he became tremendously enthused. Frankly, I had not grasped the importance of the accomplishment.

http://news.ycombinator.com/item?id=2856567. The whole memoir is wonderful. I laughed, I cried. Ok, I didn't cry. But it's all kinds of inspiring awesome.

tptacek · on Aug 26, 2011

Read this in one big slurp last week on vacation. Best thing I read online all summer. One of the guys Barton taught to write a compiler discovered recursive descent parsing and the scanner/parser/codegen model --- which he had to go around marketing.

pella · on Aug 26, 2011

The Definitive ANTLR Reference: Building Domain-Specific Languages

http://pragprog.com/book/tpantlr/the-definitive-antlr-refere...

-- + --

Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages

"Learn to build configuration file readers, data readers, model-driven code generators, source-to-source translators, source analyzers, and interpreters. You don’t need a background in computer science—ANTLR creator Terence Parr demystifies language implementation by breaking it down into the most common design patterns. Pattern by pattern, you’ll learn the key skills you need to implement your own computer languages." http://pragprog.com/book/tpdsl/language-implementation-patte...

inoop · on Aug 26, 2011

ANTLR is great, but I also highly recommend xtext: http://www.eclipse.org/Xtext/. It generates a full lexer/parser, AST representation, does automatic cross-referencing, and even generates an Eclipse plugin for your language with syntax highlighting and code completion.

tonyedgecombe · on Aug 26, 2011

Although some of the examples in that book use ANTLR, they all cover the theory behind the use, it is much more than an ANTLR manual.

mavelikara · on Aug 27, 2011

I found Peter Sestoft's "Programming Language Concepts for Software Developers" an enjoyable read. The book incrementally builds two compilers - one for Micro-ML and one for Micro-C. The implementation language is F#.

http://www.itu.dk/courses/BPRD/E2010/plcsd-0_50.pdf http://channel9.msdn.com/Blogs/martinesmann/Teaching-program...

krishna2 · on Aug 26, 2011

Surprised that no one has yet mentioned EOPL ("Essentials of Programming Languages"). An awesome book imho.

motxilo · on Aug 28, 2011

That book is more directed towards interpreters.

Jun8 · on Aug 26, 2011

If one wanted to jump in and write one, what's a good language to write a compiler for, provided the resulting thing would be of some use to the developer (and maybe to others) and not a toy? Coffescript? Scheme? Or just write a parser for, e.g. Markdown?

andrewflnr · on Aug 26, 2011

Personally, I would make something up. If you want it to be useful, make it into an embeddable scripting language. But if you're doing this to learn, I wouldn't worry too much about making it "useful", at least not the first draft.

This is the approach I'm taking; I got into compilers because I had ideas about languages. YMMV

wglb · on Aug 26, 2011

This is good advice, but Not surprisingly, the opaqueness of these books has led to the myth that compilers are hard to write. Having written a couple, I can assure you that writing a compiler is hard. But that is why it is worthwhile.

edtechdev · on Aug 26, 2011

In my opinion, the best way is to study and contribute to an existing compiler.

There are dozens to choose from that compile to javascript: https://github.com/jashkenas/coffee-script/wiki/List-of-lang...

or java's JVM: http://en.wikipedia.org/wiki/List_of_JVM_languages

or .net/mono's CLI: http://en.wikipedia.org/wiki/List_of_CLI_languages

digitalbanana · on Aug 26, 2011

i had to put together something resembling a compiler for a class, i used mainly two sources:

for the theory part: http://www.diku.dk/hjemmesider/ansatte/torbenm/Basics/ (Basics of Compiler Design - Free book)

actualy put something together: http://www.dabeaz.com/ply/ (PLY (Python Lex-Yacc))

maybe not the solution for real world use, but helped me jump past some nitpicking parts with the C / lex / yacc implementation.

trurl · on Aug 26, 2011

I learned more about modern compilation by reading and implementing "From System F to Typed Assembly Language" than I did from the compilers course I took.

http://www.cs.princeton.edu/~dpw/papers/tal-toplas.pdf

silentbicycle · on Aug 26, 2011

I'd guess that most people who are completely new to compilers don't know how to read denotational semantics, either. Handing that paper to them would just reinforce the "compilers are perceived to be magical artifacts" view that Crenshaw is trying to dispel.

He intentionally avoid Greek letters, let alone denotational semantics. I don't doubt that it's a good paper, but the people Crenshaw is addressing probably wouldn't be able to read it (yet).

trurl · on Aug 26, 2011

The title of the post is "Want to Write a Compiler? Just Read These Two Papers." I upped the ante by proposing they just read one better paper.

I think if you had actually bothered to read the paper rather than make assumptions you would find the word "denotation" doesn't even appear in the paper.

silentbicycle · on Aug 26, 2011

I think most people would look at the top of page 7, go "eek!", and leave with the idea that compilers are insurmountable magic. That's all.

For me, being able to read that stuff came much later than learning about compilers. (I haven't read that specific paper yet. Sent it to my kindle, though.)

I could similarly recommend some papers on compiling via CPS, but the people Crenshaw is addressing would just find them intimidating - they're still getting to grips with the basic techniques and terminology.

plaes · on Aug 26, 2011

Hum.. can someone tell him that he spelled Knuth wrong :S

telemachos · on Aug 26, 2011

The Crenshaw tutorial uses Turbo Pascal. Any recommendations for Pascal on modern Macs?

FraaJad · on Aug 26, 2011

FreePascal. http://en.wikipedia.org/wiki/Free_Pascal

It even comes with a console IDE very much similar to Turbo Pascal. Not tried on a Mac though. However, there is the GUI IDE called Lazarus.

jc-denton · on Aug 26, 2011

I strongly disagree with "Learn a parser toolkit". If you want to do it yourself you should write also a parser yourself.

compay · on Aug 26, 2011

Want to know Modern Philosophy? Just read Kant's "Critique of Pure Reason." After all, how hard could one book be?

zerd · on Aug 27, 2011

I would like to recommend the LLVM tutorial too: http://llvm.org/docs/tutorial/LangImpl1.html

It takes you from zero to codegen in a few tutorials, using Recursive Descent Parsing and Operator-Precedence Parsing.