I've been sort of working on a compiler side project for years, and my observation of the materials out there is that they all seem to get caught up in bikeshedding. Parsing just isn't that hard of a problem, and with a vague understanding of what a recursive descent parser is, just about any reasonably-skilled developer can figure out how to write a good-enough parser. EBNF, parser generators, etc. are just noise, yet this is where every compiler tutorial I've come across spends most of its time.
The real meat of any compiler is generation, and that's hard, starting with the fact that if you're new at this you probably don't know assembly. And assembly is death by a thousand cuts: sure, you can probably figure out how to do loops and ifs pretty quickly, but the subtleties of stack allocation, C-style function calling convention, register allocation, etc. all add tiny bits of "messiness" that each add a little bit of complexity to your code. Managing that complexity is the real challenge, and I haven't come across a compiler course that really addresses this.
I couldn't agree more with this, it's very frustrating. I recently worked through most of the "make a lisp" guide [0], and it was an excellent introduction to a simple lisp interpreter. But now I want to try my hand at a compiler (either producing machine code or maybe custom VM bytecode), and the lack of resources for that is bewildering. Just as you say, I've found that most online texts on compilers seem to spend most of their time on parsing, which is something I already feel relatively comfortable with.
Does anyone know of a good resource on how to transform an AST to machine code? I've skimmed Crafting Interpreters and the Bytecode VM section seems to spend an inexplicable amount of time on not only parsing, but minutia having nothing to do with compilers (e.g. writing a custom dynamic array implementation in C).
I don't have a good resource, but having just finished a small C-compiler myself[0] after previously getting stuck on the codegen part, I have a few suggestions:
Keep it simple and focus on making it composable. By that I mean if you're targeting x86, keep the current value in [r|e]ax (forget about non-integer stuff for now :); if it's a stack-based VM, focus on the stack top. A numeric literal loads [r|e]ax (or pushes the value onto the stack). A binary operator first computes the lhs, pushes the result, computes the rhs, then combines the two and leaves the result in [r|e]ax (or on the stack top).
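To make that concrete, here's a minimal sketch of the "everything leaves its result in [r|e]ax" scheme, emitting x86-ish text for a toy expression tree. The node shapes (`("num", n)`, `("add", lhs, rhs)`, etc.) are invented for illustration, not any real compiler's IR:

```python
def gen(node, out):
    """Emit assembly text for `node`; the result always ends up in eax."""
    kind = node[0]
    if kind == "num":
        out.append(f"mov eax, {node[1]}")
    else:  # ("add", lhs, rhs), ("mul", lhs, rhs), ("sub", lhs, rhs)
        _, lhs, rhs = node
        gen(lhs, out)              # lhs result now in eax
        out.append("push eax")     # spill it while we compute the rhs
        gen(rhs, out)              # rhs result now in eax
        out.append("pop ecx")      # lhs back, in ecx
        if kind == "add":
            out.append("add eax, ecx")
        elif kind == "mul":
            out.append("imul eax, ecx")
        elif kind == "sub":
            # careful: eax holds the rhs, ecx the lhs
            out.append("sub ecx, eax")
            out.append("mov eax, ecx")
    return out

# 1 + 2 * 3
asm = gen(("add", ("num", 1), ("mul", ("num", 2), ("num", 3))), [])
```

It's wasteful (every operand round-trips through the stack), but every node composes the same way, which is exactly what keeps the complexity down at first.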
Output to some kind of text format and lean on existing tools. It's probably not a bad idea to output C, LLVM IR, JVM/.NET bytecode or x86 assembly at first rather than going all the way to machine code starting out.
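The "output C" version of this advice can be this small: render the expression tree to a C source string and let a real compiler handle codegen, linking, and everything else. Again the tree format is made up for the sketch:

```python
def to_c(node):
    """Render a toy expression tree as a C expression string."""
    if node[0] == "num":
        return str(node[1])
    op = {"add": "+", "mul": "*", "sub": "-"}[node[0]]
    return f"({to_c(node[1])} {op} {to_c(node[2])})"

def emit_program(tree):
    """Wrap the expression in a complete C program that prints it."""
    return (
        "#include <stdio.h>\n"
        "int main(void) {\n"
        f"    printf(\"%d\\n\", {to_c(tree)});\n"
        "    return 0;\n"
        "}\n"
    )
```

From there, `cc out.c && ./a.out` gives you a working "backend" on day one, and you can swap in real assembly emission later without touching the front half.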
Postpone thinking about optimizations and reduce your level of ambition. All of my previous attempts at finishing compiler projects failed because the complexity got out of hand. For instance in a C compiler don't try to optimize expressions of smaller-than-int types. Maybe all expressions in your language can be calculated using doubles?
Finally as GP says: You need to be comfortable with the output language (assembly or otherwise). Study the output of similar compilers and translate code by hand into assembly to get a feel of what your compiler should be doing.
Have you actually done this? Do you really think we haven't?
The Appel book in particular is very good, but it only covers assembly generation at a high level. None of the examples target a real-world architecture.
I don't know what you have done. The red dragon book goes over targeting an intermediate form that's basically assembly language, and it covers register allocation in depth. As far as I can tell, most of the material after that is in research papers and real-world examples (https://c9x.me/compile/)
My university compiler course focused on the theory of grammars and token parsing - when it came to developing a compiler for the practical demonstration part of the course the code generation aspect seemed really poorly documented. In the end I used LLVM. Code generation is still the part about compilers I understand the least.
Compiler courses do spend way too much time on parsing, but there is real value in learning it. An LR(k) or LL(k) grammar is guaranteed to be unambiguous, so knowing how to construct one is the best way to design a language, even if parser generators are rarely used to actually build parsers. Furthermore, the industry standard to describe the syntax of a programming language is some form of BNF grammar, with semantics usually given by prose explanations of the operational semantics for every production in the grammar.
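As an illustration of the unambiguity point: the classic grammar `E -> E + E | E * E | num` gives "1 + 2 * 3" two parse trees, while the stratified LL(1) form below gives exactly one, with precedence encoded in the layering. This is a minimal recursive-descent sketch of that grammar, not drawn from any particular course:

```python
# expr   -> term   { "+" term }
# term   -> factor { "*" factor }
# factor -> num
import re

def parse(src):
    toks = re.findall(r"\d+|[+*]", src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def factor():
        nonlocal pos
        tok = toks[pos]; pos += 1
        return ("num", int(tok))

    def term():
        nonlocal pos
        node = factor()
        while peek() == "*":          # '*' binds tighter: handled in the
            pos += 1                  # inner layer of the grammar
            node = ("mul", node, factor())
        return node

    def expr():
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("add", node, term())
        return node

    return expr()
```

Each grammar layer becomes one function, so the BNF and the parser stay in one-to-one correspondence — which is the practical payoff of designing the grammar first.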
> starting with the fact that if you're new at this you probably don't know assembly
Most university curricula put assembly programming at a sophomore-level course (generally as part of an intro to computer architecture course), while compilers are a senior-level course. As a result, most compiler courses will assume a background knowledge of assembly.
> Compiler courses do spend way too much time on parsing, but there is real value in learning it. An LR(k) or LL(k) grammar is guaranteed to be unambiguous, so knowing how to construct one is the best way to design a language, even if parser generators are rarely used to actually build parsers.
So? Run your recursive descent parser; if it works, it works. There might be edge cases: when you come across them, fix them. Perl has demonstrated that a language doesn't even have to be decidable to be successful. This is a useful tool for getting advanced computer science degrees, not so much a useful tool for writing compilers.
> Furthermore, the industry standard to describe the syntax of a programming language is some form of BNF grammar, with semantics usually given by prose explanations of the operational semantics for every production in the grammar.
Moreover, this documentation isn't even necessary for a lot of compilers. If you don't have a working compiler, nobody cares if you have a BNF. If you don't plan to submit your language to a standardization body, which isn't going to accept it for at least a decade anyway, you probably don't need a BNF.
Have you ever actually learned the syntax of a language by looking at a BNF? Or have you learned from examples, like almost[1] every[2] tutorial[3] in[4] existence[5]?
And what's more, I learned BNF well enough to describe a grammar to anyone who needs to know it in about 30 minutes of class time in college. So even if you somehow get to the point where you're submitting your language to ANSI for standardization, you can go back and learn BNF then.
> Most university curricula put assembly programming at a sophomore-level course (generally as part of an intro to computer architecture course), while compilers are a senior-level course. As a result, most compiler courses will assume a background knowledge of assembly.
I took those classes. At my college it was two semesters. We learned a very small subset of MIPS that would be insufficient even to write a compiler targeting MIPS, let alone a compiler targeting x86. Books like The Art Of Assembly Language give you enough x86 assembler, but there's still a gap between understanding x86 and generating x86. And you'll have to understand C semantics to talk to the rest of the world in code, which means you get to dig into the C standard and also into some implementations of that standard.
Well, you got me on one line of a five-paragraph post! Congrats.
How did that work out for you? Were you able to write meaningful programs just by seeing what text was matched by the parser, without knowing what that text did?
You asked if I learned the syntax from the BNF, not if I learned the semantics from it. Semantic information comes from the prose description of how each production is to be semantically interpreted, although many specifications are very lacking in thoroughness here.
On any project with more than one person (and arguably even many projects with a single person, for the you of today is not the same you of yesteryear), communication is absolutely essential. And if what you're describing is a language, then the BNF for the syntax and corresponding prose for semantics is the most effective way we know of to document and communicate that language. If instead you try to wing it and rely on examples to communicate the specification, then the resulting mess will take you several times as long to clean up as properly designing the documentation would have taken in the first place--and this I speak from bitter experience.
P.S. I didn't engage with the rest of your post because I wasn't interested in writing an essay to rebut your points.
> You asked if I learned the syntax from the BNF, not if I learned the semantics from it.
Actually, I didn't ask you anything--it was a rhetorical question, within the context of a post you ignored.
> On any project with more than one person (and arguably even many projects with a single person, for the you of today is not the same you of yesteryear), communication is absolutely essential. And if what you're describing is a language, then the BNF for the syntax and corresponding prose for semantics is the most effective way we know of to document and communicate that language. If instead you try to wing it and rely on examples to communicate the specification, then the resulting mess will take you several times as long to clean up as properly designing the documentation would have taken in the first place--and this I speak from bitter experience.
BNFs might be the best way to communicate the syntax of a language to another person who needs to know the syntax at that level of specificity, but the fact that mainstream compilers often don't have widely available BNFs shows that a BNF simply isn't a vital part of compilation. Generating real target assembly is a vital part of compilation: if it doesn't generate to a useful target, it literally isn't a compiler.
I'm not against learning BNFs. They're useful. My point, from the beginning, has been that while assembly generation is vastly more complicated and important than parsing, the vast majority of language materials focus on parsing, giving only cursory coverage to generation.
There are literally detailed tutorials on how to use specific tools to do specific steps of parsing. Meanwhile, generation is covered, if at all, by generating partial pseudo-assembly for fictional architectures. Even if you manage to translate the pseudo-assembly into x86, good luck finding any guidance on how to link in a garbage collector, or properly tag the generated code with metadata for debugging or exception reporting, or how C calling conventions work so you can even implement exceptions on the stack... I don't want to hear about your experience with communicating about easy stuff like parsers until you've worked on a difficult problem.
> P.S. I didn't engage with the rest of your post because I wasn't interested in writing an essay to rebut your points.
If you don't feel like participating in the conversation, fine, just go away. Don't take cheap shots at out-of-context sentences where it's convenient, and pretend that you're not responding to the rest because it's beneath you.