It's always great to see more people writing their own compilers - a "proof by example" that compilers do not have to be magical and mysterious special software that only a few experts can write. The theory may be deep, but in practice getting a basic working compiler doesn't require all that much.
I notice you're using a recursive-descent parser, which I think is a good idea because it can actually reduce complexity compared to one made with a parser generator; in fact it's simple enough that even beginners who have just grasped recursion should be able to understand and write one. Once you have a basic parser that can evaluate arithmetic expressions, extending it to generate AST nodes and parse a (mostly) simple language like C is not so difficult. If I remember correctly, at least one college has an expression evaluator as a first-year CS assignment related to recursion. It's interesting that one of the sections of K&R, on reading and writing C declarations, even contains a simple parser for them --- I'm not aware of any introductory books for other languages which contain small pieces of their implementation.
That said, you can use the "precedence climbing" technique[1] to collapse most of the recursive functions for the expression operators into one function with a succinct loop and a table, making your parser even simpler and faster. C4[2] is an example of this technique, but with a series of if-else statements in the body of the function instead of a table, and I think it's worth some careful study just for its mindblowingly awesome ridiculous simplicity.
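For anyone who hasn't seen the technique, here's a minimal, self-contained sketch of precedence climbing (nothing below comes from the article's compiler or from C4; the toy evaluator, its precedence table and the input string are all made up for illustration):

    /* Precedence climbing: one loop and a precedence table instead of one
       recursive function per precedence level. */
    #include <ctype.h>
    #include <stdio.h>

    static const char *src = "1+2*3-4/2";

    static int prec(char op) {
        switch (op) {
        case '+': case '-': return 10;
        case '*': case '/': return 20;
        default:            return -1;   /* not a binary operator */
        }
    }

    static long parse_primary(void) {     /* numbers only, to keep it short */
        long v = 0;
        while (isdigit((unsigned char)*src))
            v = v * 10 + (*src++ - '0');
        return v;
    }

    static long parse_expr(int min_prec) {
        long lhs = parse_primary();
        while (prec(*src) >= min_prec) {
            char op = *src++;
            /* left-associative: the right operand only absorbs operators
               that bind more tightly than op */
            long rhs = parse_expr(prec(op) + 1);
            switch (op) {
            case '+': lhs += rhs; break;
            case '-': lhs -= rhs; break;
            case '*': lhs *= rhs; break;
            case '/': lhs /= rhs; break;
            }
        }
        return lhs;
    }

    int main(void) {
        printf("%s = %ld\n", "1+2*3-4/2", parse_expr(0));   /* prints 5 */
        return 0;
    }

The same loop-plus-table shape works unchanged if you build AST nodes instead of evaluating on the spot; right-associative operators just recurse with prec(op) instead of prec(op) + 1.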
C is notoriously painful to parse with a parser generator because it's not context free --- consider this statement:
foo * bar;
It can't be parsed until you know whether foo is a type or a variable --- indeed, depending on the implementation, it can't even be tokenised correctly. So you need hooks from the symbol table back into the tokeniser/parser. I'm not at all surprised that the author's using a hand-written parser, as it reduces the problem space considerably.
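To make it concrete, here are two made-up snippets in which the exact same statement means completely different things depending on what foo is:

    void declares_a_pointer(void) {
        typedef int foo;
        foo * bar;          /* declaration: bar has type int * (may warn: unused) */
    }

    void multiplies_two_ints(void) {
        int foo = 6, bar = 7;
        foo * bar;          /* expression statement: 6 * 7, result discarded */
    }

A parser that only sees IDENTIFIER '*' IDENTIFIER ';' cannot tell these apart; it needs the declaration that's in scope.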
(There's a really good article, which I've been completely unable to find again, that goes into detail about all the weird parsing edge cases due to the rules about exactly when a word goes into the symbol table...)
C is actually quite easy to parse, because it reduces to a LALR(1)-parseable context-free language after the semantic mapping of input symbols into abstract lexical categories. In fact, it can be parsed by recursive descent with a suitable factoring, so it is almost certainly LL(k), possibly with k = 1. Pardon me, k == 1. :)
That mapping requires semantic information (type declarations), but that information is something the compiler must maintain anyway so that it can type check the language.
Yes, if we simply treat "foo" as an IDENTIFIER token and feed that to the parser, things will be difficult. But we don't; what we return to the parser is a TYPESPEC token, based on peering into the current parse-time lexical environment, where we resolve "foo" to a typedef name. The parser then quite easily goes through the state transitions corresponding to its declaration rule.
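Here's a rough sketch of that reclassification step; the token kinds and the toy typedef table are invented for illustration, not taken from any real compiler:

    #include <stdio.h>
    #include <string.h>

    enum tok_kind { TK_IDENTIFIER, TK_TYPESPEC };

    /* pretend the parser already recorded "typedef int foo;" here */
    static const char *typedef_names[] = { "foo" };
    static const int n_typedefs = 1;

    static int is_typedef_name(const char *name) {
        for (int i = 0; i < n_typedefs; i++)
            if (strcmp(typedef_names[i], name) == 0)
                return 1;
        return 0;
    }

    /* called by the lexer once it has scanned an identifier's spelling */
    static enum tok_kind classify_identifier(const char *spelling) {
        return is_typedef_name(spelling) ? TK_TYPESPEC : TK_IDENTIFIER;
    }

    int main(void) {
        printf("foo -> %s\n", classify_identifier("foo") == TK_TYPESPEC
                                  ? "TYPESPEC" : "IDENTIFIER");
        printf("bar -> %s\n", classify_identifier("bar") == TK_TYPESPEC
                                  ? "TYPESPEC" : "IDENTIFIER");
        return 0;
    }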
Sure, a parser which maintains parse-time lexical environments which are also visible to the lexer is slightly harder to write than one which doesn't have such stuff (and just returns an unadorned AST which you can walk again to build that info).
But if the year is 1973, you would likely do that anyway even if the language doesn't require it. Why? Because building the compile-time environment tree reduces the number of passes: from a single pass through the raw code, you already have the syntax tree with type info (and possibly even type-checked).
The context-free terminology doesn't apply to C because the ambiguity rests in not knowing what is the lexical category of foo in "foo * bar". That information comes from semantics.
A language whose semantics you have to understand to understand its syntax is outside of the entire formal language realm inside which we formally recognize a "context-free" category. That formal language realm is purely syntactic: the terminal symbols are just "dumb" and stand for themselves. They are subject to rules which indicate how those symbols are combined.
C is "quasi context free" in that if we resolve what the raw tokens mean (which one is a type and which one isn't), then the resulting language is context-free. In fact stronger than context free: amenable not only to LALR(1) parsing approaches but to recursive descent, with a suitable refactoring of the left-recursive rules given in the standard.
Your post doesn't really agree with what I was taught.
> the ambiguity rests in not knowing what is the lexical category of foo in "foo * bar".
Right, so there are 2 possible parse trees. Being able to disambiguate them with more information doesn't mean it's not context-free. Given just a CFG, you were still able to determine that you have a syntactically valid C program.
> A language whose semantics you have to understand to understand its syntax is outside of the entire formal language realm inside which we formally recognize a "context-free" category.
A formal language is just the set of strings that belong to it. That's it. An absurd example might be "the set of strings corresponding to valid C programs that do not terminate". To recognize such a language, you would not only have to understand the semantics of C, you would even need to solve the halting problem, but that doesn't make it "outside of the entire formal language realm".
> A formal language is just the set of strings that belong to it. That's it.
That is correct, and in that framework we cannot reason about the meaning of "foo": if it's a type, apply this rule, otherwise that rule.
If we do that reasoning, what we're really doing is replacing "foo" with one of two other symbols, and then parsing those; then we have different strings (so of course the parse trees cannot be the same, no matter what).
We don't need to decide to apply this rule sometimes but other times that rule. When we come to an ambiguous string, we apply both rules nondeterministically. Here's a pseudo-grammar that I think describes both possibilities.
Statement -> Type Op Var ";"
Statement -> Var Op Var ";"
Op -> "*"
I think the disconnect here is that most lexers would insist on feeding a different token to the parser depending on whether "foo" is a Type or a Var, and choosing one or the other means the lexer might feed it bad information. The pragmatic way to implement this in a real compiler is to add some logic to the lexer so that it's working with more information than just the syntax. Theoretically though, all that really matters is that both possibilities are syntactically valid. So another way to implement it would be to pass on both possibilities, letting the parser eliminate the one that doesn't compile.
Indeed. There is a rule that Type generates Identifier, which can generate an example like foo. Likewise, Var generates Identifier, which generates foo. But these generations are not purely grammar rules; Type can only generate foo if there exists a declaration in the semantic space. If we regard it as purely a grammar rule, then we have a straightforward ambiguity in a context-free language. It is context-free simply because the rules are all of the form one_sym -> zero_or_more_syms.
If C were parsed this way, nondeterministically, ultimately the ambiguity would be resolved by looking up the type info anyway. (The interesting possibility exists, though, that in some cases the type info could be inferred, based on how the declared identifier is used.)
> you need hooks from the symbol table back into the tokeniser/parser
Not necessary. You can add a bit of information to each identifier token: is this the name of a type or not?
You have to split up one of the parser passes, so if you had a two-pass "tokenizer" + "LR parser" compiler, now you have a tokenizer pass followed by two LR parser passes. The first LR pass is "last mile tokenization". I guess another name for it would be a (very short) cascaded tokenizer/parser.
The intermediate pass adds a piece of metadata to each token, marking it as the name of a type or not (this makes the language actually context free instead of just nearly context free). If you do this, then the second pass doesn't see an ambiguous pair of identifiers at all, because it knows which identifiers refer to types and which do not.
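Here is a hedged sketch of what such an intermediate pass could look like; the token layout and the typedef detection are invented and grossly simplified (a real pass has to track scopes and full declarator syntax), but it shows the idea of attaching an is_typename bit before the second parsing pass runs:

    #include <string.h>

    enum kind { T_TYPEDEF_KW, T_IDENT, T_SEMI, T_OTHER };

    struct token {
        enum kind   kind;
        const char *spelling;     /* only meaningful for T_IDENT */
        int         is_typename;  /* filled in by this pass */
    };

    void mark_typenames(struct token *toks, int n) {
        const char *names[64];
        int nnames = 0;

        for (int i = 0; i < n; i++) {
            if (toks[i].kind == T_TYPEDEF_KW) {
                /* simplification: the last identifier before the ';' is
                   taken to be the newly declared type name */
                const char *last = 0;
                for (int j = i; j < n && toks[j].kind != T_SEMI; j++)
                    if (toks[j].kind == T_IDENT)
                        last = toks[j].spelling;
                if (last && nnames < 64)
                    names[nnames++] = last;
            }
            if (toks[i].kind == T_IDENT)
                for (int k = 0; k < nnames; k++)
                    if (strcmp(toks[i].spelling, names[k]) == 0)
                        toks[i].is_typename = 1;
        }
    }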
It's a different application (computing scope data instead of type data), but I've implemented the basic concept of annotating token data with the result of an intermediate parsing stage; check out line 318 here:
There are also readable code bases like tinyc or this one.
However, these always use a simple template-based approach to code generation and stop there.
Of course there are real-world compilers (llvm, gcc, ...). However, because of their size and complexity it's hard to learn from them.
What I am interested in would be going from:
"I have an AST and for this node type I output this assembly."
to
"I translate the AST to an IR. Transform an the IR in multiple passes and finally output decent asm."
That is, something that talks about CFG analysis, register allocation or maybe SSA form? I have read papers on SSA and sea-of-nodes IRs, but these of course always assume that you already know how and why to use them. A more approachable text/github repo that shows how to take the step from template-based code gen to more advanced techniques would be great.
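Roughly, what I mean by the second kind is the shape in this toy sketch (entirely made up, not based on any of the compilers mentioned here): lower an expression to a small three-address IR, run one pass over it (constant folding), and only then print something assembly-like, instead of printing during the AST walk:

    #include <stdio.h>

    /* each instruction defines temporary t<i>:
         'c'      ti = constant a
         'x'      ti = load of the variable x (stand-in for a memory load)
         '+','*'  ti = t[a] op t[b]                                        */
    struct insn { char op; long a, b; };
    static struct insn ir[64];
    static int nir;

    static int emit(char op, long a, long b) {
        ir[nir] = (struct insn){ op, a, b };
        return nir++;
    }

    int main(void) {
        /* "lowering": what an AST walk would produce for (2 + 3) * x */
        int t0 = emit('c', 2, 0);
        int t1 = emit('c', 3, 0);
        int t2 = emit('+', t0, t1);
        int t3 = emit('x', 0, 0);
        emit('*', t2, t3);

        /* a pass over the IR: fold ops whose operands are constants
           (a second pass could then delete the now-dead constants) */
        for (int i = 0; i < nir; i++) {
            struct insn *in = &ir[i];
            if ((in->op == '+' || in->op == '*') &&
                ir[in->a].op == 'c' && ir[in->b].op == 'c') {
                long v = in->op == '+' ? ir[in->a].a + ir[in->b].a
                                       : ir[in->a].a * ir[in->b].a;
                *in = (struct insn){ 'c', v, 0 };
            }
        }

        /* "code generation": one line of pseudo-assembly per instruction */
        for (int i = 0; i < nir; i++) {
            switch (ir[i].op) {
            case 'c': printf("t%d = %ld\n", i, ir[i].a); break;
            case 'x': printf("t%d = load x\n", i); break;
            default:  printf("t%d = t%ld %c t%ld\n", i, ir[i].a, ir[i].op, ir[i].b);
            }
        }
        return 0;
    }

Real IRs (SSA, sea of nodes) are of course far richer, but the shape is the same: an explicit data structure you can rewrite in independent passes before any assembly is emitted.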
(I hope it is ok to ask this here. Did not want to hijack the thread.)
After grinding through a bunch of compiler books I have to mention "Modern Compiler Implementation in C" by Andrew W. Appel as one that stands out: http://www.amazon.co.uk/dp/0521607655
Another space you can look for interesting things (not necessarily optimising compilation) is the compilation of functional programming languages. They usually take a slightly different approach compared to imperative languages, though there's clearly a lot of overlap. I really enjoy reading in an old book called The Implementation of Functional Programming Languages from 1987, which is available online: http://research.microsoft.com/en-us/um/people/simonpj/papers...
It's a small (a few thousand lines), rather neat, weakly optimising compiler for a subset of OCaml, written in OCaml. It has more than one backend, a few optimisations, and a register allocator. No SSA form or 'advanced' stuff, though.
Someone I know is writing a tiny llvm clone in C. It already can do integer operations faster than gcc -O1.
It is currently closed source but it shouldn't be hard for my compiler to target his IR in the future. His backend is currently around 6k lines of code so should be relatively easy to understand, though understanding some of the concepts is harder overall.
You might want to have a look at sbcl (Steel Bank Common Lisp) along with its documentation and the various articles leading up to/documenting its current design. It's a real, and therefore complex, system - but I recall reading some rather approachable articles on it.
I don't really know about that many other real, yet simple(ish) systems. Maybe Pharo Smalltalk, luajit, pypy, gnu guile (latest version)?
pcc is simple, but very old school and a bit contorted.
libfirm is small and clean and generates excellent code. (Despite the name, it's a complete ANSI C compiler, as well as being an LLVM-light.)
The ACK is easy to port and is a complete standalone toolchain including compilers, assemblers and linkers, but (at least on modern processors) produces code so bad that it will make you want to stab yourself in the eyes to make the pain go away.
There's a look-but-don't-touch licensed compiler called vbcc which is by far the best C compiler I've ever seen; small, easy to understand, easy to port, and produces excellent code... but the author doesn't want to release it under an open source license.
nwcc: small, slightly unfinished, looks like it's stalled.
sparse: technically this isn't a compiler, it's a code linter --- but it contains a complete ANSI C front end and a reasonable amount of a backend, and can be made to generate code without much effort. I've used it to compile C to Java and Perl. There's no optimisation layer and it's a little buggy (it's not really intended as a compiler).
libfirm doesn't really qualify as small to me; it is currently 132,638 lines of code. The cparser alone is 11k lines of code (5 times the size of my C parser).
That's true. I was rather thinking of it in comparison to LLVM and gcc, which are both space-distorting behemoths. (The catchphrase I've seen is that you can compile libfirm in less time than it takes to run the gcc configure script!)
The GCC extensions are just that - they're not part of the standard. You can claim 100% standards conformance without implementing a single one of them.
That said, in practice a large amount of C code out there is not pure standard C, so if you want to be able to compile something "interesting" like e.g. the Linux kernel you'll need to implement some of them.
Tcc is far faster. I have done some benchmarks and tcc is really incredible. He was just optimising for a different goal I think.
edit:
Compiling my parse.c file, tcc is 34 times faster than gcc -O0 and my compiler is 3-4 times faster than gcc -O0. I should make a post with the actual stats.
It's cute how he compares his "for loop parser", which handles C for-loops, with Clang's, which handles C++ stuff including range-based for loops - to show how his code is simpler :) Silly silly Clang developers for making their code too complex!
You felt the need to make this comment, and that shows you don't have confidence that average people will be able to work that out just from looking at the code - which was my point all along.
Of course I know clang handles more, it is also designed as a library for other tools. The fact that clang mixes C and C++ parsing is not a merit, more like a necessary evil. This doesn't contradict my point on code clarity at all.
> It gives me a funny sense of pride that my compiler can now be used to improve itself.
Only with difficulty, because your compiler is written in a language that is poorly suited for writing compilers.
How about using a nice high-level language whose implementation is partially written in C to write a C compiler in that nice high-level language, which then compiles the partially-written-in-C component of the implementation of that high-level language?
You would be surprised how densely compressed the logic for a C compiler is, even in C code. You don't actually gain too much by switching to Haskell or OCaml (I have done experiments).
[1] https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing
[2] https://news.ycombinator.com/item?id=8558822