I love c4. The author writes very readable and simple code. I recommend trying out swieros as well “a tiny Unix like kernel”: https://github.com/rswier/swieros
Instead of just gazing at the sea of terse variable names in awe, try actually reading it! There are only about a dozen variables and they're all documented at the top. The code itself is amazingly clear for what it does, and there's not a lot of gratuitous cleverness.
For example, just picking a random segment, you don't have to squint very hard to see that this is a number literal parser:
    else if (tk >= '0' && tk <= '9') {
      if (ival = tk - '0') { while (*p >= '0' && *p <= '9') ival = ival * 10 + *p++ - '0'; }
      else if (*p == 'x' || *p == 'X') {
[...goes on to handle the hexadecimal case...]
(aside - I love the conversion from string to decimal by subtracting the string value of '0', as this will work for any text encoding where the decimal digits are monotonic and contiguous - so ASCII and EBCDIC at least...)
> I love the conversion from string to decimal by subtracting the string value of '0'
That's how everyone did character arithmetic since forever, though, especially with the letters. Wouldn't be surprised if it's in K&R. And it probably became a subtle source of errors when environments changed, as mixing semantics and implementation tends to do.
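For anyone who hasn't seen the idiom, here is a minimal sketch of the digit arithmetic being discussed. `parse_decimal` is a made-up helper for illustration, not code from c4. The C standard does guarantee that '0' through '9' have consecutive, increasing values, so this part is portable. No such guarantee exists for letters, which is where the same trick can go wrong (EBCDIC letters are not contiguous).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from c4): parse a run of decimal digits
 * starting at s, store the value in *out, and return a pointer to
 * the first character that is not a digit. No overflow checking. */
const char *parse_decimal(const char *s, long *out)
{
    long value = 0;

    assert(s != NULL && out != NULL);

    /* '0'..'9' are guaranteed contiguous in C, so (*s - '0')
     * is the numeric value of the digit in any conforming charset. */
    while (*s >= '0' && *s <= '9') {
        value = value * 10 + (*s - '0');
        s++;
    }

    *out = value;
    return s;
}
```

Calling `parse_decimal("1234x", &v)` leaves 1234 in `v` and returns a pointer to the 'x'.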
A bit terse perhaps, but the only part that goes beyond simple tokenization/parsing and direct output is the precedence climbing used for parsing expressions. That may cause some head-scratching for anyone who hasn't implemented it before, but it's easy to read up on the algorithm on Wikipedia. Anyway, even using that ultimately makes the code more readable than trying to implement the same with recursive descent.
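If precedence climbing is new to you, a rough sketch may help. This is not c4's code: it is a toy evaluator I'm assuming for illustration, handling only non-negative integers, `+ - * /` and parentheses, and it computes values directly instead of emitting code. All names (`eval_expr`, `parse_expr`, `precedence`, `cur`) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *cur;  /* cursor into the input text (toy global state) */

static void skip_spaces(void) { while (*cur == ' ') cur++; }

/* Binding strength of a binary operator; 0 means "not an operator". */
static int precedence(char op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static long parse_expr(int min_prec);

/* A primary is a parenthesized expression or a decimal number. */
static long parse_primary(void)
{
    long value = 0;

    skip_spaces();
    if (*cur == '(') {
        cur++;
        value = parse_expr(1);
        skip_spaces();
        if (*cur == ')') cur++;
        return value;
    }
    while (isdigit((unsigned char)*cur)) value = value * 10 + (*cur++ - '0');
    return value;
}

/* Core loop: keep consuming operators that bind at least as tightly
 * as min_prec; the right operand is parsed with a higher minimum. */
static long parse_expr(int min_prec)
{
    long lhs = parse_primary();

    for (;;) {
        char op;
        int prec;
        long rhs;

        skip_spaces();
        op = *cur;
        prec = precedence(op);
        if (prec == 0 || prec < min_prec) return lhs;
        cur++;
        rhs = parse_expr(prec + 1);  /* prec + 1 gives left associativity */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs = rhs ? lhs / rhs : 0; break;  /* arbitrary: x/0 yields 0 */
        }
    }
}

long eval_expr(const char *text)
{
    assert(text != NULL);
    cur = text;
    return parse_expr(1);
}
```

The whole operator grammar lives in one small table (`precedence`) plus one loop, which is why adding an operator is a one-line change here, whereas a plain recursive descent parser typically needs one function per precedence level.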
Yeah, I wrote a mini XML parser in Python and the code is really similar... I'm sure most handwritten parsers are. Or maybe I'm an anonymous demoniac in denial.
The major breakthrough of C4 is that it's both terse and yet still very readable; previously the only other well-known self-compiler of that size was OTCC, which is deliberately obfuscated.
The use of 2-character variable names is "refreshing". At least the author of the code was gracious enough to include the following to help with debugging.
So it's not actually a compiler, it's a bytecode compiler + interpreter (like Python) where the names of the bytecodes match what an x86 assembler would use, but the encoding doesn't (and it doesn't know how to create or link an .exe, ELF, or whatever macOS uses).
c4 is very cool, but it's not a compiler in the usual sense of the term. It compiles to a kind of bytecode which it then interprets (like Python, for example), but it doesn't generate assembly or machine code (like GCC or JIT compilers, for example).
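To make the "compiles to bytecode, then interprets it" distinction concrete, here is a minimal sketch of an interpreter loop for an accumulator-plus-stack machine. The opcodes and their encoding are invented for this example and should not be read as c4's actual instruction set; `run_bytecode` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, for illustration only. */
enum { OP_IMM, OP_PSH, OP_ADD, OP_MUL, OP_EXIT };

#define STACK_SIZE 64

/* Run a program until OP_EXIT and return the accumulator.
 * Returns -1 on an unknown opcode or a stack over/underflow. */
long run_bytecode(const long *pc)
{
    long stack[STACK_SIZE];
    long *sp = stack + STACK_SIZE;  /* stack grows downward */
    long a = 0;                     /* accumulator */

    assert(pc != NULL);

    for (;;) {
        switch (*pc++) {
        case OP_IMM:  /* load the next word into the accumulator */
            a = *pc++;
            break;
        case OP_PSH:  /* push the accumulator */
            if (sp == stack) return -1;
            *--sp = a;
            break;
        case OP_ADD:  /* a = popped value + a */
            if (sp == stack + STACK_SIZE) return -1;
            a = *sp++ + a;
            break;
        case OP_MUL:  /* a = popped value * a */
            if (sp == stack + STACK_SIZE) return -1;
            a = *sp++ * a;
            break;
        case OP_EXIT:
            return a;
        default:
            return -1;
        }
    }
}
```

A front end would emit something like `IMM 2, PSH, IMM 3, PSH, IMM 4, MUL, ADD, EXIT` for `2 + 3 * 4`, and the loop above evaluates it. Nothing here ever becomes machine code, which is the point being made about interpreters versus compilers like GCC.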
But there's a more realistic compiler inside swieros: https://github.com/rswier/swieros/blob/master/root/bin/c.c that emits actual "binaries" for the emulated toy CPU. It's hard to tell what exactly is going on though, as this is not "very readable and simple code".
It's just a small step to generate machine code: this fork [1] can generate an executable, or run it JIT-style, both on x86 (there's also the 'arm' branch).
In this vein, and related to the Trusting Trust Attack section, I recently came across an x86 project that works to bootstrap from nothing, relying on zero external software, even for compilation.
To that end it is organized into "phases" where the lower levels bootstrap the higher ones, with level zero essentially coding directly in x86 opcodes.
I really want to dig into it more, but for the life of me cannot find the GitHub page again. Any ideas?
As far as I can tell, the top-level directory is the actual compiler itself. 35k lines is pretty small for a real C compiler that can bootstrap GCC (the versions that were still written in pure C).
> And one cannot just throw away a backend and say that it's not part of the compiler.
OK. Let me rephrase: TCC doesn't have 35k lines of code, but with five minutes' work it could be turned into a compiler with 35k lines of code capable of bootstrapping GCC on Linux. That should be enough to compare it to the ones you list.
That's an interesting way to put it. :-D I believe we could do that to a lot of other compilers, say, lcc (30k) with its multiple backends and get something like 10-20k.
Anyways, what I was trying to say is that "tiny" in "tinycc" has lost its meaning already.
There is also https://git.sr.ht/~mcf/cproc inspired by several other small C compilers including 8cc, c, lacc, and scc. I did not take a deep look at it yet, but it looks interesting.
Unfortunate that "CUCU", "C Interpreter", and "Small C for I386 (IA-32)" are already dead links, although the page history shows that it was created only a little over 2 years ago.
These look neat, but they all seem to target a subset of C (e.g. no struct support) instead of the full thing (even if we include stdlib). The only exception seems to be SmallerC (which, as expected, is larger than the other compilers/interpreters that only support a subset).