Small C Compilers

codezero · on Oct 10, 2019

I love c4. The author writes very readable and simple code. I recommend trying out swieros as well “a tiny Unix like kernel”: https://github.com/rswier/swieros

snagglegaggle · on Oct 10, 2019

https://github.com/rswier/c4/blob/master/c4.c

Holy what

dTal · on Oct 10, 2019

Instead of just gazing at the sea of terse variable names in awe, try actually reading it! There are only about a dozen variables and they're all documented at the top. The code itself is amazingly clear for what it does, and there's not a lot of gratuitous cleverness.

For example, just picking a random segment, you don't have to squint very hard to see that this is a number literal parser:

  else if (tk >= '0' && tk <= '9') {
        if (ival = tk - '0') { while (*p >= '0' && *p <= '9') ival = ival * 10 + *p++ - '0'; }
        else if (*p == 'x' || *p == 'X') {
        [...goes on to handle the hexadecimal case...]

(aside - I love the conversion from string to decimal by subtracting the string value of '0', as this will work for any text encoding where the decimal digits are monotonic and contiguous - so ASCII and EBCDIC at least...)
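To make the trick concrete, here is a minimal standalone sketch of the same idea. It is not c4's code: `parse_number` and `hex_digit_value` are hypothetical names, octal is left out, and there is no overflow checking. The digits use `c - '0'`, while the hex letters are listed one by one, because C only promises contiguity for '0' to '9'.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit.
   Digits rely on c - '0'; letters are spelled out because C makes
   no contiguity promise for 'a'..'f'. (Hypothetical helper.) */
static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    switch (c) {
    case 'a': case 'A': return 10;
    case 'b': case 'B': return 11;
    case 'c': case 'C': return 12;
    case 'd': case 'D': return 13;
    case 'e': case 'E': return 14;
    case 'f': case 'F': return 15;
    default: return -1;
    }
}

/* Parse a decimal or 0x-prefixed hex literal starting at *pp and
   advance *pp past it. No overflow checking: this is only a sketch. */
long parse_number(const char **pp)
{
    const char *p;
    long val = 0;

    assert(pp != NULL && *pp != NULL);
    p = *pp;
    if (p[0] == '0' && (p[1] == 'x' || p[1] == 'X')) {
        p += 2;
        while (hex_digit_value((unsigned char)*p) >= 0)
            val = val * 16 + hex_digit_value((unsigned char)*p++);
    } else {
        while (*p >= '0' && *p <= '9')
            val = val * 10 + (*p++ - '0');
    }
    *pp = p;
    return val;
}
```

Called on a pointer to "1234;", this returns 1234 and leaves the pointer on the ';'.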

aasasd · on Oct 10, 2019

> I love the conversion from string to decimal by subtracting the string value of '0'

That's how everyone did character arithmetic since forever, though, especially with the letters. Wouldn't be surprised if it's in K&R. And it probably became a subtle source of errors when environments changed, as mixing semantics and implementation tends to do.

edflsafoiewq · on Oct 10, 2019

>as this will work for any text encoding where the decimal digits are monotonic and contiguous

The C spec requires this be the case for both the source and execution basic character sets.

Razengan · on Oct 10, 2019

Very readable? It's on the cusp of becoming a demon summoning ritual.

megous · on Oct 10, 2019

A bit terse perhaps, but the only thing that is a bit more than simple tokenization/parsing and direct output is the precedence climbing part for parsing expressions. That may cause some headscratching for some if they haven't implemented it in the past. But it's easy to read up on that algorithm on Wikipedia. Anyway, even using that ultimately makes code more readable than trying to implement the same with recursive descent.
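For anyone who hasn't met precedence climbing before, a rough sketch follows. It is an illustration only, not c4's implementation: it evaluates integer expressions directly instead of emitting code, handles just + - * / and parentheses, and every name in it (`eval`, `parse_expr`, `precedence`) is made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *src;   /* cursor into the expression being parsed */

static void skip_spaces(void) { while (*src == ' ') src++; }

/* Binding power of a binary operator; 0 means "not an operator". */
static int precedence(char op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default: return 0;
    }
}

static long parse_expr(int min_prec);

/* primary := number | '(' expr ')' */
static long parse_primary(void)
{
    long v = 0;
    skip_spaces();
    if (*src == '(') {
        src++;
        v = parse_expr(1);
        skip_spaces();
        if (*src == ')') src++;
        return v;
    }
    while (isdigit((unsigned char)*src)) v = v * 10 + (*src++ - '0');
    return v;
}

/* Consume operators whose precedence is at least min_prec. */
static long parse_expr(int min_prec)
{
    long lhs = parse_primary();
    for (;;) {
        char op;
        int prec;
        long rhs;

        skip_spaces();
        op = *src;
        prec = precedence(op);
        if (prec == 0 || prec < min_prec) return lhs;
        src++;
        rhs = parse_expr(prec + 1);   /* prec + 1 gives left associativity */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs = rhs ? lhs / rhs : 0; break;  /* sketch: x/0 yields 0 */
        }
    }
}

/* Evaluate a whole expression string. */
long eval(const char *text)
{
    assert(text != NULL);
    src = text;
    return parse_expr(1);
}
```

With this, eval("1 + 2 * 3") gives 7 and eval("10 - 4 - 3") gives 3. The whole operator table lives in `precedence`, so adding a level means adding a case there, not writing another parsing function.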

agumonkey · on Oct 10, 2019

yeah, I wrote a mini xml parser in python and the code is really similar ... I'm sure most handwritten parsers are. Or maybe I'm an anonymous demoniac in denial.

userbinator · on Oct 10, 2019

The major breakthrough of C4 is that it's both terse and yet still very readable; previously the only other well-known self-compiler of that size was OTCC, which is deliberately obfuscated.

klingonopera · on Oct 10, 2019

While this is about C++, John Carmack just came from C, so a lot of this applies to C as well: https://kotaku.com/the-exceptional-beauty-of-doom-3s-source-...

zik · on Oct 10, 2019

Oh wow. That's quite something.

yitchelle · on Oct 10, 2019

The use of two-character variable names is "refreshing". At least the author of the code was gracious enough to include the following to help with debugging.

    debug;    // print executed instructions

acqq · on Oct 10, 2019

It's beautiful. Here are examples of compiling c4 and using it, following the README:

    $ gcc -o c4 c4.c 2>/dev/null
    $ ./c4 hello.c
    hello, world
    exit(0) cycle = 9
    $ ./c4 -s hello.c
    1: #include <stdio.h>
    2:
    3: int main()
    4: {
    5:   printf("hello, world\n");
        ENT  0
        IMM  -274739184
        PSH
        PRTF
        ADJ  1
    6:   return 0;
        IMM  0
        LEV
    7: }
        LEV
    $ ./c4 c4.c hello.c
    hello, world
    exit(0) cycle = 9
    exit(0) cycle = 26015
    $ ./c4 c4.c c4.c hello.c
    hello, world
    exit(0) cycle = 9
    exit(0) cycle = 26015
    exit(0) cycle = 10060183

It works on Windows too, if you have gcc installed.

Also, see "Adding support for 64 bit targets" commit:

https://github.com/rswier/c4/commit/2feb8c0a142b2e513be69442...

And yes, swieros has much more.

candiodari · on Oct 12, 2019

So it's not actually a compiler, it's a bytecode compiler + interpreter (like Python) where the names of the bytecodes match what an x86 assembler would use, but the encoding doesn't (and it doesn't know how to create or link .exe, ELF, or whatever macOS uses)

tom_mellior · on Oct 10, 2019

c4 is very cool, but it's not a compiler in the usual sense of the term. It compiles to a kind of bytecode which it then interprets (like Python, for example), but it doesn't generate assembly or machine code (like GCC or JIT compilers, for example).
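To show what "compiles to bytecode and then interprets it" looks like in miniature, here is a small sketch of an accumulator-plus-stack interpreter loop. The opcode names echo the ones in the -s listing in this thread, but the set, the encoding, and the `run` function are assumptions for the example and not c4's actual virtual machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for illustration only; not c4's real
   instruction set or encoding. */
enum { IMM, PSH, ADD, MUL, EXIT };

/* Run a program on a tiny accumulator-plus-stack machine and
   return the accumulator at EXIT. No stack bounds checks: a sketch. */
long run(const long *code)
{
    long stack[64];
    long *sp = stack + 64;    /* stack grows downward */
    long a = 0;               /* accumulator */
    const long *pc = code;    /* program counter */

    assert(code != NULL);
    for (;;) {
        switch (*pc++) {
        case IMM:  a = *pc++;      break;  /* load immediate into a */
        case PSH:  *--sp = a;      break;  /* push a onto the stack */
        case ADD:  a = *sp++ + a;  break;  /* pop, add to a */
        case MUL:  a = *sp++ * a;  break;  /* pop, multiply with a */
        case EXIT: return a;
        default:   return -1;              /* unknown opcode */
        }
    }
}
```

The expression 2 + 3 * 4 would be encoded as IMM 2, PSH, IMM 3, PSH, IMM 4, MUL, ADD, EXIT, and `run` returns 14 for it. A compiler in the usual sense would emit machine instructions here and hand them to the CPU, with no loop like this in between.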

But there's a more realistic compiler inside swieros: https://github.com/rswier/swieros/blob/master/root/bin/c.c that emits actual "binaries" for the emulated toy CPU. It's hard to tell what exactly is going on though, as this is not "very readable and simple code".

bicolao · on Oct 10, 2019

It's just a small step to generate machine code, this fork [1] can generate an executable, or run it JIT style, both on x86 (there's also the 'arm' branch)

[1] https://gitlab.com/pclouds/c4

altermark · on Oct 10, 2019

c4 with tiny x86 JIT:

https://github.com/EarlGray/c4

https://github.com/EarlGray/c4/blob/master/c4x86.c

unkulunkulu · on Oct 10, 2019

I must be missing something about compiler design, but why does it seem to have an interpreter at the end of main?

pygy_ · on Oct 10, 2019

You went to the code too fast (a mistake I also commit often, since unlike comments/doc, code doesn’t lie...).

From the project description: “A tiny hand crafted CPU emulator, C compiler, and Operating System”

xelxebar · on Oct 10, 2019

In this vein, and related to the Trusting Trust Attack section, I recently came across an x86 project that works to bootstrap from nothing, relying on zero external software, even for compilation.

To that end it is organized into "phases" where the lower levels bootstrap the higher ones, with level zero essentially coding directly in x86 opcodes.

I really want to dig into it more, but for the life of me cannot find the github page again. Any ideas?

xelxebar · on Oct 10, 2019

Aaaaand, serendipitously, I just now stumbled upon it in an HN post a bit below this one:

https://news.ycombinator.com/item?id=21201413

That article mentions the project I was thinking of:

https://github.com/oriansj/stage0

Amazingly, it seems they are even aiming to bootstrap some minimal hardware as well! Super cool.

senorsmile · on Oct 10, 2019

for some reason, I missed your reply to your own comment completely!

senorsmile · on Oct 10, 2019

This sounds very interesting. If you find it, will you please post a link to the project?

msclrhd · on Oct 10, 2019

What about the Tiny C Compiler (TCC) that Fabrice Bellard wrote? https://bellard.org/tcc/

Skywalker13 · on Oct 10, 2019

https://bellard.org/otcc/

vkazanov · on Oct 10, 2019

it's not tiny anymore :-)

ainar-g · on Oct 10, 2019

Could I ask what you mean by that?

tom_mellior · on Oct 10, 2019

Not sure what the parent means by it, but here are lines-of-code counts generated using David A. Wheeler's 'SLOCCount'.

    SLOC Directory SLOC-by-Language (Sorted)
    36270   top_dir         ansic=35504,sh=460,perl=306
    28825   win32           ansic=28716,asm=109
    9692    tests           ansic=8806,asm=858,sh=28
    2395    lib             ansic=2252,asm=143
    158     include         ansic=158
    140     examples        ansic=140


    Totals grouped by language (dominant language first):
    ansic:        75576 (97.54%)
    asm:           1110 (1.43%)
    sh:             488 (0.63%)
    perl:           306 (0.39%)

As far as I can tell, the top-level directory is the actual compiler itself. 35k lines is pretty small for a real C compiler that can bootstrap GCC (the versions that were still written in pure C).

vkazanov · on Oct 10, 2019

What I mean is what I said :-) 70k is not tiny. And one cannot just throw away a backend and say that it's not part of the compiler.

Compare it with a few others:

8cc/9cc - 10K LOC
cproc - 7K LOC
lcc - 30K LOC

tom_mellior · on Oct 10, 2019

> And one cannot just throw away a backend and say that it's not part of the compiler.

OK. Let me rephrase: TCC doesn't have 35k lines of code, but with five minutes' work it could be turned into a compiler with 35k lines of code capable of bootstrapping GCC on Linux. That should be enough to compare it to the ones you list.

vkazanov · on Oct 10, 2019

That's an interesting way to put it. :-D I believe we could do that to a lot of other compilers, say, lcc (30k) with its multiple backends and get something like 10-20k.

Anyways, what I was trying to say is that "tiny" in "tinycc" has lost its meaning already.

userbinator · on Oct 10, 2019

cproc is not 7K self-contained either; from the GitHub page: "cproc is a C11 compiler using QBE as a backend".

vkazanov · on Oct 10, 2019

that's correct. But even if you add qbe you only get 12k more, or about 20k, not 70k.

megous · on Oct 10, 2019

The referenced compilers are a different level of tiny, like 500 lines of tiny. :)

ainar-g · on Oct 10, 2019

Ah, I see. tcc is still tiny compared to clang and gcc though :-)

agumonkey · on Oct 10, 2019

I wonder if anybody ported tcc-linux

paulriddle · on Oct 10, 2019

There is also https://git.sr.ht/~mcf/cproc inspired by several other small C compilers including 8cc, c, lacc, and scc. I did not take a deep look at it yet, but it looks interesting.

guidedlight · on Oct 10, 2019

I wonder why LCC isn’t mentioned. https://en.wikipedia.org/wiki/LCC_(compiler)

MrXOR · on Oct 10, 2019

and https://bellard.org/tcc

mhd · on Oct 10, 2019

Weirdly enough, it's mentioned in the "Past Research" section above, but not listed amongst the compilers. Maybe not small enough?

andrewchambers · on Oct 10, 2019

This one is fantastic: https://github.com/michaelforney/cproc

pepijndevos · on Oct 10, 2019

Say I wrote my own CPU (which I did), which of those tiny C compilers would be easiest to retarget?

My CPU is very small, (~300 LUT4 on an FPGA using Yosys), but has a very minimal ISA. It's mostly an accumulator machine with a stack pointer.

nils-m-holm · on Oct 10, 2019

SubC is very easy to retarget, especially the book version. It is not a full C compiler, though.

http://t3x.org/subc/

Rietty · on Oct 10, 2019

I'm interested in what you mean by "wrote my own CPU". Could you elaborate please?

pepijndevos · on Oct 10, 2019

https://github.com/pepijndevos/seqpu/blob/master/cpu.vhd

userbinator · on Oct 10, 2019

Unfortunate that "CUCU", "C Interpreter", and "Small C for I386 (IA-32)" are already dead links, although the page history shows that it was created only a little over 2 years ago.

zserge · on Oct 10, 2019

Sorry, blog layout has changed since then. https://zserge.com/posts/cucu-part1/

kragen · on Oct 10, 2019

You might want to fix the links so the old URLs work again.

jokoon · on Oct 10, 2019

I've tried to use TCC to see if it's possible to use it as a scripting language in a C++ project, and it seems to work pretty well.

(I've heard about chaiscript, but it seems way too big)

makapuf · on Oct 10, 2019

Genuine curiosity, it seems smaller than tcc? Is it by LOC, by runtime memory used, or by executable size?

enriquto · on Oct 10, 2019

There's also this amazing thing: http://www.simple-cc.org/

vector_spaces · on Oct 10, 2019

Anyone know what they're using for the repo web UI? http://git.simple-cc.org/scc/

naters · on Oct 10, 2019

Pretty sure it's this: https://git.codemadness.org/stagit/

enriquto · on Oct 10, 2019

no idea, but this is a lovely, clean interface. I particularly like that file sizes are in lines

Crinus · on Oct 10, 2019

These look neat but they all seem to be targeting a subset of C (e.g. no struct support) instead of the full thing (even if we include stdlib). Only exception seems to be SmallerC (which as expected is larger than the other compilers/interpreters that only support a subset).

rswier · on Oct 10, 2019

I made a branch of c4 that includes struct support. I should probably add it to the master branch so it gets more visibility: https://github.com/rswier/c4/blob/switch-and-structs/c4.c