This seems to me like a necessary complement to the efforts around making packages build reproducibly.
It has long been accepted that if a piece of software requires a non-Free compiler to build it, then that piece of software is de facto non-Free too. Taking that to its logical extreme, a piece of software isn't Free unless it can be built by a compiler whose recursive sequence of meta-compilers leads back to a minimal, audited binary seed.
Fortunately, once this has been done once (or multiple times independently, producing the same results), all future compilations and software can potentially enjoy the benefits of the process.
> It consists of a mutual self-hosting Scheme interpreter written in
> ~5,000 LOC of simple C and a Nyacc-based C compiler written in Scheme.
I’m trying to understand how important this part is. Is there a fundamental reason this needed to be a C compiler and Scheme interpreter that each can compile or interpret the other? Or is it just that they needed to support those two languages to bootstrap this particular software distribution?
Ultimately, the goal is to compile a trusted copy of the GCC source code with a trusted compiler. This means that you need a trustworthy C compiler binary. In order for the compiler binary to be trustworthy, you have to either:
1. read and understand the machine code of the compiler's binary, or
2. have an observable compiler that can compile itself and also GCC (or at least a chain of better compilers that eventually compile GCC).
This project uses approach 2 - an interpreter to interpret a compiler compiling its own interpreter, and also a better compiler. Because the compiler is interpreted, you can potentially intervene and observe the compilation process. (To watch a compiled binary of a compiler compiling itself, you would need a trustworthy debugger binary, which you can't bootstrap unless you can already trust some other compiler, or read and understand the debugger's binary.)
Because the interpreted compiler can compile its own interpreter, and the interpreter interprets its own compiler, you can read and understand the source code of each, observe the interpreted compiler compile the interpreter, verify that the result is bit-for-bit identical to what you have been running, and now you have an executable compiler that you can trust.
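To make that last step concrete, here is a minimal sketch in C of the bit-for-bit comparison, with hypothetical file names: "mes" stands for the interpreter binary you have been running all along, and "mes-rebuilt" for the interpreter just produced by the interpreted compiler. (The comparison program is itself part of what you must trust, but it is small enough to audit by hand, or even to perform with nothing more than a hex dump and your eyes.)

    /* Sketch only: the final check of the trust loop described above.
       "mes" is the interpreter binary we have been running; "mes-rebuilt"
       is the one the interpreted compiler just produced.  Both names are
       hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>

    static int identical(const char *a, const char *b)
    {
        FILE *fa = fopen(a, "rb");
        FILE *fb = fopen(b, "rb");
        int ca, cb, same = 1;

        if (!fa || !fb) {
            perror("fopen");
            exit(1);
        }
        do {                        /* compare byte for byte, including EOF */
            ca = fgetc(fa);
            cb = fgetc(fb);
            if (ca != cb) {
                same = 0;
                break;
            }
        } while (ca != EOF);
        fclose(fa);
        fclose(fb);
        return same;
    }

    int main(void)
    {
        if (identical("mes", "mes-rebuilt"))
            puts("bit-for-bit identical: the binary matches its own source");
        else
            puts("MISMATCH: do not trust this toolchain");
        return 0;
    }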
So, having a mutually self-hosting compiler and interpreter is a neat way of closing the compiler trust loop -- instead of always having to trust some earlier compiler, this allows you to establish trust (or distrust) of a pair of programs by running them on themselves and comparing what you get to what you started with. The fact that the interpreter is written in Scheme is largely incidental: this project is intended for GuixSD, a Linux distribution whose package manager is written and configured in Scheme. I assume Scheme was chosen because
1. people who work on GuixSD like Scheme, and are comfortable reading and writing it, and
2. Scheme has a good power-to-complexity ratio, minimizing the amount of code that has to be written, read, and understood in order to trust the compiler toolchain.
The fact that the compiler is a C compiler is more important, because with a trusted C compiler you can compile (old versions of) GCC. With only a trusted X compiler (where X is not C or C++), you would need to add another step: a C compiler written in X. This would still be feasible, but it adds more code to the already-enormous pile that needs to be trusted.
Has anyone considered bootstrapping on completely different stacks and then diffing the results?
Maybe even using cross compilers.
I can imagine hostile actors compromising various Intel, AMD, ARM chips. But I can't imagine those compromised targets all behaving the same way. Or going back in time to port their compromises to obsolete architectures.
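For what it's worth, here is a rough sketch of that cross-check in C (essentially David A. Wheeler's "diverse double-compiling" idea). "cc-one", "cc-two", and "compiler.c" are placeholders for two unrelated toolchains and the compiler source under test; in practice the two legs could run on entirely different hardware, with only the stage-2 binaries brought together for the final comparison:

    /* Sketch of a diverse double-compiling style cross-check.
       cc-one and cc-two are two unrelated, independently built C compilers;
       compiler.c is the source of the compiler being checked.
       All names are placeholders. */
    #include <stdio.h>
    #include <stdlib.h>

    static void run(const char *cmd)
    {
        printf("+ %s\n", cmd);
        if (system(cmd) != 0) {
            fprintf(stderr, "failed: %s\n", cmd);
            exit(1);
        }
    }

    int main(void)
    {
        /* Stage 1: build the same compiler source with two unrelated
           toolchains. */
        run("cc-one -o stage1-a compiler.c");
        run("cc-two -o stage1-b compiler.c");

        /* Stage 2: let each stage-1 result rebuild the same source for one
           common target. */
        run("./stage1-a -o stage2-a compiler.c");
        run("./stage1-b -o stage2-b compiler.c");

        /* If neither stack is playing games (and the compiler's output is
           deterministic), the two stage-2 binaries must be identical, even
           though the stage-1 binaries generally are not. */
        if (system("cmp -s stage2-a stage2-b") == 0)
            puts("stage-2 outputs are bit-for-bit identical");
        else
            puts("MISMATCH: at least one of the two stacks is suspect");
        return 0;
    }

A backdoor would have to be present, and behave identically, in both unrelated stacks to survive that comparison unnoticed.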
This seems like a really neat idea! I think it would be cool to use plain source code to completely bootstrap a system. But, at the same time, this sounds like a maintenance nightmare! Maintaining older or smaller versions of software just to avoid self-hosting is bound to hit all kinds of bugs - big and small.
At the same time, I'm not convinced this completely solves the problem. Maybe you can trust your C compiler now, but can you trust your Bash interpreter, or your sed command? What stops them from injecting things into your C compiler while it's building? At the end of the day, you have to trust some software, so why not include your C compiler in this list?
Anyway, I am much more concerned about places much higher up in free-software bootstrapping. Lots of software has switched from Autotools to Meson without thinking about how we will bootstrap this stuff. Meson requires you to have Ninja to work, but how can I bootstrap Ninja without Python? Autotools is admittedly bad, but at least you only needed bash & make for it to work.
> Maintaining older or smaller versions of software just to avoid self-hosting is bound to hit all kinds of bugs - big and small.
Unfortunately, any compiler that can compile current versions of GCC would probably be enormous, and difficult to trust directly. C++ is a complicated language to compile.
As long as the chosen versions of TCC and GCC
1. are free of bugs that affect compiling the next-more-sophisticated compiler, and
2. can be trusted,
then having them as intermediate compilers isn't too terrible. Fixes or workarounds for any bugs in them that affect compiling the next-more-sophisticated compiler can be written and committed now, and because those versions are frozen, they will remain valid forever. The only place new bugs can be encountered is if the current version of GCC changes so that it can't be compiled by the last-frozen version of GCC. There are two ways to fix that:
1. Patch the last-frozen version of GCC so that it can still compile the latest GCC, or
2. Add another intermediate version of GCC to the toolchain, since the last-frozen GCC could compile every version of GCC up to the most current at the time it was frozen, and the most current GCC must be compilable by at least one earlier version of GCC.
This means that the compiler chain can potentially grow without bound, but it would do so slowly, and only by adding more compiler code onto the end of already-trusted compiler code. A long chain of obsolete-but-trusted compilers is not ideal, but should be a working and stable source of trust.
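As a toy illustration of that append-only growth (the first three entries follow the tcc-0.9.26 / gcc-2.95.3 choice mentioned further down in the thread; the later hops and all of the naming are hypothetical, and the real build recipes are of course far more involved):

    /* Toy illustration: the bootstrap chain as an append-only table.
       Entries are illustrative, not the exact Guix recipe. */
    #include <stdio.h>

    struct stage {
        const char *compiler;   /* what gets built at this step            */
        const char *built_by;   /* the already-trusted tool that builds it */
    };

    static const struct stage chain[] = {
        { "Mes + mescc (interpreted)", "small audited binary seed" },
        { "tcc-0.9.26",                "mescc"                     },
        { "gcc-2.95.3",                "tcc-0.9.26"                },
        { "intermediate gcc",          "gcc-2.95.3"                }, /* hypothetical hop */
        { "current gcc",               "intermediate gcc"          }, /* hypothetical hop */
    };

    int main(void)
    {
        /* Supporting a newer GCC that the last entry can no longer build
           only ever appends a row; the already-trusted prefix is untouched. */
        for (size_t i = 0; i < sizeof chain / sizeof chain[0]; i++)
            printf("%-26s  built by  %s\n", chain[i].compiler, chain[i].built_by);
        return 0;
    }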
Well, you could have an initial stage in the bootstrap that is pure machine code, but short and simple enough that it can be read and understood. That stage could implement a very simple assembler. You'd then write a more capable assembler in the simple assembly language, and from there a few steps of increasingly complicated compilers.
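For concreteness, this is roughly the kind of logic that first stage has to implement, written here in C only so it is easy to read; in an actual bootstrap this program would itself be the short, hand-auditable machine code (more or less what the Stage0 project mentioned further down does with its hex tools). Its only job is to turn a commented hex listing into raw bytes, which is enough to "assemble" the next, slightly more comfortable tool from an auditable text file:

    /* Sketch of a minimal hex loader: read ASCII hex digit pairs, ignore
       whitespace and ';' comments, and write the corresponding raw bytes.
       A tool this small can be hand-written and audited as machine code. */
    #include <stdio.h>

    static int hexval(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;                          /* not a hex digit */
    }

    int main(void)
    {
        int c, hi = -1;

        while ((c = getchar()) != EOF) {
            if (c == ';') {                 /* comment: skip to end of line */
                while ((c = getchar()) != EOF && c != '\n')
                    ;
                continue;
            }
            int v = hexval(c);
            if (v < 0)                      /* whitespace and other filler */
                continue;
            if (hi < 0)
                hi = v;                     /* first nibble of a byte */
            else {
                putchar(hi << 4 | v);       /* second nibble: emit the byte */
                hi = -1;
            }
        }
        return 0;
    }

Feed it a hex listing of a slightly smarter assembler and you have the next rung of the ladder.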
> This seems like a really neat idea! I think it would be cool to use
> plain source code to completely bootstrap a system.
Technically, that's what we are working towards. Ideologically, we are
working towards moving awareness in our communities from `neat idea' to
`unbelievable that in 2018 we still used and trusted computers that
had no bootstrappable audit trail!'
> But, at the same time, this sounds like a maintenance nightmare!
> Maintaining older or smaller versions of software just to avoid
> self-hosting is bound to hit all kinds of bugs - big and small.
You are not suggesting that we stop our efforts because it will
require work and may contain bugs, right? ;-)
Until now, this has been the effort of a very small team. The choice
for tcc-0.9.26 and gcc-2.95.3 was a pragmatic one. We hope that when
awareness of bootstrappable software rises we can make some better
choices.
I would love for the TinyCC and GNU GCC developers (any software
developers, really) to take the lead in creating their own
bootstrappable stories; we just showed it can be done.
> At the same time, I'm not convinced this completely solves the
> problem. Maybe you can trust your C compiler now, but can you trust
> your Bash interpreter, or your sed command?
Right, doing only part of the work does not solve the problem. The
solution will be built from a number of such steps. One of those
steps may involve hardware; I'm sure it's a long path.
> What stops them from injecting things into your C compiler while
> it's building? At the end of the day, you have to trust some
> software, so why not include your C compiler in this list?
Some of us have decided that is just not good enough, that we would
like our software to be not only reproducible but also bootstrappable,
and we are working towards that.
There are Stage0 and M2-Planet to take care of what's below Mes.
We have also started work on a Bash replacement in Scheme (using GNU
Guile initially) that comes with a minimal implementation of
coreutils, grep, sed, tar. I have managed to build GNU make and Bash
using that: https://twitter.com/janneke_gnu/status/1070434782973063168
> Anyway, I am much more concerned about much higher places in free
> software bootstrapping. Lots of software has switched from Autotools
> to Meson without thinking about how we will bootstrap this
> stuff. Meson requires you to have Ninja to work, but how can I
> bootstrap Ninja without Python?
Surely the technical part of that problem can be solved quite
easily once the respective developers of those projects become
aware of bootstrapping and reproducibility and decide to give it
priority?
> CakeML is a functional programming language and an ecosystem of proofs and tools built around the language. The ecosystem includes a proven-correct compiler that can bootstrap itself.