Indeed! An eventual goal of Onramp is to bootstrap in freestanding so we can boot directly into the VM without an OS. This eliminates all binaries except for the firmware of the machine. The stage0/live-bootstrap team has already accomplished this so we know it's possible. Eliminating firmware is platform-dependent and mostly outside the scope of Onramp but it's certainly something I'd like to do as a related bootstrap project.
A modern UEFI is probably a million lines of code so there's a huge firmware trust surface there. One way to eliminate this would be to bootstrap on much simpler hardware. A rosco_m68k [1] is an example, one that requires no third-party firmware at all aside from the non-programmable microcode of the processor. (A Motorola 68010 is thousands of times slower than a modern processor so the bootstrap would take days, but that's fine, I can wait!)
Of course there's still the issue of trusting that the data isn't modified on its way into the machine. For example you have to trust the tools you're using to flash EEPROM chips, or if you're using an SD card reader you have to trust its firmware. You also have to trust that your chips are legit, that the Motorola 68010 isn't a modern fake that emulates it while compromising it somehow. If you had the resources you'd probably want to x-ray the whole board at a minimum to make sure the chips are real. As for trusting ROM, I have some crazy ideas on how to get data into the machine in a trustable way, but I'm not quite ready to embarrass myself by saying them out loud yet :)
[1]: https://rosco-m68k.com/
Author here. I think my opinion would be about the same as the authors of the stage0 project [1]. They invested quite a bit of time trying to get Forth to work but ultimately abandoned it. Forth has been suggested often for bootstrapping a C compiler, and I hope someone does it someday, but so far no one has succeeded.
Programming for a stack machine is really hard, whereas programming for a register machine is comparatively easy. I designed the Onramp VM specifically to be easy to program in bytecode, while also being easy to implement in machine code. Onramp bootstraps through the same linker and assembly languages that are used in a traditional C compilation process so there are no detours into any other languages like Forth (or Scheme, which live-bootstrap does with mescc.)
tl;dr I'm not really convinced that Forth would simplify things, but I'd love to be proven wrong!
To add a bit to this: although Dusk OS doesn't share stage0's goal of mitigating the "trusting trust" attack, I think it effectively achieves it anyway. Dusk OS kernels are less than 3000 bytes. The rest boots from source. One can easily audit those 3000 bytes manually to ensure that nothing has been inserted.
That being said, the goal of stage0 is to ultimately compile gcc and there's no way to do that with Dusk OS.
That being said (again), this README in stage0 could be updated because I indeed think that Dusk is a good counterpoint to this critique of Forth.
Oh, amazing! I've heard of DuskOS before but I didn't realize its C compiler was written in Forth.
Looks like it makes quite a few changes to C so it can't really run unmodified C code. I wonder how much work it would take to convert a full C compiler into something DuskCC can compile.
One of my goals with Onramp is to compile as much unmodified POSIX-style C code as possible without having to implement a full POSIX system. For example Onramp will never support a real fork() because the VM doesn't have virtual memory, but I do want to implement vfork() and exec().
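For example (a hypothetical sketch, not Onramp's actual libc), the pattern I'd like unmodified programs to be able to use is the usual vfork-then-exec spawn:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Spawn a child process without duplicating the parent's address
     * space. vfork() borrows the parent's memory until the child calls
     * exec or _exit, so it doesn't need the virtual memory tricks that
     * a real copy-on-write fork() does. */
    int run(char *const argv[]) {
        pid_t pid = vfork();
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);               /* exec failed */
        }
        if (pid < 0) {
            perror("vfork");
            return -1;
        }
        int status;
        waitpid(pid, &status, 0);
        return status;
    }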
It can't compile unmodified C code targeting POSIX. That's by design. Allowing this would import way too much complexity into the project.
But it does implement a fair chunk of C itself. The idea is to minimize the magnitude of the porting effort and make it mechanical.
For example, the driver for the DWC USB controller (the controller on the Raspberry Pi) comes from Plan 9. There was a fair amount of porting to do, but it was mostly to remove the unnecessary hooks. The code itself, where the real logic happens, stays pretty much the same and can be compiled just fine by Dusk's C compiler.
That might be rather more difficult than you might expect. Advent of Code uses quite a lot of 64-bit numbers. A bit of googling tells me C64 BASIC only supports 16-bit integers and 32-bit floats. I imagine the other BASICs have similar limitations.
I did 2023 Advent of Code with my own compiler and this was the biggest challenge I ran into. I only had 32-bit integers at the time so I had to manually implement 64-bit math and number formatting within the language to be able to do the puzzles. You would probably have to do the same in BASIC.
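For anyone curious, it's just schoolbook arithmetic on 32-bit halves. A rough sketch in C of what the addition looks like (illustrative only, not my compiler's actual runtime):

    #include <stdint.h>

    /* A 64-bit value represented as two 32-bit halves. */
    typedef struct { uint32_t lo, hi; } u64;

    /* Add two 64-bit values: add the low halves, then propagate
     * the carry into the high halves. */
    u64 u64_add(u64 a, u64 b) {
        u64 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry if low sum wrapped */
        return r;
    }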
Commodore BASIC, derived from Microsoft's 6502 BASIC, actually has 40-bit floats, with a 32-bit mantissa and 8-bit exponent, not that it would help much, if any, with 64-bit maths.
There are Microsoft BASICs that have 64-bit floats, such as built into ROM on the TRS-80 Model I, III and 4 w/Level 2 BASIC, TRS-80 Model 100/102, TI-99/4(a), Apple III, and MSX systems, or on cartridge such as Microsoft BASIC for the Atari 8-bit computers.
It can be difficult to explain why bootstrapping is important. I put a "Why?" section in the README of my own bootstrapping compiler [0] for this reason.
Security is a big reason and it's one the bootstrappable team tend to focus on. In order to avoid the trusting trust problem and other attacks (like the recent xz backdoor), we need to be able to bootstrap everything from pure source code. They go as far as deleting all pre-generated files to ensure that they only rely on things that are hand-written and auditable. So bootstrapping Python for example is pretty complicated because the source contains code generated by Python scripts.
I'm much more interested in the cultural preservation aspect of it. We want to preserve contemporary media for future archaeologists, for example in the Arctic World Archive [1]. Unfortunately it's pointless if they have no way to decode it. So what do we do? We can preserve the specs, but we can't really expect them to implement x265 and everything else they would need from scratch. We can preserve binaries, but then they'd need to either get thousand-year-old hardware running or virtualize a thousand-year-old CPU. We can give them, say, a definition of a simple Lisp, and then give them code that runs on that, but then who's going to implement x265 in a basic Lisp? None of this is really practical.
That's why in my project I made a simple virtual machine, then bootstrapped C on top of it. It's trivially portable, not just to present-day architectures but to future and alien architectures as well. Any future archaeologist or alien civilization could implement the VM in a day, then run the C bootstrap on it, then compile ffmpeg or whatever and decode our media. There are no black boxes here: it's all debuggable, auditable, open, handwritten source code.
The minimum tool that bootstrapping projects tend to start with is a hex monitor. That is, a simple-as-possible tool that converts hexadecimal bytes of input into raw bytes in memory, and then jumps to it.
You need some way of getting this hex tool into memory, of course. On traditional computers this could be entered through front panel switches, but modern computers don't have those anymore. You could also imagine it hand-woven into core rope memory for example, which could then be connected directly to the CPU at its boot address. There are many options here; getting the hex tool running is very platform-specific.
Once you have a hex tool, you can then use that to input the next stage, which is written in commented hexadecimal source code. The next tool then adds a few features, and so does the tool after that, and so on, eventually working your way up to assembly and C.
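To make that concrete, here's roughly what the conversion step does, sketched in C (illustrative only: obviously you don't have C yet at that stage, the real tools are a few dozen hand-written machine instructions, comment syntax varies between projects, and on real hardware you jump to the loaded bytes instead of writing them out):

    #include <stdio.h>

    /* Read commented hexadecimal source on stdin and write the raw
     * bytes to stdout. ';' and '#' start a comment that runs to the
     * end of the line; anything else that isn't a hex digit is
     * ignored. */
    int main(void) {
        int c, hi = -1;
        while ((c = getchar()) != EOF) {
            if (c == ';' || c == '#') {            /* skip comments */
                while (c != '\n' && c != EOF) c = getchar();
                continue;
            }
            int d;
            if (c >= '0' && c <= '9')      d = c - '0';
            else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
            else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
            else continue;                         /* skip whitespace etc. */
            if (hi < 0) hi = d;
            else { putchar(hi << 4 | d); hi = -1; }
        }
        return 0;
    }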
From the point of view of trust and security, bootstrapping has to be something that's easily repeatable by everyone, in a reasonable amount of time and steps, with the same results.
Not to mention using only the current versions of all the deliverables or at most one version back.
While it is technically possible to bootstrap Rust from Guile and the 0.7 Rust compiler, you would need to recompile the Rust compiler about a hundred times. Each step takes hours, and you can't skip any steps because, like he said, 1.80 requires 1.79, 1.79 requires 1.78, and so on all the way back to 0.7. Even if fully automated, this bootstrap would take months.
Moreover, I believe the earlier versions of rustc only output LLVM, so you need to bootstrap a C++ compiler to compile LLVM anyway. If you have a C++ compiler, you might as well compile mrustc. Currently, mrustc only supports rustc 1.54, so you'd still have to compile through some 35 versions of it.
None of this is practical. The goal of Dozer (this project) is to be able to bootstrap a small C compiler, compile Dozer, and use it to directly compile the latest rustc. This gives you Rust right away without having to bootstrap C++ or anything else in between.
This is accurate. I'm an OS/kernel developer and a colleague was given the task of porting rust to our OS. If I remember correctly, it did indeed take months. I don't think mrustc was an option at the time for reasons I don't recall, so he did indeed have to go all the way back to the very early versions and work his way through nearly all the intermediate versions. I had to do a similar thing porting java, although that wasn't quite as annoying as porting rust. I really do wish more language developers would provide a more practical way of bootstrapping their compilers like the article is describing/attempting. I've seen some that do a really good job. Others seem to assume only *nix and Windows exist, which has been pretty frustrating.
I'm curious as to why you need to bootstrap at all? Why not start with adding the OS/kernel as a target for cross-compilation and then cross-compile the compiler?
The article mentions that the Bootstrappable Builds folks don't allow pre-generated code in their processes, they always have to build or bootstrap it from the real source.
that's interesting! what kind of os did you write? it sounds like you didn't think supporting the linux system call interface was a good idea, or perhaps even feasible?
It's got a fairly Linux-like ABI, though we don't aim or care to be 1-to-1 compatible, and it has/requires our own custom interfaces. Porting most software that was written for Linux is usually pretty easy. But we can't just run binaries compiled for Linux on our stuff. So for languages that require a compiler written in its own language, where they don't supply cross compilers or bootstrapping compilers built with the lowest common denominator (usually C or C++), things can get a little trickier.
The current version of rustc may compile itself quickly, but remember, this is after nearly ten years of compiler optimizations. Older versions were much slower.
I seem to recall complaints that old rustc would take many hours to compile itself. Even if it takes on average, say, two hours to compile itself, that's well over a week to bootstrap all the way from 0.7 to present. You're right that months is probably an exaggeration, but I suspect it might take a fair bit longer than a week. The truth is probably somewhere in the middle, though I suppose there's no way to know without trying it.
> Moreover, I believe the earlier versions of rustc only output LLVM, so you need to bootstrap a C++ compiler to compile LLVM anyway. If you have a C++ compiler, you might as well compile mrustc. Currently, mrustc only supports rustc 1.54, so you'd still have to compile through some 35 versions of it.
Not sure I follow - isn't rustc still only a compiler frontend to LLVM, like clang is for C/C++? So if you have any version of rustc, haven't you at that point kind of "arrived" and started bootstrapping it on itself, meaning mission complete?
Ultimately from what I glean the answer really is just that this would be made nicer with Dozer, but I still wish this was explicitly stated by the author in the post. It's not like the drudgery of the ocaml route escapes me.
The team at bootstrappable.org have been working very hard at creating compilers that can bootstrap from scratch to prevent this kind of attack (the "trusting trust" attack is another name for it.) They've gotten to the point where they can bootstrap in freestanding so they don't need to trust any OS binaries anymore (see builder-hex0.)
I've spent a lot of my spare time the past year or so working on my own attempt at a portable bootstrappable compiler. It's partly to prevent this attack, and also partly so that future archaeologists can easily bootstrap C even if their computer architectures can't run any binaries from the present day.
It's nowhere near done but I'm starting a new job soon so I felt like I needed to publish what I have. It does at least bootstrap from handwritten x86_64 machine code up to a compiler for most of C89, and I'm working on the final stage that will hopefully be able to compile TinyCC and other similar C compilers soon.
So bootstrap in freestanding does make this kind of attack much more difficult to pull off, but with contemporary hardware, it does not fully prevent the attack.
What if the trojan is in microcode? No amount of bootstrap in freestanding can protect you here.
It is true that there are many layers of code below the OS level. UEFI for example is probably hundreds of thousands of lines of compiled code. Modern processors have Intel IME and equivalent with their own secret firmware. Almost all modern peripherals will have microcontrollers with their own compiled code.
These are all genuine attack vectors but they are not really solvable from the software side. At least for Onramp I consider these problems to be out of scope. It may be possible to solve these with open hardware but a solution will look very different from the kind of software bootstrapping we're doing.
Correct me if I’m wrong, but isn’t this recreating a thing that used to exist? I have memories of being told of a compiler older than GCC that could compile itself using… I want to say a bash script. It took forever to run because you had to run the script which of course was slow, and then it output a completely unoptimized compiler. And if memory serves that output didn’t have any of the optimization logic in it. So you had to compile it again to get the optimizer passes to be compiled in, then compile it again to get a fast compiler (self optimization).
> That’t not "doing sketchy shit with people's data"
Of course it is. If you add AdSense to your website you are letting Google track your users in exchange for a cut of the profits. Of course you should have to warn your users that they are being tracked at the very least.
I was definitely hoping for a look at modern compilers. This article was written last month, yet its history ends with the release of LLVM. There's quite a lot of development in small C compilers lately!
- TinyCC, SCC (Simple C Compiler) and Kefir are all fairly serious projects in active development.
- QBE is a new optimizing backend much simpler than LLVM; cproc and cparser are two of the C compilers that target it, in addition to its own minic.
- There's the bootstrapping work of stage0, M2-Planet, and mescc.
- chibicc is an educational C compiler based on the author's earlier 8cc compiler. The author is writing a book about it.
- lacc is another simple compiler that works well, although development appears to have stalled.
I think a lot of these projects are inspired by the problem that GCC and Clang/LLVM are now gigantic C++ projects. A pure C system like OpenBSD or Linux/musl/BusyBox ought to be able to compile itself without relying on C++. We really need a production quality open source C compiler that is actually written in C. I'm hopeful one of these compilers can achieve that someday.
- Though freeware and not foss, there is also pellesc over at Windows land, with almost full C23 support.
- For small 8 bit systems, SDCC is an excellent choice, supporting even C23 features! Also its lead maintainer is a committee member with really useful contributions to the standard.
- I have heard the RiscOS compiler is pretty cool and supports modern standards. That one uses the Norcroft frontend.
I agree with you that we need a production-level C compiler written in C. Though that is not a simple task, and the C community nowadays prefers to engage in infighting over pedantic issues or Rust rather than working together. A simple example of this is the lack of a modern library ecosystem, while everyone and their mother has their own custom build system. Even though C is sold as a performant language, there isn't a single parallelism library like oneTBB, Kokkos or HPX over at C++. Don't get me started on vendors not offering good standard support (Microsoft, macOS libc, OpenBSD libc)...
One correction though, cparser uses libfirm as a backend, not qbe. Also the author of chibicc has stopped writing that book AFAIK.
Bonus non-c based entries:
- The zig community is working on arocc. Judging by the awesomeness of zig cc, this is really good news.
- Nvidia offers their EDG based nvc with OpenACC support for free these days, which is cool.
"This work is based on the observation that in cases where returning a stack-allocated value is desired, the value’s lifetime is typically still bounded, it just needs to live a little bit longer than the procedure that allocated it. So, what would happen if we just don’t pop the stack and delay it until one of the callers resets the stack, popping multiple frames at once? It turns out that this surprisingly simple idea can be made to work rather well."
> QBE is a new optimizing backend much simpler than LLVM; cproc and cparser are two of the C compilers that target it, in addition to its own minic.
I thought cparser targeted libFirm. That's what their GitHub page says [0].
"It acts as a frontend to the libFirm intermediate representation library."
> We really need a production quality open source C compiler that is actually written in C.
I honestly think cproc or cparser are almost there already. For cproc, you just need to improve the quality of code optimization; it's really QBE you'd need to change. For example, you could change unnecessary multiplications by powers of 2 into left shifts (edit: IDK if it's cproc or QBE that's responsible for this, actually), and you could improve instruction selection so that subtraction is always something like "sub rax, rdi" and not "neg rdi / add rax, rdi" [1]. It also doesn't take advantage of x86-64 addressing, e.g. it will do a separate addition and then a multiplication instead of "mov eax, [rdi + rsi * 8]".
For cparser, I notice slightly higher quality codegen; libFirm just needs more architecture support (e.g. AMD64 support appears to work for me, but it's labeled as experimental).
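To make the addressing-mode point concrete, a function as trivial as this (hypothetical example) is where it shows up: gcc and clang fold the index arithmetic into a single scaled-index load, while the pattern described above computes the address with separate instructions first.

    /* Indexing an array of 64-bit values. At -O1 and up, gcc and clang
     * emit a single scaled-index load for this, e.g.
     *     mov rax, [rdi + rsi*8]
     * whereas a backend without addressing-mode selection computes the
     * offset with a separate shift/add and then loads. */
    long get(long *p, long i) {
        return p[i];
    }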
You have the cproc compiler which does use the QBE backend. It generates much faster code than tcc since there are some basic optimization passes. On bz2 compression, with crude and basic testing, I got ~70% of the speed of gcc 13.1. tcc code is really slow; I am thinking of a QBE backend for tcc.
I would use that everywhere instead of the grotesquely and absurdly massive and complex gcc (and written in that horrible c++!). I would re-write some code hot spots in assembly. But it means those extra ~30% of performance are acutely expensive; at least they could have been careful to keep gcc written in simple and plain C99 (with benign bits of c11) to reduce the technical cost. Yeah, switching gcc to c++ is one of the biggest mistakes in open source software ever (hopefully that mistake is not related to the B. Gates donations to the MIT Media Lab revealed in the pedophile Epstein's files... if all that is true, though; not to mention it would explain the steady ensh*tification of GNU software).
The problem is Linux, which requires bazillions of gcc extensions to even compile correct kernel code nowadays. You can see clang (which is no better, actually even worse) playing catch-up with gcc for all the extension creep the kernel is getting.
All that stinks corpo-backed planned obsolescence, aka some kind of toxic worldwide scam.
8BitDo makes great controllers. I have an SN30 Pro, a Zero 2 and a pair of Ultimate Cs. All of them have been used extensively and they are all excellent quality.
On the other hand I've had bad luck with generic NES and SNES knockoff USB controllers. The quality is much worse than the originals, especially in the D-pad. It seems nobody but 8BitDo can get this right.
If you stick with 8BitDo you'll have great quality but they don't necessarily match the form factor of the originals. I can see why OP would want to convert a real one.
The form factor is really close though. Ignoring the added buttons, which is completely possible while gaming, it does not feel much different. https://www.onli-blogging.de/uploads/sf30pro.jpg for an example picture :)