Tiny C Compiler (bellard.org)
307 points by Koshkin on June 23, 2020 | 106 comments


It's not just a tiny C compiler - it's a compiler you can use as a static library that can compile your C code directly to memory. And then you can call into it, if you dare.
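
For the curious, the compile-to-memory flow looks roughly like this (a sketch against libtcc; tcc_relocate's signature has changed across TCC releases, so check your libtcc.h):

    #include <stdio.h>
    #include <libtcc.h>

    int main(void) {
        TCCState *s = tcc_new();
        tcc_set_output_type(s, TCC_OUTPUT_MEMORY);   /* compile into RAM */
        if (tcc_compile_string(s, "int add(int a, int b) { return a + b; }") == -1)
            return 1;
        if (tcc_relocate(s, TCC_RELOCATE_AUTO) < 0)  /* older two-arg form */
            return 1;
        int (*add)(int, int) = (int (*)(int, int))tcc_get_symbol(s, "add");
        printf("%d\n", add(2, 3));                   /* call the fresh code */
        tcc_delete(s);
        return 0;
    }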

I used it as a scripting backend for a long time, but eventually you realize that you are not the only one who can write scripts, and that you really need a trust boundary (or just a different address space), so that errors in a script don't drag the whole thing down.

That's where Lua and LuaJIT shine: it's simple, and it's sandboxed. However, there is still one thing missing. These days, many programming languages have several targets they can compile to, and so imagine if you could emulate some platform at high performance with low overhead, in a sandbox. You would then be able to script in whatever language you so desire, provided it has that target.

Unfortunately, those languages tend to be systems languages and not the absolute best for scripting. With one big exception: compile-time programming support.


LuaJIT is not sandboxed.

Mike Pall told me this in no uncertain terms: http://lua-users.org/lists/lua-l/2011-02/msg01582.html

> The only reasonably safe way to run untrusted/malicious Lua scripts is to sandbox it at the process level. Everything else has far too many loopholes.


The issue isn't that Lua can't nominally provide a sandboxed environment--it can, better than almost any other language. The central issue is whether LuaJIT, PUC Lua, or any other particular piece of software can be made sufficiently free of bugs that you can trust such a sandbox to run potentially malicious code.

The answer in the case of LuaJIT is definitely no, because the JIT engine is sufficiently complex that exploits are inevitable. Note that this is also the case with JavaScript. Many browser exploits start with some codegen bug in V8 or SpiderMonkey. And there are many more eyeballs looking to fix bugs in V8 and SpiderMonkey than LuaJIT, so in the case of LuaJIT the prudent answer is that you should never trust it to run potentially malicious code.

The case for PUC Lua is more nuanced. Lua used to have a bytecode verifier, but it was removed for 5.2 because too many bugs were found in the verifier, and because the VM relied heavily on the verifier to filter bad opcode patterns, those bugs led to sandbox breakouts. This convinced the PUC developers that the verifier left Lua worse off than a more holistic emphasis on correctness and robustness, so they dropped it. They also dropped any pretense that you could safely sandbox bytecode (i.e. precompiled scripts); if you want a sandbox in Lua you should only load untrusted code as plain Lua scripts into the sandboxed environment. To that end, Lua 5.2 added a parameter to all code-loading APIs that specifies whether to accept text scripts or binary bytecode. In other words, the bytecode verifier was considered a third wheel, so they removed it and redirected attention to the compiler and the rest of the VM.
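
For illustration, here's roughly what that looks like from the embedding side (a sketch; the helper name is ours, but luaL_loadbufferx and its "mode" argument are the stock Lua 5.2+ C API):

    #include <lua.h>
    #include <lauxlib.h>

    /* mode "t" accepts text only; "b" or "bt" would also accept bytecode */
    int load_untrusted(lua_State *L, const char *src, size_t len) {
        if (luaL_loadbufferx(L, src, len, "=untrusted", "t") != LUA_OK)
            return -1;  /* syntax error, or a binary chunk was rejected */
        /* assumes the caller has already set up a restricted environment */
        return lua_pcall(L, 0, 0, 0) == LUA_OK ? 0 : -1;
    }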

So for PUC Lua the issue really comes down to how prudent it is to draw a trust boundary around a pure Lua sandbox; or rather, what your adversarial model is, precisely. PUC Lua is committed to Lua's sandboxing features, but many developers are firmly of the opinion that the only way to run untrusted code, if you're to run untrusted code at all, is using either a hardware VM or a very strict seccomp jail. If you're of the latter opinion, the language is irrelevant--you shouldn't trust Lua, JavaScript, Java, or any other language environment, period. In practice, however, even people of the latter opinion generally apply the principle of defense in depth. That's why browser JavaScript APIs and capabilities are still relatively limited, even though browsers execute JavaScript in OS-based sandboxes. In most practical contexts Lua's sandboxing features still provide great value; it just needs to be understood that they're a complement to, rather than a substitute for, process sandboxing.

The same analysis applies to WebAssembly, FWIW, especially JIT'ing WASM environments. Anybody who thinks WASM is a magic cure-all for running untrusted code is mistaken.


These types of comments are why I still read HN.


Research projects like RockJIT [0] can introduce a more structured approach to some aspects of JIT security. To my knowledge though it hasn't seen real-world use.

Presumably a formal approach could also work. CompCert exists, after all.

[0] https://dl.acm.org/doi/pdf/10.1145/2660267.2660281


I see, that's unfortunate. I don't have those issues in my emulated environment though. I don't use Lua anymore - I'm running emulated RISC-V in my own emulator.

I have been benchmarking against LuaJIT for the longest time because I thought it was an equal in that respect. Guess it should have been against regular Lua.


For what it's worth, I think his statement applies to regular Lua too, see: http://lua-users.org/lists/lua-l/2011-02/msg01595.html


I think there’s a difference between trying to prevent untrusted code from escaping a sandbox, and trying to prevent it from locking up the CPU.

Lua is commonly used for scripting games. If I load an untrusted script into the game I’m playing and it can access all of my files, it’s potentially catastrophic. OTOH if the worst an untrusted script can do is use 100% CPU (and cause the game I’m playing to lock up so I need to restart), it’s a minor annoyance.

AFAIK in the absence of bugs (and regular Lua is pretty close to that [0]), Lua's sandboxing features are sufficient to protect against the former but not the latter. Am I right to think that's often good enough? It seems a huge improvement over embedding, say, Python - I wonder how it compares to embedding a JavaScript engine like V8?

[0] https://www.lua.org/bugs.html
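
For what it's worth, the lockup case can at least be mitigated from the host with an instruction-count hook; a sketch against the stock Lua C API (the helper name and budget are our own):

    #include <lua.h>
    #include <lauxlib.h>

    static void count_hook(lua_State *L, lua_Debug *ar) {
        (void)ar;
        luaL_error(L, "script exceeded its instruction budget");
    }

    void run_with_budget(lua_State *L, const char *src) {
        /* abort the chunk after ~100M VM instructions */
        lua_sethook(L, count_hook, LUA_MASKCOUNT, 100000000);
        if (luaL_dostring(L, src) != LUA_OK)
            lua_pop(L, 1);  /* the chunk dies, the host survives */
        lua_sethook(L, NULL, 0, 0);
    }
Memory is the other exhaustible resource; a custom allocator installed via lua_setallocf can cap that.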


A small history of LuaJIT sandbox escapes:

- 5.2, 2016: https://apocrypha.numin.it/talks/lua_bytecode_exploitation.p... (9MB PDF)

- 5.2, 2016: https://github.com/erezto/lua-sandbox-escape

- 5.1, 2015: https://www.corsix.org/content/malicious-luajit-bytecode (warning: dense)

There's a "luarop" link (boop.i0i0.me/blog.lua/luarop) referenced in the PDF, but the link sadly seems to have died (IA never crawled the domain).


The first two of those seem to be for regular Lua, not LuaJIT.

Do any of them work without needing to load arbitrary bytecode, which is known to be insecure?


Oh duh, facepalm. The significance of that tidbit was completely lost on me...

And presumably not.


You can see this technique being used in TCCBOOT, which uses tcc as a boot loader to compile a Linux kernel into memory and then run it, all in a handful of seconds.

https://bellard.org/tcc/tccboot.html


Yeah well, modern Linux cannot be compiled in 15 seconds. The link is from 2004. OTOH, I wonder how quickly a modern Ryzen would be able to boot a 2004 kernel from source.


The Threadripper 3970x can compile the 5.x kernel in ~24 seconds. That's not too shabby. https://www.phoronix.com/scan.php?page=article&item=amd-linu...


Kernel compile time has apparently stayed pretty static over the years (or at least tracked with perf improvements). That being said, I think that takes into account SMP which probably wouldn't be easily accessible to tccboot.


I feel like embedding a WebAssembly runtime and then exposing a scripting API to it would give you similar benefits (you would have to add a compile-C-to-WASM step before loading) while giving you a really secure sandbox.

It is more complicated than just embedding a scripting language (or a scripting compiler :)) but more general and secure.

Especially if WASM evolves to a point where you can compile C# and other high-level languages to it (you can to a point already, but it's not on par with native runtimes) - it would be the most general embedding runtime.
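
For instance, with an off-the-shelf interpreter like wasm3 this is already pretty compact (a sketch from memory, so treat the exact signatures as illustrative; the module is assumed to export an "add" function):

    #include <stdio.h>
    #include <wasm3.h>

    int run_add(const unsigned char *wasm, unsigned len) {
        IM3Environment env = m3_NewEnvironment();
        IM3Runtime rt = m3_NewRuntime(env, 64 * 1024, NULL);  /* 64 KiB stack */
        IM3Module mod;
        IM3Function add;
        int result;
        if (m3_ParseModule(env, &mod, wasm, len)) return -1;
        if (m3_LoadModule(rt, mod)) return -1;   /* sandboxed from here on */
        if (m3_FindFunction(&add, rt, "add")) return -1;
        m3_CallV(add, 2, 3);                     /* runs inside the sandbox */
        m3_GetResultsV(add, &result);
        printf("%d\n", result);
        m3_FreeRuntime(rt);
        m3_FreeEnvironment(env);
        return 0;
    }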


Is a WASM runtime complex to write? Without JIT or anything fancy, is it at the level of making a Game Boy emulator, or an LLVM backend?


> and so imagine if you could emulate some platform at high performance with low overhead, in a sandbox. You would then be able to script in whatever language you so desire, provided it has that target.

My understanding is that this is what Google's PNaCl was about (https://developer.chrome.com/native-client/nacl-and-pnacl): leveraging LLVM bitcode for both portability (source and target) and native performance, while being able to properly sandbox the resulting executable.


Have you actually gotten the static library and compile to memory mode to work?


Yes, I used it in a game engine for many years, on Windows and Linux. No problems on that part.

I don't really recommend doing it, unless you want to do it as a learning experiment. It was a fun thing to do.


I got it to work as a DLL for calling into C from Common Lisp when I needed something that was header-file dependent. It was really just a proof of concept though, because if you distribute source, you can require that users have a C compiler installed just as easily as libtcc, and if you distribute a binary, you can grovel ahead of time and don't need libtcc...


I have tried the latter and it seems to work fine.


The author is a genius: a while ago he implemented an x86 virtual machine in JavaScript: https://bellard.org/jslinux/vm.html?url=buildroot-x86.cfg


>The author is a genius

Understatement of the century! :-) The breadth of his work, all non-trivial, is truly humbling. All done without fanfare or self-aggrandizement. I can't even find a good interview of him to understand his mind and thought-process.


Is there a good magazine, news outlet or blog we could tip off, and try to get them to make an interview?


The usual suspects: Reddit, Slashdot, ArsTechnica and of course HN.

And the interview should be purely on the technical side, i.e. his educational background, how he approaches design and programming, how he learns new technical domains, his thoughts on languages, advice to young programmers, etc. Given his off-the-charts productivity, I suspect he has a highly efficient way of learning and doing things which I would like to copy :-)


He created FFmpeg, QEMU and claimed the pi digits record... A legend.


He also created LZEXE in the late 80s, one of the first executable compressors: https://bellard.org/lzexe.html


Totally second that. Just think for a moment how much of modern computing relies on his work, even just looking at FFmpeg and QEMU. The guy should get a United Nations medal and a pile of gold for services to (computing) humanity.


...and as if that wasn't enough, he also wrote a JavaScript interpreter: https://bellard.org/quickjs/


... and has contributed to the state of the art of computing Pi’s digits - IIRC, he found a previously unknown expansion of the digits of Pi, and used it to calculate more digits than anyone before him.


Want to "skip" the compile step?

  //usr/bin/tcc -run "$0"; exit
  #include <stdio.h>
  int main() { printf("Hello\n"); return 0; }
Just `chmod +x` and run it...


It's obviously not the same, but you can emulate this with a conventional compiler:

  #if 0
  set -e; [ "$0" -nt "$0.bin" ] &&
  cc "$0" -o "$0.bin"
  exec "$0.bin" "$@"
  #endif
  
  #include <stdio.h>

  int
  main(int argc, char *argv[]) {
    puts("Hello world!");
    return 0;
  }
It works because a file without a shebang line is, by default, executed by the system shell, which treats the #if 0 and #endif lines as comments.


How does this work? Is there an alternate form for #! (aka shebang)?

EDIT: hah, I should have realized that "//" is the C comment and "//usr/bin/tcc" is equivalent to "/usr/bin/tcc". Clever!


I always love these things that are really scripts in two (or more) languages. A common one is something like this on Windows, to turn a command script into a Perl script. It lets you run your Perl script without having to make sure you had .pl extensions associated. Though, you needed Perl installed, so I never quite got the point.

    @rem = '--*--Perl-*--
    @echo off
    echo Hello from a command script
    perl -x -S %0 %*
    goto endofperl
    @rem ';
    #!perl
    print "Hello from Perl!\n";
    __END__
    :endofperl
Far less common is stuffing some Perl code into a C# source file. A dev who was way too clever for his own good did this to have a Perl script that would update a C# source file and be self-contained. This is a simple example of what it looked like, minus the instant headache anyone got who had to look at the real version:

    #if PerlScript
    $csharp=<<WhyJustWhy;
    #endif
    using System;

    namespace Example
    {
        public static class Program
        {
            static void Main()
            {
                Console.WriteLine("Hello from C#");
            }
        }
    }

    #if PerlScript
    WhyJustWhy
    print("Hello from Perl!")
    __END__
    #endif


"C comment," you mean.

So it's running the file as a shell script, where the first line runs tcc on the current file and quits.


Fixed typo, thanks. Yeah for some reason it didn't click when I saw it.


From man tcc(1):

    #!/usr/bin/tcc -run
    #include <stdio.h>
    
    int main()
    {
        printf("Hello World\n");
        return 0;
    }


D code can also be run this way:

    #!/usr/bin/rdmd
    void main() { import std.stdio; writeln("hello"); }


And add the -b commandline flag to enable the memory and bounds checker.
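
A quick way to see it in action; the checker should trap the out-of-range write below at runtime:

    /* bounds.c - try: tcc -b -run bounds.c */
    #include <stdio.h>

    int main(void) {
        int a[4];
        a[4] = 42;            /* one past the end */
        printf("%d\n", a[4]);
        return 0;
    }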


Just as a side note... It's important to say that it still does get compiled... it's not some kind of interpreted mode.


I studied the source code of TCC four years ago, and the experience was... pretty awful. Until then, I never knew that compilers could be non-reentrant and just follow the Unix philosophy -- use global state, do your thing as minimally as you can, and let people spawn your program as a process so you can turn a blind eye to the lack of reentrancy. This means you can't embed TCC in your program without TLS or a global lock around each compilation. I think I've lost my reentrancy patch by now; the maintainers said they wouldn't accept it :(

Now I carry on and would like to make a TCC rival. I also wondered if I could make a TCC++, until I learnt that not only is C++'s syntax ambiguous (so we need some form of context sensitivity; GLR is one good technique for handling the C++ type of thing), but C++ templates are also Turing-complete, effectively meaning it would be hellishly hard. I abandoned the idea thereafter.


"I also wondered if I can make a TCCPP until Iearnt not only C++ syntax is ambiguous (and so we need some form of context sensitivity)"

I believe a C parser also needs some form of context sensitivity. "T * x;" can be a declaration or a statement. However, C++'s syntax is larger than C's, and I believe there may be more cases where context sensitivity is needed to resolve ambiguities.

https://eli.thegreenplace.net/2007/11/24/the-context-sensiti...
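
A toy example of that context sensitivity (the names are ours): the same token sequence flips meaning depending on what T denotes in scope.

    typedef int T;

    void f(void) {
        T *x;       /* T names a type here: declares x as pointer to int */
        (void)x;
    }

    void g(double T) {
        double x = 2;
        T * x;      /* T is a variable here: a multiply, value discarded */
    }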


Another great piece by Eli, on how Clang solves this problem and handles nested functions inside C++ classes, is https://eli.thegreenplace.net/2012/07/05/how-clang-handles-t....

Clang basically moves the semantic processing to another pass, and tokenizes both types and variables as 'identifiers'.

I'd agree that C/C++ have some sort of context sensitivity. However, I think this is not an ambiguity. You can deduce a single logical solution when you see "T * x;": if the T in scope is a variable, you do multiplication; if the T in scope is a type definition, you make a variable declaration.

Real ambiguity happens when there are two logical solutions to the same statement and the language specification has to prefer one to remove ambiguity and undefined behaviour.

For example, a C# snippet from the spec that is ambiguous and therefore doesn't compile:

    static void F(bool a, bool b) { Console.WriteLine($"{a} and {b}"); }

    static void Main(string[] args) {
        int G = 2; int A = 1; int B = 2;
        F(true, false);
        F(G<A, B>(7));
    }
The compiler prefers to interpret the last line as a function call with one argument -- a call to a generic method G (which doesn't exist in the code above) with two type arguments and one regular argument -- instead of one function call with two boolean arguments, G<A and B>(7).


It can be complicated even when it's not ambiguous. Consider:

   #include <cstddef>  // for size_t
   using std::size_t;

   template<size_t N = sizeof(void*)> struct a;

   template<> struct a<4> {
       enum { b };
   };

   template<> struct a<8> {
       template<int> struct b {};
   };

   enum { c, d };

   int main() {
       a<>::b<c>d;
   }
This is not ambiguous, but the meaning of the line inside main() depends on the compiler and the target - it can be either a declaration:

   a<>::b<c> d;
or an expression:

   a<>::b < c > d;
depending on which b gets selected. This in turn affects future references to d etc.


Note that this is only an issue in a non-generic context. For dependent names you have to disambiguate with "typename" and "template".

But yeah, it looks like a C++ compiler has to instantiate templates in lock-step with parsing to make this work, but possibly there are some tricks that are applicable here. In any case, this looks like a hard problem.


I am not familiar with C++ templates, but isn't this a similarly ambiguous pattern? If the compiler doesn't evaluate the sizeof's before parsing, there is still more than one equally logical way to parse it.

Nevertheless I think the C# way of choosing generics over comparison without semantic processing is strange. The example above should run as intended (as a simple function call with two arguments) and not fail because of some spec rule.


I vaguely remember C++ template crocodiles (angle brackets) being whitespace-sensitive some time ago, when templates were nested in a variable declaration.


Yes, the lexer considered >> a single token, so you had to write > >. The standard was changed in C++11 to allow the syntax; that definitely complicated parsing, but at least now heavily templated code looks 100% less ugly.


No, that typedef problem is what C++ inherited from C, and it's easily handled if you have a linear-bounded TM (actually, a long time ago I thought that if our parser can interpret different states with different extra memory, like a symbol table and an AST, it is actually mimicking a restricted TM... until I learnt some details of the Chomsky hierarchy).

What I mean is something like "Foo<Bar<Baz>>": the ">>" part can be a little challenging, because without context we don't know whether it is a right shift or not.

C++'s template system is also Turing-complete: it can implement boundless recursion and infinitely expand on its own without terminating... essentially, compilation may not even halt. That's why template metaprogramming is a (hard and horrible) thing... it's a sublanguage that uses C++ syntax and generates C++ code, making the parsing even more non-deterministic.


It's worth reading through all the other bellard projects. He is an amazingly productive developer.

https://news.ycombinator.com/from?site=bellard.org


Perhaps most famously, the original creator of both ffmpeg and qemu.



The author is incredible:

> On 31 December 2009 he claimed the world record for calculations of pi, having calculated it to nearly 2.7 trillion places in 90 days. Slashdot wrote: "While the improvement may seem small, it is an outstanding achievement because only a single desktop PC, costing less than US$3,000, was used—instead of a multi-million dollar supercomputer as in the previous records."[10][11] On 2 August 2010 this record was eclipsed by Shigeru Kondo who computed 5 trillion digits, although this was done using a server-class machine running dual Intel Xeon processors, equipped with 96 GB of RAM.


I still wonder about the feasibility of Rob Landley's idea of QCC - TCC with a TCG backend, the code generator used by QEMU, which originally came from TCC itself. That would make it fast to compile, with quite an extensive repertoire of architectures covered.

https://landley.net/qcc/


On a somewhat related note, I always thought dietlibc was a pretty cool project (basically trying to get libc to be very tiny). It's been a pretty long time since I browsed the code but I remember it being quite elegant.

https://www.fefe.de/dietlibc/


In my experience, musl libc [1] is the more popular project in this space.

[1] https://musl.libc.org/


The creator of the Zig language wrote in praise of musl:

https://andrewkelley.me/post/why-donating-to-musl-libc-proje...


> 2. Fabrice Bellard's Tiny C Compiler. You can't compile the diet libc with it.

:(


Heads up: Though amazing, the compiler makes use of the deprecated POSIX setcontext API[1], which means you may have trouble running it on some modern distributions.

[1] https://en.wikipedia.org/wiki/Setcontext


However, you can switch to pthread instead by using the -pthread flag or the -D_REENTRANT and -lpthread flags, if you don't want to rely on setcontext.


Do modern distros really not have setcontext? If they don't it shouldn't be too hard to implement in assembly.


It's a very tiny API, and requires little assembly, so I agree.


It's not dead, I've been using it from git for years.

https://repo.or.cz/w/tinycc.git


wtf are those tags?


Looks like anyone can add them right next to the tag cloud...


Classic mistake to allow anyone to add content to a public site anonymously


Yeah, I raised this on the mailing list almost a year ago. The people maintaining TinyCC know about it but don't care enough to take action.


Anyone impressed by the compile times of modern languages such as Swift, Rust or Go should try TCC at least once in their lifetime.

Both compilation and linking are generally finished before your fingers have released the enter key...


I promise you, no one is impressed by Rust’s compile times.


You should not mention "Rust" and "fast compile time" in the same sentence, unfortunately.

https://prev.rust-lang.org/en-US/faq.html#why-is-rustc-slow


That was sarcasm on my side ^^


The compilation speed is a revelation, especially on older hardware: https://gist.github.com/cellularmitosis/6c3a17b54881899b645a...


Is it similar to PCC?


My main experience with tcc was that it was the only C compiler available for Damn Small Linux (a <50 MB Linux distro with X and Firefox).


I loved DSL! I can’t remember the specs, but I had an ancient machine lying around in about 2004, that would have been otherwise worthless, and put DSL on it just to check it out. I remember being so impressed with how productive I could be and how zippy that old machine could be with such a lightweight OS. Fond memories :)


I bumped into DSL in ~2006 when I was looking for something that would run on an old Pentium 120 MHz. I even bought the book!


How does DSL compare with Alpine?


I wouldn't use DSL these days, it's very old :)

DSL's goal was to provide most of the tools you might need in a tiny footprint. The platform target was the old "thin-client" style machines, which might pack a 500 MHz AMD Geode or a VIA C3 processor and have either no internal storage or maybe a few hundred megabytes for preboot environments.

They were able to fit X (Xfree86, iirc) w/ Fluxbox, Firefox, Ted (a word processor), and a few other things into a 50 MB "bizcard" image (PCMCIA cards & compact flash).

Another thing it was designed to do was allow you to install most of the common things to a small internal storage device, while running the rest off of optical media.

It dates to the time before common support for USB booting. Also kinda crazy to realize you could fit Firefox, a windowing environment, and a kernel into 50 MiB while today the standard vim install is 33 MiB...


*it used Xvesa, not Xfree86.


Alpine doesn't use a 2.4 kernel.

(In all seriousness, I am curious as well.)


I mean, the project kinda died in the 2008 timeframe...


How funny, I use tcc /all the time/ and I was just tinkering with one 'script' [0] the other day. Basically, more often than not it's easier/quicker for me to write stuff in C than to fight with another scripting language that will break the next time I want to use that 'script'. C works.

Of course I could compile it, but really, I don't have to, and for developing/trying that sort of small C tool, tcc is just unbeatable.

[0]: https://gist.github.com/buserror/227a7e64c92acece821ec8ee587...


For anyone who wants to read more about Fabrice Bellard:

https://web.archive.org/web/20110726063943/http://www.freear...



> Measures were done on a 2.4 GHz Pentium 4. Real time is measured. Compilation time includes compilation, assembly and linking.

How old is this project?


> (Nov 18, 2001) TCC version 0.9 is out. First public version.


This has already been discussed numerous times. But I never miss a chance to sing praises to Bellard. One of the very few real programming geniuses.


I tried to use it for toy C programs on Windows 10, but for some reason I cannot run these programs from the command prompt afterwards. It works only if I use tcc -run or something like that.


I wanted to comb through the code without cloning, but can't tell if that's possible with the git web UI they're using. Anybody know if it is?


They're using gitweb. You can read the code for any commit, just click the 'tree' link associated with it. Alternatively, just click the 'tree' link at the top for the tree of the latest commit in their default 'mob' branch.


Yes, you can look at the code here: https://repo.or.cz/tinycc.git/tree


As a side note, I'm surprised how big Links is!


tcc is part of the Guix bootstrap project https://guix.gnu.org/blog/2019/guix-reduces-bootstrap-seed-b... which aims to be able to rebuild everything from source code plus the smallest possible binary seed.


So that's currently scheme-blob -> cc-in-scheme -> tcc -> ancient-gcc -> proper-gcc? Wow, that's something that really deserves the term bootstrapping.


It looks like there is a real effort now to get TCC working on macOS which would be amazing.


Can't use it on iOS, but I guess maybe on the ARM Macs...


For some time it has had a riscv64 backend.


What is the author up to now?



On another compiler: https://bellard.org/ffasn1/


Anyone up for reviving tccboot?


It offers bounds checking! The documentation links to

http://www.doc.ic.ac.uk/~phjk/BoundsChecking.html

Someone needs to add this to LLVM.


Whilst TCC's bounds checking was a rare feature for quite some time, the bigger compilers do now actually offer it.

Clang supports bounds checking as part of -fsanitize=address, (though with a few more flags you can _just_ have bounds checking instead of the other sanitisation options). (Since around 2015?)

GCC supports bounds checking and other checks; depending on which frontend you're using, the options change. (Like -fbounds-checking for C, and -fcheck=all for gfortran). (Since around 2013? GCC 4.7.1)

Even Intel has -check=bounds. (Though not under macOS). (Since around 2015?)


Wow did not know that.

Is there a distribution that offers bounds checking for all Linux software? I wonder how slow a typical LAMP stack would be. My guess is no more than 5x. I think that's an acceptable tradeoff. I'm guessing there's a way to add global compiler flags in source distributions like Arch Linux / BSD.


Gentoo has a few ways to set up global C flags [0], because the default is to compile all your software yourself.

(Arch isn't actually a source distro, it's binary, though rolling release. Gentoo is, however).

[0] https://wiki.gentoo.org/wiki/USE_flag



