It's not just a tiny C compiler - it's a compiler you can use as a static library that can compile your C code directly to memory. And then you can call into it, if you dare.
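The core loop is tiny. Here is a minimal sketch of the libtcc flow (error handling elided, and note that the tcc_relocate() signature has varied between tcc releases):

#include <stdio.h>
#include <libtcc.h>

int main(void) {
    TCCState *s = tcc_new();
    tcc_set_output_type(s, TCC_OUTPUT_MEMORY);   /* compile straight to RAM */
    tcc_compile_string(s, "int add(int a, int b) { return a + b; }");
    if (tcc_relocate(s, TCC_RELOCATE_AUTO) < 0)  /* link in place */
        return 1;
    int (*add)(int, int) = tcc_get_symbol(s, "add");
    printf("2 + 3 = %d\n", add(2, 3));           /* call into the fresh code */
    tcc_delete(s);
    return 0;
}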
I used it as a scripting backend for a long time, but eventually you realize that you are not the only one who will be writing scripts, and that you really need a trust boundary (or at least a different address space) so that errors in a script don't drag the whole thing down.
That's where Lua and LuaJIT shine: they're simple, and they're sandboxed. However, there is still one thing missing. These days, many programming languages have several targets they can compile to. So imagine if you could emulate some platform at high performance, with low overhead, in a sandbox. You would then be able to script in whatever language you so desire, provided it has that target.
Unfortunately, those languages tend to be systems languages and not the absolute best for scripting. With one big exception: compile-time programming support.
> The only reasonably safe way to run untrusted/malicious Lua scripts is to sandbox it at the process level. Everything else has far too many loopholes.
The issue isn't that Lua can't nominally provide a sandboxed environment--it can, better than almost any other language. The central issue is whether LuaJIT, PUC Lua, or any other particular piece of software can be made sufficiently free of bugs that you can trust such a sandbox to run potentially malicious code.
The answer in the case of LuaJIT is definitely no, because the JIT engine is sufficiently complex that exploits are inevitable. Note that this is also the case with JavaScript. Many browser exploits start with some codegen bug in V8 or SpiderMonkey. And there are many more eyeballs looking to fix bugs in V8 and SpiderMonkey than LuaJIT, so in the case of LuaJIT the prudent answer is that you should never trust it to run potentially malicious code.
The case for PUC Lua is more nuanced. Lua used to have a bytecode verifier, but it was removed for 5.2 because too many bugs were found in the verifier, and because the VM relied heavily on the verifier to filter bad opcode patterns, those bugs led to sandbox breakouts. This made the PUC developers believe that the verifier left Lua worse off compared to a more holistic emphasis on correctness and robustness, so they dropped it. They also dropped any pretense that you could safely sandbox bytecode (i.e. precompiled scripts); if you want a sandbox in Lua, you should only load untrusted code as plain Lua scripts into the sandboxed environment. To that end, Lua 5.2 added a parameter to all code-loading APIs that specifies whether to accept text scripts, binary bytecode, or both. In other words, the bytecode verifier was considered a third wheel, so they removed it and redirected attention to the compiler and the rest of the VM.
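For illustration, roughly what that looks like host-side through the C API in Lua 5.2+ (a minimal sketch; the function name and the empty whitelist table are mine):

#include <lua.h>
#include <lauxlib.h>

/* Load untrusted code as text only ("t"), never as precompiled bytecode,
   and swap in a stripped-down _ENV so the chunk sees nothing we haven't
   explicitly whitelisted. */
static int run_untrusted(lua_State *L, const char *src, size_t len) {
    if (luaL_loadbufferx(L, src, len, "=untrusted", "t") != LUA_OK)
        return -1;               /* rejects bytecode and syntax errors */
    lua_newtable(L);             /* empty environment: add safe functions here */
    lua_setupvalue(L, -2, 1);    /* upvalue 1 of a main chunk is _ENV */
    return lua_pcall(L, 0, 0, 0) == LUA_OK ? 0 : -1;
}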
So for PUC Lua the issue really comes down to how prudent it is to draw a trust boundary around a pure Lua sandbox; or rather, what your adversarial model is, precisely. PUC Lua is committed to Lua's sandboxing features, but many developers are firmly of the opinion that the only way to run untrusted code, if you're to run untrusted code at all, is in either a hardware VM or a very strict seccomp jail. If you're of the latter opinion, the language is irrelevant--you shouldn't trust Lua, JavaScript, Java, or any other language environment, period. In practice, however, even people of the latter opinion generally apply the principle of defense in depth. That's why browser JavaScript APIs and capabilities are still relatively limited, even though browsers execute JavaScript in OS-based sandboxes. In most practical contexts Lua's sandboxing features still provide great value; it just needs to be understood that they're a complement to, rather than a substitute for, process sandboxing.
The same analysis applies to WebAssembly, FWIW, especially JIT'ing WASM environments. Anybody who thinks WASM is a magic cure-all for running untrusted code is mistaken.
Research projects like RockJIT [0] can introduce a more structured approach to some aspects of JIT security. To my knowledge though it hasn't seen real-world use.
Presumably a formal approach could also work. CompCert exists, after all.
I see, that's unfortunate. I don't have those issues in my emulated environment though. I don't use Lua anymore - I'm running emulated RISC-V in my own emulator.
I have been benchmarking against LuaJIT for the longest time because I thought it was an equal in that respect. Guess it should have been against regular Lua.
I think there’s a difference between trying to prevent untrusted code from escaping a sandbox, and trying to prevent it from locking up the CPU.
Lua is commonly used for scripting games. If I load an untrusted script into the game I’m playing and it can access all of my files, it’s potentially catastrophic. OTOH if the worst an untrusted script can do is use 100% CPU (and cause the game I’m playing to lock up so I need to restart), it’s a minor annoyance.
AFAIK in the absence of bugs (and regular Lua is pretty close to that [0]), Lua’s sandboxing features are sufficient to protect against the former but not the latter. Am I right to think that’s often good enough? It seems a huge improvement over embedding, say, Python - I wonder how it compares to embedding a JavaScript engine like V8?
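For the CPU side there is at least a partial mitigation in stock Lua: a debug count hook that errors out of runaway scripts. A minimal sketch (the budget number is arbitrary):

#include <lua.h>
#include <lauxlib.h>

/* Fires after every N VM instructions; raising an error aborts the script. */
static void budget_hook(lua_State *L, lua_Debug *ar) {
    (void)ar;
    luaL_error(L, "script exceeded its instruction budget");
}

/* Arm before lua_pcall(): here the hook fires every 100000 instructions. */
static void arm_budget(lua_State *L) {
    lua_sethook(L, budget_hook, LUA_MASKCOUNT, 100000);
}

Note this only interrupts pure Lua code; a C function that blocks or spins is untouched, so it's a mitigation rather than a guarantee.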
You can see this technique being used in TCCBOOT, which uses tcc as a boot loader to compile a Linux kernel into memory and then run it, all in a handful of seconds.
Yeah well, modern Linux cannot be compiled in 15 seconds. The link is from 2004. OTOH, I wonder how quickly a modern Ryzen would be able to boot the 2004 kernel from source.
Kernel compile time has apparently stayed pretty static over the years (or at least tracked with perf improvements). That being said, I think that takes into account SMP which probably wouldn't be easily accessible to tccboot.
I feel like embedding a WebAssembly runtime and then exposing a scripting API to it would give you similar benefits (you would have to add a compile-C-to-WASM step before loading) while giving you a really secure sandbox.
It is more complicated than just embedding a scripting language (or a scripting compiler :)) but more general and secure.
Especially if WASM evolves to a point where you can compile C# and other high-level languages to it (you can to a point already, but it's not on par with native runtimes) - it would be the most general embedding runtime.
> and so imagine if you could emulate some platform at high performance with low overhead, in a sandbox. You would then be able to script in whatever language you so desire, provided it has that target.
My understanding is that this is what Google's PNaCl was about (https://developer.chrome.com/native-client/nacl-and-pnacl): leveraging LLVM bitcode for both portability (source and target) and native performance, while being able to properly sandbox the resulting executable.
I got it to work as a DLL for calling into C from Common Lisp when I needed something that was header-file dependent. It was really just a proof of concept though, because if you distribute source, you can require that users have cc installed just as easily as libtcc, and if you distribute a binary, you can grovel ahead of time and don't need libtcc...
Understatement of the century! :-) The breadth of his work, all non-trivial, is truly humbling. All done without fanfare or self-aggrandizement. I can't even find a good interview of him to understand his mind and thought-process.
The usual suspects: Reddit, Slashdot, ArsTechnica and of course HN.
And the interview should be purely on the technical side, i.e. his educational background, how he approaches design and programming, how he learns new technical domains, his thoughts on languages, advice to young programmers, etc. Given his off-the-charts productivity, I suspect he has a highly efficient way of learning and doing things which I would like to copy :-)
Totally second that. Just think for a moment how much of modern computing relies on his work, even just looking at FFmpeg and QEMU. The guy should get a United Nations medal and a pile of gold for services to (computing) humanity.
... and has contributed to the state of the art of computing Pi’s digits - IIRC, he found a previously unknown expansion of the digits of Pi, and used it to calculate more digits than anyone before him.
I always love these things that are really scripts in two (or more) languages. A common one is something like this on Windows, to make a command script double as a Perl script. It lets you run your Perl script without having to make sure the .pl extension was associated. Though, you still needed Perl installed, so I never quite got the point.
@rem = '--*--Perl-*--
@echo off
echo Hello from a command script
perl -x -S %0 %*
goto endofperl
@rem ';
#!perl
print "Hello from Perl!\n";
__END__
:endofperl
Far less common is stuffing some Perl code into a C# source file. A dev who was way too clever for his own good did this to have a Perl script that would update a C# source file and be self-contained. This is a simple example of what it looked like, minus the instant headache it gave anyone who had to look at the real version:
#if PerlScript
$csharp=<<WhyJustWhy;
#endif
using System;
namespace Example
{
    public static class Program
    {
        static void Main()
        {
            Console.WriteLine("Hello from C#");
        }
    }
}
#if PerlScript
WhyJustWhy
print("Hello from Perl!")
__END__
#endif
I studied the source code of TCC 4 years ago, and the experience was... pretty awful. Until then, I never knew that compilers could be non-reentrant, following the Unix philosophy -- use global state, do your thing as minimalistically as you can, and let people spawn your program as a separate process to turn a blind eye to the lack of reentrancy. This means you can't embed TCC in your program without TLS or a global lock around each compilation, however, and I think I've already lost my reentrancy patch, since the maintainers said they wouldn't accept it :(
Now I carry on and would like to make a TCC rival. I also wondered if I could make a TCCPP, until I learnt that not only is C++ syntax ambiguous (so we need some form of context sensitivity, and GLR is one good technique to handle C++-type grammars), but C++ templates are also Turing-complete, which effectively means it would be hellishly hard. I abandoned the idea thereafter.
"I also wondered if I can make a TCCPP until Iearnt not only C++ syntax is ambiguous (and so we need some form of context sensitivity)"
I believe a C parser also needs some form of context sensitivity: "T * x;" can be a declaration or an expression statement, depending on what T names. However, C++'s syntax is larger than C's, and I believe there are more cases where context sensitivity is needed to resolve ambiguities.
Clang basically moves the semantic processing to another pass, and tokenizes both types and variables as ‘identifiers’.
I’d agree that C/C++ have some sort of context sensitivity. However, I think this is not an ambiguity. You can deduce a single logical solution when you see "T * x;": if the T in scope is a variable, you do a multiplication; if the T in scope is a type definition, you have a variable declaration.
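A tiny illustration of that deduction (a made-up snippet; both functions compile, the second with a "statement has no effect" warning):

typedef int T;

void f(void) {
    T * x;    /* T names a type here: declares x as "pointer to int" */
    (void)x;
}

void g(void) {
    int T = 6, x = 7;
    T * x;    /* T names a variable here: a multiplication whose result is discarded */
}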
Real ambiguity happens when there are two logical solutions to the same statement and the language specification has to prefer one to remove ambiguity and undefined behaviour.
For example, a C# snippet from the spec that is ambiguous and therefore doesn’t compile:
"
static void F(bool a, bool b) {
    Console.WriteLine($"{a} and {b}");
}

static void Main(string[] args) {
    int G = 2;
    int A = 1;
    int B = 2;
    F(true, false);
    F(G<A, B>(7));
}
"
The compiler prefers to interpret the last line as a function call with one argument -- a call to a generic method G (which doesn’t exist in the code above) with two type arguments and one regular argument -- instead of one function call with two boolean arguments.
Note that this is only an issue in non-generic context. For dependent names you have to disambiguate with "typename" and "template".
But yeah, it looks like a C++ compiler has to instantiate templates in lock-step with parsing to make this work, but possibly there are some tricks that are applicable here. In any case, this looks like a hard problem.
I am not familiar with C++ templates, but isn’t this a similar ambiguous pattern? If the compiler doesn’t evaluate the sizeof’s before parsing, there is still more than one equally logical way to parse it.
Nevertheless I think the C# way of choosing generics over comparison without semantic processing is strange. The example above should run as intended (as a simple function call with two arguments) and not fail because of some spec rule.
Yes, the lexer considered ">>" as a single token, so you had to write "> >". The standard was changed in C++11 to allow the syntax; that definitely complicated parsing, but at least now heavily templated code looks 100% less ugly.
No, that typedef problem is what C++ inherited from C, and it is easily handled if you have a linear-bounded TM (actually, a long time ago I thought that if our parser can track different states with different extra memory (like a symbol table and an AST), it is actually mimicking a restricted TM... until I learnt some details of the Chomsky hierarchy).
What I mean is something like "Foo<Bar<Baz>>", the ">>" part can be a little bit challenging because we don't know if this is a right shift or not without context
C++ is also Turing-complete, in that its template system can implement unbounded recursion and infinitely expand on its own without terminating... essentially, it means compilation may not even halt. That's why template metaprogramming is a (hard and horrible) thing... it's a sublanguage that uses C++ syntax and generates C++ code, making the parsing even more non-deterministic.
> On 31 December 2009 he claimed the world record for calculations of pi, having calculated it to nearly 2.7 trillion places in 90 days. Slashdot wrote: "While the improvement may seem small, it is an outstanding achievement because only a single desktop PC, costing less than US$3,000, was used—instead of a multi-million dollar supercomputer as in the previous records."[10][11] On 2 August 2010 this record was eclipsed by Shigeru Kondo who computed 5 trillion digits, although this was done using a server-class machine running dual Intel Xeon processors, equipped with 96 GB of RAM.
I still wonder about the feasibility of Rob Landley's idea of QCC - TCC with a TCG backend - TCG being the code generator used by qemu, which originally came from TCC itself. That would make it fast to compile, and with quite an extensive repertoire of architectures covered.
On a somewhat related note, I always thought dietlibc was a pretty cool project (basically trying to get libc to be very tiny). It's been a pretty long time since I browsed the code but I remember it being quite elegant.
Heads up: Though amazing, the compiler makes use of the deprecated POSIX setcontext API[1], which means you may have trouble running it on some modern distributions.
However, you can switch to pthread instead by using the -pthread flag or the -D_REENTRANT and -lpthread flags, if you don't want to rely on setcontext.
I loved DSL! I can’t remember the specs, but I had an ancient machine lying around in about 2004, that would have been otherwise worthless, and put DSL on it just to check it out. I remember being so impressed with how productive I could be and how zippy that old machine could be with such a lightweight OS. Fond memories :)
DSL's goal was to provide most of the tools you might need in a tiny footprint. The platform target was the old "thin-client" style machines, which might pack a 500 MHz AMD Geode or a VIA C3 processor and have either no internal storage or maybe a few hundred megabytes for preboot environments.
They were able to fit X (XFree86, IIRC) w/ Fluxbox, Firefox, Ted (a word processor), and a few other things into a 50 MB "bizcard" image (PCMCIA cards & compact flash).
Another thing it was designed to do was allow you to install most of the common things to a small internal storage device, while running the rest off of optical media.
It dates to the time before common support for USB booting. Also kinda crazy to realize you could fit Firefox, a windowing environment, and a kernel into 50 MiB while today the standard vim install is 33 MiB...
How funny, I use tcc /all the time/ and I was just tinkering with one 'script' [0] the other day. Basically, more often than not it's easier/quicker for me to write stuff in C than to fight with another scripting language that will break the next time I want to use that 'script'. C works.
Of course I could compile it, but it really doesn't have to be, and for developing/trying out that sort of small C tool, tcc is just unbeatable.
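For the record, tcc even honors a shebang line, so on Unix-likes a C file can be made directly executable (adjust the path to wherever your tcc lives):

#!/usr/bin/tcc -run
#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Hello from a C 'script': %s\n", argv[0]);
    return 0;
}

Then chmod +x hello.c && ./hello.c and off it goes.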
I tried to use that for toy C programs on Windows 10, but for some reason I cannot run these programs from the CMD prompt afterwards. It works only if I use tcc -run or something like that.
They're using gitweb. You can read the code for any commit, just click the 'tree' link associated with it. Alternatively, just click the 'tree' link at the top for the tree of the latest commit in their default 'mob' branch.
So that's currently scheme-blob -> cc-in-scheme -> tcc -> ancient-gcc -> proper-gcc? Wow, that's something that really deserves the term bootstrapping.
Whilst TCC's bounds checking was a rarity for quite some time, the bigger compilers do now actually offer it.
Clang supports bounds checking as part of -fsanitize=address (though with a few more flags you can _just_ have bounds checking instead of the other sanitisation options). (Since around 2015?)
GCC supports bounds checking and others; depending on which frontend you're using, the options change (like -fbounds-checking for C, and -fcheck=all for gfortran). (Since around 2013? GCC 4.7.1?)
Even Intel has -check=bounds (though not under macOS). (Since around 2015?)
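If you want to see the Clang one in action, here's a deliberate off-by-one that -fsanitize=address flags at runtime (file name and flags are just an example):

/* oob.c -- build with: clang -g -fsanitize=address oob.c && ./a.out */
#include <stdlib.h>

int main(void) {
    int *a = malloc(4 * sizeof *a);
    a[4] = 42;    /* one element past the end: ASan reports a heap-buffer-overflow */
    free(a);
    return 0;
}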
Is there a distribution that offers bounds checking for all Linux software? I wonder how slow a typical LAMP stack would be. My guess is no more than 5x, and I think that's an acceptable tradeoff. I'm guessing there's a way to add global compiler flags in source distributions like Arch Linux / BSD.