Building the fastest Lua interpreter automatically (sillycross.github.io)
571 points by hr0m on Nov 22, 2022 | 109 comments



> With the tail-call approach, each bytecode now gets its own function, and the pathological case for the C/C++ compiler is gone. And as shown by the experience of the Google protobuf developers, the tail-call approach can indeed be used to build very good interpreters. But can it push to the limit of hand-written assembly interpreters? Unfortunately, the answer is still no, at least at its current state.

> The main blockade to the tail-call approach is the callee-saved registers. Since each bytecode function is still a function, it is required to abide to the calling convention, specifically, every callee-saved register must retain its old value at function exit.

This is correct: wasting callee-saved registers is a shortcoming of the approach I published for protobuf parsing (linked from the first paragraph above). More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".
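
For readers less familiar with the tail-call interpreter style under discussion, here is a minimal sketch (hypothetical names and opcodes, relying on Clang's [[clang::musttail]]): every opcode handler shares one signature so the hot state travels in argument registers, and dispatch is a guaranteed tail call rather than a branch inside one giant function.

  #include <cstdint>
  #include <cstdio>

  struct VMState { uint64_t regs[256]; };
  using OpHandler = void (*)(VMState *vm, const uint8_t *pc, uint64_t acc);
  extern const OpHandler dispatch_table[256];

  #define DISPATCH() [[clang::musttail]] return dispatch_table[*pc](vm, pc, acc)

  static void op_halt(VMState *, const uint8_t *, uint64_t acc) {
    std::printf("acc = %llu\n", (unsigned long long)acc);
  }

  static void op_addi(VMState *vm, const uint8_t *pc, uint64_t acc) {
    acc += vm->regs[pc[1]];   // this opcode's actual work
    pc += 2;                  // skip opcode byte + operand byte
    DISPATCH();               // guaranteed tail call into the next handler
  }

  // Opcode 0 = halt, opcode 1 = addi; the other entries are left null for brevity.
  const OpHandler dispatch_table[256] = { op_halt, op_addi };

The calling-convention problem described above shows up here: each handler is still an ordinary function, so it must preserve the platform ABI's callee-saved registers even though the interpreter never needs them preserved.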

I need to spend some time reading this article in more detail, to more fully understand this new work. I would like to know if a new calling convention in Clang would have the same performance benefits, or if Deegen is able to perform optimizations that go beyond this. Inline caching seems like a higher-level technique that operates above the level of individual opcode dispatch, and therefore somewhat orthogonal.


> More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".

As explained in the article, LLVM already has exactly the calling convention you are looking for: the GHC convention (cc 10). You can use it to "pin" registers by passing arguments in the right spots. If you pin your argument in a callee-saved register of the C calling convention, it won't get clobbered after you do a C call.


I tried exposing the "cc 10" and "cc 11" calling conventions to Clang, but was unsuccessful. With "cc 10" I got crashes at runtime I could not explain. With "cc 11" I got a compile-time error:

> fatal error: error in backend: Can't generate HiPE prologue without runtime parameters

If "cc 10" would work from Clang I'd be happy to use that. Though I think reverse_cc could potentially still offer benefits by ordering arguments such that callee-save registers are assigned first. That is helpful when calling fallback functions that take arguments in the normal registers.


If you haven't already, I'd recommend compiling llvm/clang in debug or release+asserts mode when working on it. The codebase is quite heavy on debug-only assertions, even for relatively trivial things (like a missing implementation of some case, some combination of arguments being invalid, ...), which means it's pretty easy to get into weird crashes down the line with assertions disabled.


Yeah, you should do it at LLVM IR level.


Adding calling conventions to clang is fairly straightforward (e.g. https://reviews.llvm.org/D125970) but there's a limited size structure in clang somewhere that makes doing so slightly contentious. Partly because they're shared across all architectures. At some point we'll run out of bits and have to do something invasive and/or slow to free up space.


Yes, I've already experimented with doing this, see: https://github.com/haberman/llvm-project/commit/e8d9c75bb35c...

But when I tried it on my actual code, the results weren't quite as good as I hoped, due to sub-optimal register allocation.


I think I remember one called preserve_none. That might have been in a downstream fork though.

The calling convention you want for a coroutine switch is: caller saves everything live, callee saves nothing. That way, spilling is done only for the live set, to the stack in use before switching; the return after the stack switch then restores it.

So if a new convention is proposed that solves both fast stackful coroutines and bytecode interpreter overhead, that seems reasonable to me.


Have you read the Copy-and-Patch compilation paper? It seems broadly similar to what you're describing, and to Deegen. They use GHC's calling convention for their snippets, which has only caller-saved registers and passes all arguments in registers, in order to optimize stitching the snippets together (via JIT templating there, but with tail calls in your case it would probably be the same design space).


The copy-and-patch paper is also written by me. Deegen is a follow-up work to copy-and-patch, and it will use copy-and-patch as a tool to build its JIT tiers in the future.


Haha, whoops! I didn't realize that :)



For your (and maybe other people's) information, this post is authored by the person you are replying to.


I’ve been working on an early design of a high-performance dynamic binary translator that cannot JIT, and have reached a very similar conclusion as the author. We have an existing threaded interpreter but it’s a mess of hard-to-maintain assembly for two architectures, and we run into funny issues all the time where the two diverge. Plus, being handwritten by people who are not scheduling experts, there is probably some performance left on the table because of our poor choices and the design making it difficult to write complex-but-more-performant code. Nobody wants to write an efficient hash for TLB lookups in a software MMU using GAS macros.

The core point I’ve identified is that existing compilers are pretty good at converting high level descriptions of operations into architecture-specific code (at least, better than we are given the amount of instructions we have to implement) but absolutely awful at doing register selection or dealing with open control flow that is important for an interpreter. Writing everything in assembly lets you do these two but you miss out on all the nice processor stuff that LLVM has encoded into Tablegen.

Anyways, the current plan is that we're going to generate LLVM IR for each case and run it through a custom calling convention to take that load off the compiler, similar to what the author did here. There's a lot more that I'm handwaving over that's still going to be work, like whether we can automate the process of translating the semantics for each instruction into code, how we plan to pin registers, and how we plan to perform further optimizations on top of what the compiler spits out, but I think this is going to be the new way that people write interpreters. Nobody needs another bespoke macro assembler for every interpreter :)


Great summary, this matches my experience. For straight-line code, modern C compilers can't be beat. But when it comes to register allocation, they constantly make decisions that are real head-scratchers.

One of the biggest problems is when cold paths compromise the efficiency of hot paths. You would hope that __builtin_expect() would help, but from what I can tell __builtin_expect() has no direct impact on register allocation. I wish the compiler would use this information to make sure that cold paths can never compromise the register allocation of the hot paths, but I constantly see register shuffles or spills on hot paths that are only for the benefit of cold paths.
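
To make that concrete, here is a small hypothetical example: even with __builtin_expect, every value that is live across the rarely-taken call has to survive it, so the register allocator either parks it in a callee-saved register or spills it around the call, and that choice also shapes the hot path.

  #include <cstddef>
  #include <cstdint>

  void report_zero(std::size_t index);   // cold, out-of-line diagnostic (hypothetical)

  uint64_t sum_nonzero(const uint64_t *p, std::size_t n) {
    uint64_t acc = 0;
    for (std::size_t i = 0; i < n; i++) {
      uint64_t v = p[i];
      if (__builtin_expect(v == 0, 0))
        report_zero(i);   // acc, p, n and i are live across this call, so they
                          // must sit in callee-saved registers or be spilled
                          // around it, even though it almost never executes
      acc += v;
    }
    return acc;
  }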

Is there anywhere I can follow your work? I am very interested in keeping track of the state of the art.


Yeah, I did a quick check in LLVM at some point to see what it does (query I relied on: https://github.com/llvm/llvm-project/search?q=getPredictable...) and all the results seemed to be exclusively code motion or deciding how to lower a branch. Similarly cold path outlining seemed to just want to split the function in a fairly simple manner rather than doing anything beyond that. Perhaps I missed something, but I think the current hints are just to help the branch predictor or instruction cache rather than significantly alter codegen.

Unfortunately, I don't have much to share at the moment besides my thoughts; I've done a few small tests but haven't been able to really do a full implementation yet. The primary consumer of this work would be iSH (https://github.com/ish-app/ish), which has a need for a fast interpreter, so you can at least take a look at the current implementation to see what we'd like to replace. The nature of the project means that most of my time has been tied up in things like making sure that keyboard avoidance is set up correctly and that users can customize the background color of their terminal :/

With that said, I'd be happy to chat more if you'd like–feel free to send me an email or whatever. Not sure I can say I'm at the state of the art yet, but perhaps we can get there :)


I work on a game that mostly uses Lua for its logic code, sitting on top of a C/C++ engine. One of the engine developers implemented LuaJIT years ago and found that, for actual performance, the interpreter/JIT part was negligible; the most expensive thing was the actual switch between Lua code and C code, which with a large API is a constant, ugly background performance loss.

The Lua code by itself did not run long enough to actually profit much from optimizations.

So, back then we did discuss possible solutions, and one idea was to basically notice an upcoming C call in the bytecode ahead of execution and detect the stability of its arguments ahead of time. A background thread extracts the values, performs the call's argument and return processing, and pushes the values onto the stack, finally setting a "valid" bit to unblock the C call (which by then is actually no longer a call). Both sides never suffer a complete cache eviction and live happily ever after. Unfortunately I have a game dev addiction, so nothing ever came of it.

But similar minds might have pushed similar ideas.. so asking the hive for this jive. Anyone ever seen something similar in the wild?


I actually measured it. It was around 65-85 ns for Lua and 55 ns for LuaJIT (on a 7950X). It adds up every time you make a call, despite looking like a small value, and of course you can multiply it by 2-5x when you put it in production. A decent emulator these days should have < 20 ns time to first instruction. The worst emulators are the ones that require you to host them in another thread for the purpose of an execution timeout using some kind of timed wait (e.g. a futex op), adding a massive 10-20 microseconds of overhead just to communicate tasks.

Ultimately, a script in a game engine mostly makes a bunch of API calls, so we want the "system call overhead" to be that of an opaque function call (4 ns), and sometimes you share memory in order to read out properties and put abstractions on them (probably not relevant to Lua). So it's important to get going fast, and to be able to make hundreds of engine API calls without unnecessary costs.

Benchmarks: https://gist.github.com/fwsGonzo/2f4518b66b147ee657d64496811...


Sorry if my comment is naive or ignorant, but have you tried using the LuaJIT FFI instead of the traditional lua_State* API? Or is that with the FFI already?

I thought the FFI promised almost seamless Lua->C calls, and that it can even pass Lua functions into C with native signatures (though that bridge is still said to be expensive).


I'm not really answering your question here, but speaking more generally about calls to functions bound through the Lua C API.

There is trace stitching, meant to help with code that does a lot of C calls. This was disabled on April 28, 2015 due to a flawed design, but enabled again on August 29 later that year with a working one.

Otherwise the recommendation is to reduce calls and maybe cache values on the Lua side of things if you cannot use the FFI.

As mentioned in other comments, if you can use the FFI, use it. The performance benefit is significant.

The downside is that the FFI API is unsafe, so in the context of game development you'd have to be careful about third-party scripts from modding or similar.

Another downside is that if you cannot use the JIT (like on iOS), the performance hit can be worse than using the Lua C API.


I believe LuaJIT also has a C FFI that should allow for faster function calls compared to the native Lua API.


In CPUs they call this speculative execution. It's a good idea if you can detect side effect free code (and don't have the sandboxing issues that caused problems on CPUs).


Fatshark?


Super interesting.

I think using the name LuaJIT Remake steps on the toes of the existing LuaJIT and I would advise using a more original name.

Anything distinctive would be better. Lua DeeJIT comes to mind, since the generator is called Deegen.


Yeah... Especially as it's not actually a JIT compiler (yet).


>More importantly, it is the world’s fastest Lua interpreter to date, outperforming LuaJIT’s interpreter by 28% and the official Lua interpreter by 171% on average on a variety of benchmarks

Wait what the hell


Faster than LuaJIT's interpreter.

People who focus on JIT often focus on the JIT and not the interpreter. Which is a shame because if you make the uncommon paths cheaper then you can tune your code for the hot paths a bit more aggressively. You get fewer situations where you are sacrificing Best Case for Average Case performance.


As in, LuaJIT with the JIT disabled, i.e. with the `-j off` flag? That would be really good to clarify in the article. They start off talking about how they can generate an interpreter and even a JIT, yet LuaJIT with its JIT turned on is not benchmarked to provide the necessary perspective; the results summary says "28% faster than the LuaJIT interpreter" before the scope of the research has been clarified, and only if you read through do you discover they did not make it generate JITs (yet). It seems everyone in this thread thinks there is now a faster Lua than LuaJIT. The author didn't claim anything incorrect, but they did lead people to believe it.


That's my read, at least. There are several points where they are careful to say "LuaJIT interpreter", and I don't think that's an accident. Occam's Razor says that's what they meant. The gist of the article is an incremental improvement on interpreter loops, not some counterintuitive, amazing breakthrough that nobody has heard of before. That's consistent with an apples-to-apples comparison.


I know it's what they meant, because they were very precise about it, just not very accommodating to people unfamiliar with the terrain. This is the kind of intro I had in mind:

> "The standard architecture of a JIT compiler (like V8, Java HotSpot or LuaJIT) includes both a bytecode interpreter, and a system to recompile a function to native code at runtime. The interpreter is the slow path in a JIT compiler, and is also a general bottleneck in a purely-interpreted VM like CPython. Since we are writing a Lua interpreter and LuaJIT's is the state of the art, the baseline for performance here will be LuaJIT with its JIT turned off."

This means the same thing as the whole "28% faster than LuaJIT's interpreter" business, but nobody comes away from reading that thinking this is faster than LuaJIT as a whole.


LuaJIT actually has a really fast and well-optimized interpreter though, and the article acknowledges this. The reason this interpreter has such great performance isn't directly because they automatically generate the assembly. The majority of the performance gain is because they implemented inline caching, an optimization that LuaJIT doesn't have:

> A lot of LJR's speedup over LuaJIT interpreter comes from our support of inline caching. We have rewritten the Lua runtime from scratch. In LJR, table objects are not stored as a plain hash table with an array part. Instead, our table implementation employed hidden classes, using a design mostly mirroring the hidden class design in JavaScriptCore.

> Hidden class allows efficient inline caching, a technique that drastically speeds up table operations.
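
For readers unfamiliar with the technique, here is a rough sketch (hypothetical layout, not LJR's actual one) of what an inline cache over hidden classes buys you: the common case becomes one pointer compare plus one load instead of a hash-table lookup.

  #include <cstdint>

  struct HiddenClass;                        // maps property names to slot indices

  struct TableObject {
    const HiddenClass *hc;                   // the object's current "shape"
    uint64_t slots[8];                       // property values at fixed offsets
  };

  struct InlineCache {                       // one per bytecode access site
    const HiddenClass *cached_hc = nullptr;
    uint32_t cached_slot = 0;
  };

  uint32_t lookup_slot(const HiddenClass *hc, const char *key);  // slow hash lookup

  uint64_t get_field(TableObject *t, InlineCache &ic, const char *key) {
    if (t->hc == ic.cached_hc)               // fast path: same shape as last time
      return t->slots[ic.cached_slot];       // one compare + one load
    uint32_t slot = lookup_slot(t->hc, key); // slow path: real lookup...
    ic.cached_hc = t->hc;                    // ...then refill the cache
    ic.cached_slot = slot;
    return t->slots[slot];
  }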


It would be interesting to see a 1-1 comparison with the LuaJIT interpreter _without corresponding runtime changes_. That would give a meaningful baseline for how much you potentially lose/gain using this approach.

The current measurement presents a comparison of two wildly different implementation strategies: LuaJIT Remake uses ICs and hidden classes while LuaJIT proper does not. It's a well-known fact that ICs + hidden classes can lead to major performance improvements. That insight originated in the Smalltalk world and was brought to the dynamic-language mainstream by V8† - but unfortunately it muddles the comparison here.

† V8 was open sourced on September 2, 2008... Coincidentally (though probably not :)) JSC landed their first IC prototype on September 1, 2008.


I agree with your point, but I want to point out that Deegen also provides APIs that hide all the details of inline caching. The bytecode only specifies the body lambda and the effect lambdas, and Deegen lowers them to an efficient implementation automatically, including exotic optimizations that fuse the cached IC effect into the opcode so one indirect branch can be avoided.

If the LuaJIT interpreter were to employ IC, it would have to undergo a major rewrite (due to its assembly nature) to have code as efficient as LJR's (which we generate automatically). This is one advantage of our meta-compiler approach.

Finally, although no experiment was made, my subjective opinion is already in the article:

> LuaJIT’s hand-written assembly interpreter is highly optimized already. Our interpreter generated by Deegen is also highly optimized, and in many cases, slightly better-optimized than LuaJIT. However, the gain from those low-level optimizations are simply not enough to beat LuaJIT by a significant margin, especially on a modern CPU with very good instruction-level parallelism, where having a few more instructions, a few longer instructions, or even a few more L1-hitting loads have negligible impact on performance. The support of inline caching is one of the most important high-level optimizations we employed that contributes to our performance advantage over LuaJIT.

That is, if you compare the assembly between LJR and the LuaJIT interpreter, I believe LJR's assembly is slightly better (though I would anticipate only a marginal performance difference). But that's also because we employed a different boxing scheme (again...). If you force LJR to use LuaJIT's boxing scheme, I guess the assembly code would be similar, since LuaJIT's hand-written assembly is already optimal, or at least very close to optimal.


There's some prior art on DSLs for efficient interpreters, like Anton Ertl's vmgen. https://www.complang.tuwien.ac.at/anton/vmgen/


Nice, do you know whether this DSL was ever used successfully for other endeavours (abstract interpretation, lowering to Why3, any kind of symbolic execution, or a language server)? I've been looking for a 'flex/bison with complete semantics' and it might be a piece of the puzzle.


I don't know -- there were at least a couple papers about it, but I haven't been following. At the time it was first announced I used some of the ideas in one of my projects (https://github.com/darius/idel/blob/master/src/opcodes.awk) but unfortunately didn't develop it further.


The main difference I see is this project seems to be using C++ (plus C macros?) directly instead of a DSL. And some llvm magic thrown in for good measure.


Fair. I think of DSLs as a way of thinking about the program you want to write; if a scheme is realizable with few compromises in an existing language, it can still be considered "DSL style". SICP calls this style "stratified design".


You can do all sorts of “stratified design” with C++ templates which I assume is underneath the covers of this system. boost::spirit is probably the most excessive use of this that I’ve seen so far.


ah, I remembered this but could not find it again, thanks.

Many, many years ago there was even someone trying to write a ruby VM with it[1]. Wow, I'm old.

[1] https://savannah.nongnu.org/projects/carbone


Nice to see Haskell make an appearance in an article about Lua and C/C++:

> For example, at LLVM IR level, it is trivial to make a function use GHC calling convention (a convention with no callee-saved registers)

This refers to the following section of [1]:

> "cc 10" - GHC convention: This calling convention has been implemented specifically for use by the Glasgow Haskell Compiler (GHC). It passes everything in registers, going to extremes to achieve this by disabling callee save registers. This calling convention should not be used lightly but only for specific situations such as an alternative to the register pinning performance technique often used when implementing functional programming languages. At the moment only X86 supports this convention and it has the following limitations:

> - On X86-32 only supports up to 4 bit type parameters. No floating-point types are supported.
> - On X86-64 only supports up to 10 bit type parameters and 6 floating-point parameters.

> This calling convention supports tail call optimization but requires both the caller and callee are using it.

[1] https://llvm.org/docs/LangRef.html#calling-conventions


I didn't realise LLVM supported such a wide variety of calling conventions.

It would be cool if they supported OpenVMS/x86-64 calling convention–even not on OpenVMS. Why? Well, OpenVMS calling convention has this cool little feature which nobody else seems to have – the parameter count to a function is passed in a register. What this means–people always ask "how can a C variadic function know how many parameters it was passed?" And the usual answer is – you need to add an argument to explicitly pass an argument count (the computation of which can be automated using preprocessor voodoo), or you need to pass some kind of printf-style format argument, or a sentinel value (terminal NULL), or whatever. "Why can't we just have a va_count() which tells us how many arguments we were passed?" "Not possible, per the calling convention, the caller doesn't tell us."

However, for OpenVMS, there actually is such a thing as va_count() – it pulls it from that extra register (the "Argument Information Register"–which also encodes some info on their types, so you can know whether each parameter is passed in an integer register or a floating point one.) Anyway, if LLVM/Clang/etc supported OpenVMS calling convention on other platforms, they could enjoy va_count() too, you'd just have to declare that as the calling convention for your function.
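
For contrast, the portable workaround alluded to above (since standard C/C++ has no va_count()) looks something like this sketch with illustrative names, where the "preprocessor voodoo" counts the arguments at the call site and passes the count explicitly:

  #include <cstdarg>
  #include <cstdio>

  // The caller must pass the count, since standard varargs cannot recover it.
  long sum_n(int count, ...) {
    va_list ap;
    va_start(ap, count);
    long total = 0;
    for (int i = 0; i < count; i++)
      total += va_arg(ap, int);
    va_end(ap);
    return total;
  }

  // Count up to 4 arguments at the call site so callers don't pass it by hand.
  #define COUNT_ARGS(...) COUNT_ARGS_(__VA_ARGS__, 4, 3, 2, 1)
  #define COUNT_ARGS_(a, b, c, d, n, ...) n
  #define SUM(...) sum_n(COUNT_ARGS(__VA_ARGS__), __VA_ARGS__)

  int main() {
    std::printf("%ld\n", SUM(1, 2, 3));   // prints 6
  }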


Interestingly System-V passes the number of floating point registers used for a variadic function in rax.


It only uses %al and leaves the upper 56 bits of %rax undefined. OpenVMS adds an arg count in %ah and then uses the rest of the register to hold bitmasks showing whether each register-passed argument was integer or float, and also (optionally) a pointer (stored as offset from call address) to a data structure containing type info for the stack arguments.


Optimising variadic functions seems fairly low value, but given they're usually a different calling convention anyway, passing an extra integer is probably harmless.


From what I understand, VSI's C compiler for OpenVMS/x86-64 actually gives you a choice of two different variadic calling conventions - something very close to standard SysV ABI x86-64 (which doesn't pass parameter count and type info), and the OpenVMS extension which adds that. So, while I doubt one extra "mov rax, CONSTANT" is going to be noticed (variadic function calls are rarely found on performance-critical code paths), if a function doesn't need va_count(), it can turn that off.

From what I understand, their x86 C and C++ compilers are a patched version of clang/LLVM – they've said they planned to upstream their changes, I don't think it has happened yet, but I hope it does. (Their other compilers – COBOL, Fortran, Pascal, BASIC, BLISS, MACRO – are also using LLVM as backend on x86-64, but with their legacy frontends.)


Relatedly, it’s worth noting that there are really good bindings between Lua and Haskell [https://hslua.org/], used in e.g. Pandoc for plugins.


Haskell and Pandoc have too much dependency bloat. Just installing Pandoc on Arch requires something like 200 MB of dependencies, over a hundred of them Haskell-specific.


Well, firstly, I’d note that Haskell on Arch (including Pandoc) is broken, mostly due to the fact that Arch likes to dynamically link everything and Haskell doesn’t (see also https://www.reddit.com/r/haskell/comments/yuwcei). But yes, Haskell packages do often tend to have quite a lot of dependencies. To some extent this is because the base library is very much ‘batteries-excluded’, so dependencies are necessary for things like file management and byte strings and so on, but I do see quite a few dependencies which could be split up (e.g. ‘servant-server’, a big dependency used to create a webserver in exactly one module of pandoc).


I think there's a typo where the Add() example code goes `lhs.As<tDouble>() && rhs.As<tDouble>()`. I'm assuming that should be a `+`?


If there was such a typo, it doesn't exist anymore!


Aha, they fixed it!


The author also worked on the Copy-and-Patch Compilation paper [1]. I wonder if they are going to use the same idea for one of the tiers of the JIT. I don't know if it would beat LuaJIT's JIT for code compilation speed, but the generated machine code could be faster.

[1] https://arxiv.org/abs/2011.13127



Awesome that the result is backed by benchmarks. The writing style is very clear and having actual code snippets is great. Favorited!

My tldr understanding is that:

- writing a byte code interpreter in C++ is slow, and a big reason is callee-saved registers / the compiler doesn't optimize well when multiple operations are implemented in the same "giant switch table" function

- but it's annoying to write everything in assembly

- so the author made a compiler that glues together C++ implementations of byte code instructions (compiled to LLVM IR) into an interpreter, avoiding callee-saved registers (and performs other optimizations)

It'd be interesting to see ablation testing for the other optimizations, like fusing indices into opcodes. My bet would be that avoiding having to restore registers dominated the performance impact--that was the case for the compiler I implemented with friends in college.


It is my understanding that the benchmarks are run against the LuaJIT interpreter (-j off), so here are some JIT numbers for reference, using the array3d.lua benchmark [1]:

  time luajit -j off array3d.lua 
  
  real 0m1,968s
  user 0m1,915s
  sys 0m0,052s
  
  time luajit -j on array3d.lua 
  
  real 0m0,128s
  user 0m0,068s
  sys 0m0,060s

[1] https://github.com/luajit-remake/luajit-remake/blob/master/l...


One caveat to pay attention to... Ever since LuaJIT and PUC-Lua diverged, PUC-Lua has made some significant changes to the garbage collector. I wonder if that might affect the comparison vs LuaJIT for memory-intensive benchmarks such as binarytrees and tablesort.


As detailed in the article, the garbage collectors for all interpreters tested were disabled for all benchmarks because this interpreter does not yet have garbage collection.


I have been working on this too, and while I am not at all interested in embedding LLVM, the lack of calling conventions and the lack of ability to make a direct jump is unfortunate.

My biggest pet peeve, though, is the inability to push unlikely stuff completely to the back of the program. I mean just put it away where nobody can see it. I also want to align code blocks to 1 byte, not 16 bytes. Not a single compiler lets me realign the instruction handlers in a way that compresses them down.

I also want to know what BOLT does that improves my interpreter by around 30%.


Clang/LLVM accepts (little-known but documented) flags to let you align code blocks using a custom alignment, though it only works at the file level. See [1].

However, I'm not sure if doing so is useful or necessary. Interpreter performance is sensitive to code layout (which affects hardware branch predictor accuracy), but I don't think there is a general way to optimize the code layout to make the branch predictor as happy as possible.

So if you changed your code alignment and saw a perf change, it's more likely caused by the random perf gain/loss due to code layout change, not because that 1-byte-alignment is better than 16-byte-alignment or vice versa.

[1] https://easyperf.net/blog/2018/01/25/Code_alignment_options_...


This seems like an awesome way of writing faster interpreters – i.e. not in assembly, but in C++ snippets you stitch together with a tool.

I did peek at the deegen tool a bit, and it seems quite large? https://github.com/luajit-remake/luajit-remake/tree/master/d...

I would be interested in an overview of all the analysis it has to do, which as I understand is basically “automated Mike Pall”

FWIW I think this is the hand-written equivalent with LuaJIT’s dynasm tool: https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_x64.dasc (just under 5000 lines)

Also there are several of these files with no apparent sharing, as you would get with deegen:

https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_x86.dasc

https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_ppc.dasc


I wonder why the target was 5.1 since 5.4 has been out for a while now?


Lua 5.2 introduced a new scoping rule (_ENV) that impacts the ability to implement efficient interpreters, which is why LuaJIT diverged from the main implementation. Insofar as Lua is notoriously the "fastest scripting language", Lua 5.1 is the "fastest Lua". I personally hope that the author eventually targets "LuaJIT flavored Lua", which incorporates some additional features from newer Lua without compromising the performance.

Edit: see replies


The functional impact of _ENV in Lua 5.2 below the parser level is simply that globals/function environments no longer exist, becoming syntactical sugar for access to a closed-over table. There's a fairly direct (as in, you could write a compiler for it) source-to-source transformation on a 5.2 program (that doesn't use debug or string.dump) where you replace every "global" access with explicit access to _ENV, resulting in a 5.1 with the same semantics as the 5.2 source. This came along with ABI breakage (since PUC then removed all the interfaces that interact with function environments), but at the design level the change only relaxes constraints on interpreter implementation.

5.3 does change up the data model in ways that are incompatible with certain desirable interpreter features (you generally have to choose between 64-bit integers and NaN-boxing), but the incompatibility of 5.2 specifically with efficient interpreter design is at best based on a misunderstanding based on pre-release discussion on the mailing list.

(Real talk though, Lua 5.2 was being floated around the same time period that Mike Pall had committed to LuaJIT 2 as a full rewrite. You can read a bit in between the lines of just the RC dates.)


_ENV is the same thing as setfenv/getfenv in Lua 5.1, except it's lexical. Unless the use of function environments also makes LuaJIT slower (I doubt it, as that would implicate _G and a host of other behaviors, all of which LuaJIT is renowned for making work performantly), I think the best way to interpret the original sentiment regarding _ENV is that it would be a PITA to support in a backward-compatible manner. There have also been several patches over the years to add _ENV support to LuaJIT, and AFAICT they've all been relatively simple (but not necessarily simple for someone who is firmly of the opinion that Lua should have been frozen at 5.1).


Interesting. Thanks for answering.


Probably because that's where PUC Lua and LuaJIT diverged, would be interesting to see it on 5.4


What’s a stack spill?


A stack spill is when a value that would ideally stay in a register has to be stored on the stack instead. For parameter passing, it's unavoidable when the number of parameters exceeds the number of registers available for passing them (the extra parameters spill over onto the stack) or when the parameters don't fit into the registers.
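
A tiny illustration, assuming the System V x86-64 convention: the first six integer arguments travel in registers, and the rest spill onto the stack.

  // Under the System V x86-64 ABI, a..f arrive in rdi, rsi, rdx, rcx, r8, r9;
  // g and h do not fit and are passed on the stack instead.
  long many_args(long a, long b, long c, long d, long e, long f,
                 long g, long h) {
    return a + b + c + d + e + f + g + h;
  }

The same thing happens inside a function body when there are more live values than registers: the compiler "spills" some of them to stack slots and reloads them later.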


AHHH thanks. Now that I know it's obvious from the name ;)


I wonder what makes it so much worse on those 2 specific tasks, especially the k-nucleotide one.


> More importantly, it is the world’s fastest Lua interpreter to date, outperforming LuaJIT’s interpreter by 28% and the official Lua interpreter by 171% on average on a variety of benchmarks

Really huge, if true!


wow this is a great article, so readable


Could this be used to build a better WebAssembly interpreter? wasm3 is the fastest Wasm interpreter I've found, but this approach sounds very intriguing!


Wasm3 is already written in the tail-call style. Its IR consists of direct threaded code: a list of function pointers that each load and tail-call the next one. The only improvement on that I could see would be if there are spills because of the calling convention running out of registers.
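
A rough sketch (hypothetical types, relying on Clang's [[clang::musttail]]; not wasm3's actual code) of that direct-threaded layout: the IR is an array of {handler pointer, immediate} pairs, and each handler loads the next pair and tail-calls its handler.

  #include <cstdint>
  #include <cstdio>

  struct Op;
  using OpFn = void (*)(const Op *ip, uint64_t *sp);
  struct Op { OpFn fn; uint64_t imm; };   // one "instruction" of the threaded IR

  #define NEXT() [[clang::musttail]] return ip[1].fn(ip + 1, sp)

  static void op_push(const Op *ip, uint64_t *sp) {
    *sp++ = ip->imm;   // push an immediate stored in the instruction stream
    NEXT();            // load the next function pointer and tail-call it
  }

  static void op_add(const Op *ip, uint64_t *sp) {
    uint64_t b = *--sp, a = *--sp;
    *sp++ = a + b;
    NEXT();
  }

  static void op_print(const Op *, uint64_t *sp) {
    std::printf("%llu\n", (unsigned long long)sp[-1]);
  }

  int main() {
    uint64_t stack[16];
    const Op program[] = { {op_push, 2}, {op_push, 3}, {op_add, 0}, {op_print, 0} };
    program[0].fn(program, stack);   // prints 5
  }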

Running out of registers in tail calls is one of the reasons I avoided that approach when doing Wizard's Wasm interpreter. One constraint there is that interpretation is done in-place (i.e. on the original Wasm bytes). I wrote it in hand-rolled asm and it uses 9 architectural registers.


Wasm3 leans into the hardware stack I believe.


Fascinating! I wonder if there's a way to keep this 100% inside the C ecosystem, without having to reach for an LLVM dependency...


No need to wonder - the article clearly explains why there isn't. That's why they made this tool in the first place!


In general, probably; you'd have to replace the part where he writes the LLVM IR directly with, say, GCC's IR. If you mean no IR at all, it doesn't sound like it, based on the part about replacing the assembly from LuaJIT.


GCC is written in C++ these days, so something like QBE (https://c9x.me/compile/) would be needed.


Not C, but you could probably do it with Rust & Cranelift's IR instead of LLVM's. Not as heavyweight a dependency as bringing in LLVM.


I just want to point out that LLVM is not a runtime dependency. It is only a build-time dependency if you want to build LJR from source. Once LJR is built, it is stand-alone and does not need LLVM at runtime.


Do other JIT developers who would want to use Deegen need to pull in LLVM as a dependency still?


It should be a build time dependency. There's no JIT here, so there's no calling into LLVM's JIT, so I'd hope it's equivalent to using clang to create a native binary that implements the lua language.

Could probably retarget it to emit assembly and commit that for a loose interpretation of building from source.


Sure, you’d just pick a language that has a flexible optimizing compiler and generate code using that instead.


> The problem with the “big switch-case” approach is that C/C++ compilers simply cannot handle such code well.

Is this the case with Rust?


Yes, it will run into the same register allocation and unstructured control flow issues.


Most likely, as the pathological behavior the article mentions is mostly around register allocation and Rust uses LLVM.


Faster than LuaJIT's interpreter... with the JIT disabled. Would've been nice to make that more clear.


Can this be done for javascript?


JavaScript doesn't have a standardized bytecode. But, I suppose you could design one, describe it in C++ and run it through this tool. Then you would "just" need a JS source -> bytecode runtime compiler.


Lua doesn't have a standardized bytecode either, just so you know. The fact that PUC-Rio Lua uses bytecode at all is almost an implementation detail, besides the fact that you can dump it and reload it later. But the actual bytecode format and instructions are definitely implementation details and not standardized.


Yes


Do these changes work on Apple M1s too?


This is a design improvement; it’s not specific to any one processor.


It does not, in fact, work, though, since the project (currently) only targets x86_64 Linux.


I interpreted the question as "could these work" rather than "do these work today".


sillycross, please consider adding an RSS feed to your blog. People do still use those.


This sounds a lot like the approach taken by GraalVM. Can someone better versed in this area comment?


No, it's almost entirely unrelated to GraalVM's approach.


Not in the original scope, but Truffle has a way to generate an interpreter from an annotated Java switch-case. It also generates the bytecode format/length, and if you make changes in the interpreter it can change the bytecode format again.

So I think it is related, if not the same thing. Plus, GraalVM also does JIT out of the box. And it has a GC.

That said, it's not all rainbows and unicorns: GraalVM has higher memory usage and is only 10-15% faster in production usage. AFAIK, Twitter is using it in production.


Very very impressive.


This is very cool.


PGO on a conventional executable with accumulated profiling data is likely to be more performant. JITs also have a problem because they must transmute RW data pages into RO code pages to satisfy NX. PGO enables longer-term analysis that effectively supersedes JIT. JITs are nice in theory, but don't tend to save profiling data to drive optimizations in performance-critical sections that matter.

Might try LuaToCee and then apply PGO iteratively.


This is not a JIT.


> Lua is concise yet supports almost every language feature one can find in dynamic languages

Having yet another terrible package manager? A non-existent ecosystem outside of (checks notes) game scripting and nginx? Reinvent-everything, where every developer everywhere has to reinvent everything poorly because the language is "concise" and "embeddable" and stuck in the '90s? Breaking changes between versions and interpreters worse than the Python 2/3 fiasco?

Yessir, I do love me those "features".


These points are well-known and many would agree if not for the tone of your comment and the missing context, which is:

I have been working on a research project to make writing VMs easier. <snip>. I chose Lua as the experiment target for my idea, mainly because <quote>

This is the most reasonable choice, because experimenting with a big language would drag all of its complexity into the project but not the interesting parts. Which language would you suggest instead, and why?


None of those are a factor when implementing an interpreter for Lua 5.1. One example given is stackful coroutines, which has nothing to do with package managers, release dates, or the differences between versions.

IOW, Deegen is a proof-of-concept that is being developed on Lua 5.1. Once it is more mature, it will be able to produce other bytecode VMs, too. Lua 5.1 just has enough language features to exercise a lot of capabilities, or in other words, result in Deegen being quite versatile.


What's wrong with luarocks?



