In C, a common way to get good performance without JIT is with bytecode and a giant switch statement. It would be interesting to see if that's viable in Rust. How does performance compare to an implementation using closures?
Inko (https://inko-lang.org/) uses this approach for its bytecode interpreter. I can't speak about how well it performs compared to an AST interpreter since I never wrote one for Inko, but generally bytecode interpreters should be faster. I also think bytecode interpreters would be easier to optimise, as you're no longer directly tied to the AST of the input language.
It was even slightly worse than AST interpretation actually - as mentioned in the post, this is a much larger topic that deserves its own attention, but I did include the bytecode case in the linked slides. https://www.slideshare.net/RReverser/building-fast-interpret...
My understanding is that you're doing subroutine threaded code[0] instead of bytecode, which isn't a common choice. Maybe this has something to do with Rust's particular performance profile.
Did you profile your bytecode implementation to see where the time is going?
If you mean in a comparison against AST interpretation, then I'd say that Rust tagged enums (which C doesn't have) are already essentially a bytecode, except properly aligned, so it makes sense that they give the same or slightly better performance than you would get with a manual bytecode implementation.
I am not sure I understand this argument. The C "tagged union pattern" is very similar memory-wise to Rust tagged enums; it's just less pleasant to use due to the lack of pattern matching and so on.
I would expect that the advantage of the closure version of your AST interpreter is coming from somewhere else (at the end of the day it is still an AST interpreter). One possibility is that you are passing around the operands to your filters as Rust lexically-closed values instead of using a custom stack for your interpreter, which makes things a bit more "statically typed".
> at the end of the day it is still an AST interpreter
In a pretty remote sense, I guess.
> One possibility is that you are passing around the operands to your filters as Rust lexically-closed values instead of using a custom stack for your interpreter, which makes things a bit more "statically typed".
Yes, that and using native dynamic dispatch instead of walking a tree structure with branching are making this technique much closer to template JITs than AST interpretation, with corresponding performance wins.
The reason I think it is fair to call it an AST interpreter is that there is roughly one function call for each AST node. This isn't necessarily a bad thing though -- AST interpreters are super simple and maintainable (as you have demonstrated) so if you got good performance out of one then it is great!
C certainly has this via unions. It's more manual but provides more control, including choice of alignment.
It's CPU dependent but in practice the cache benefits of packed instructions will probably dominate any misalignment penalty. It's not like x86 instructions are aligned after all.
What's the best way to implement a variable-width bytecode interpreter in Rust?
More control means, as usual, less room for automatic optimisations. For example, Rust uses its knowledge of tag + data combinations for niche-filling enums where possible (with more optimisations like this to come), and LLVM can use this information to more easily deduce which branches are unreachable.
As for alignment - you can control it in Rust enums too, if you wish, by putting custom aligned structures as variant payloads.
In general, I'm sure it would be possible to achieve the same (or, who knows, maybe even slightly better) performance than AST interpretation with manual tweaking, but the closure-based approach proved to give significant wins out of the box, while being more elegant to maintain and extend over time.
In a bytecode interpreter's loop, you want your language to be little more than a fancy assembler. Exotic techniques become viable and even critical for performance: computed goto, misaligned reads, pointer arithmetic, unchecked type conversions, etc. This is one place where you want to tightly control your machine code.
Oh, sure, if you're willing to go to unsafe techniques, you can go straight to a template JIT as well. But, for us, the balance is important, and safety is probably even more important than further wins in performance (which we can still easily get with algorithm-level optimisations before doing anything unsafe).
C++ is itself a “cute” name. It seems to be doing just fine.
Beyond that, “crate” specifically has utility. The closest existing name is “compilation unit,” but with things like incremental compilation, that’s a bit of a misnomer. We need some kind of name.
That's not the right way to talk about it. The error was in code that used Ragel, but it was all on us. Still, as part of improving things we have moved away from non-memory-safe languages, and Ragel was generating C.
In a way it was, but it wasn't Ragel's fault. We made an error that ended up with generated C code that contained the bug. Bottom line for us was... use memory safe languages.
As far as I remember, Ragel itself wasn't the issue; the code around it was.
If you use a Rust parser from inside broken C code which leaks memory, the same thing can happen.
A better way to get good performance is to thread your switch statement, which is hard to do explicitly in Rust last time I tried (maybe you could do this if you mark functions as inlinable?).
The "big switch statement" approach is for each bytecode instruction to complete by jumping to a centralized dispatch location (i.e. the switch statement).
The "threaded" approach is for each bytecode instruction to complete by decoding and jumping to the handler for the next instruction.
Basically, instead of `break` you have `goto *handlers[nextIp->opcode];` (a GNU C computed goto).
The advantages of threading are fewer jumps and better branch prediction (since branch prediction is tied to IP). The disadvantages are slightly larger code and compilers struggle to optimize it, since the control flow is not structured.