
In C, a common way to get good performance without JIT is with bytecode and a giant switch statement. It would be interesting to see if that's viable in Rust. How does performance compare to an implementation using closures?



Inko (https://inko-lang.org/) uses this approach for its bytecode interpreter. I can't speak about how well it performs compared to an AST interpreter since I never wrote one for Inko, but generally bytecode interpreters should be faster. I also think bytecode interpreters would be easier to optimise, as you're no longer directly tied to the AST of the input language.

If you're curious, you can find Inko's interpreter loop here: https://gitlab.com/inko-lang/inko/blob/master/vm/src/vm/mach...


> as you're no longer directly tied to the AST of the input language

This is the case with the closure-based approach too, and indeed that was one of the main motivations for switching away from AST interpretation.


It was actually even slightly worse than AST interpretation - as mentioned in the post, this is a much larger topic that deserves its own attention, but I did include the bytecode case in the linked slides. https://www.slideshare.net/RReverser/building-fast-interpret...


My understanding is that you're doing subroutine threaded code[0] instead of bytecode, which isn't a common choice. Maybe this has something to do with Rust's particular performance profile.

Did you profile your bytecode implementation to see where the time is going?

[0]: https://en.wikipedia.org/wiki/Threaded_code#Subroutine_threa...


Well, that's disappointing. Any guesses about why?


If you mean in a comparison against AST interpretation, then I'd say that Rust tagged enums (which C doesn't have) are already essentially a bytecode, except properly aligned, so it makes sense that they give the same or slightly better performance than you would get with a manual bytecode implementation.


I am not sure I understand this argument. The C "tagged union pattern" is very similar memory-wise to Rust tagged enums; it's just less pleasant to use due to the lack of pattern matching and so on.

I would expect that the advantage of the closure version of your AST interpreter is coming from somewhere else (at the end of the day it is still an AST interpreter). One possibility is that you are passing around the operands to your filters as Rust lexically-closed values instead of using a custom stack for your interpreter, which makes things a bit more "statically typed".


> at the end of the day it is still an AST interpreter

In a pretty remote sense, I guess.

> One possibility is that you are passing around the operands to your filters as Rust lexically-closed values instead of using a custom stack for your interpreter, which makes things a bit more "statically typed".

Yes, that and using native dynamic dispatch instead of walking a tree structure with branching are making this technique much closer to template JITs than AST interpretation, with corresponding performance wins.


The reason I think it is fair to call it an AST interpreter is that there is roughly one function call for each AST node. This isn't necessarily a bad thing though -- AST interpreters are super simple and maintainable (as you have demonstrated) so if you got good performance out of one then it is great!


C certainly has this via unions. It's more manual but provides more control, including choice of alignment.

It's CPU dependent but in practice the cache benefits of packed instructions will probably dominate any misalignment penalty. It's not like x86 instructions are aligned after all.

What's the best way to implement a variable-width bytecode interpreter in Rust?


More control usually means less room for automatic optimisations. For example, Rust uses knowledge of tag + data combinations to niche-fill enums where possible (with more optimisations like this to come), and LLVM can use this information to more easily deduce which branches are unreachable.

As for alignment - you can control it in Rust enums too, if you wish, by putting custom aligned structures as variant payloads.

In general, I'm sure it would be possible to achieve the same (or, who knows, maybe even slightly better) performance as AST interpretation with manual tweaking, but the closure-based approach proved to give significant wins out of the box, while being more elegant to maintain and extend over time.


In a bytecode interpreter's loop, you want your language to be little more than a fancy assembler. Exotic techniques become viable and even critical for performance: computed goto, misaligned reads, pointer arithmetic, unchecked type conversions, etc. This is one place where you want to tightly control your machine code.


Oh, sure, if you're willing to go to unsafe techniques, you can go straight to a template JIT as well. But, for us, the balance is important, and safety is probably even more important than further wins in performance (which we can still easily get with algorithm-level optimisations before doing anything unsafe).


Rust has C-style unions now, but accessing them is unsafe. I would probably write manual parsing code via the byteorder crate.


This is a bit off-topic, but it has been on my mind for a while.

Has the Rust team considered that choosing cute names like "crate" might turn away some of the target audience?

The same applies to other ecosystems like homebrew, but in the case of Rust some of the target audience certainly includes C/C++ programmers.


C++ is itself a “cute” name. It seems to be doing just fine.

Beyond that, “crate” specifically has utility. The closest existing name is “compilation unit,” but with things like incremental compilation, that’s a bit of a misnomer. We need some kind of name.


C too (from B, which was a stripped down BCPL).


This is how ragel [1] does things. It basically compiles to giant-switch state machines, and is damn fast.

[1] https://www.colm.net/open-source/ragel/


Ragel was also the source of cloudbleed! So I would definitely understand if cloudflare were wary of using that kind of technique again.


That's not the right way to talk about it. The error was in code that used Ragel, but it was all on us. As part of improving things, though, we have moved away from non-memory-safe languages, and the Ragel code was C.


Ah, I had thought it was a bug in the code generated by ragel. But I guess not!


In a way it was, but it wasn't Ragel's fault. We made an error that ended up with generated C code that contained the bug. Bottom line for us was... use memory safe languages.


As far as I remember, Ragel wasn't the issue, but the code around it. If you call a Rust parser from broken C code that leaks memory, the same thing can happen.


Agreed. We used to use Ragel for some tasks and it was fast.


A better way to get good performance is to thread your switch statement, which was hard to do explicitly in Rust last time I tried (maybe you could do it if you mark functions as inlinable?).


With current branch predictors, threaded code might not make as big a difference as it used to.

https://hal.inria.fr/hal-01100647/document


What do you mean by "thread your switch statement"?


The "big switch statement" approach is for each bytecode instruction to complete by jumping to a centralized dispatch location (i.e. the switch statement).

The "threaded" approach is for each bytecode instruction to complete by decoding and jumping to the handler for the next instruction.

Basically, instead of `break` you have `goto handlers[nextIp->opcode]`.

The advantages of threading are fewer jumps and better branch prediction (since branch prediction is tied to IP). The disadvantages are slightly larger code and compilers struggle to optimize it, since the control flow is not structured.


This design is called a continuation-passing-style interpreter. [1]

Here's a production version from OpenJ9's JVM ByteCode Interpreter. [2]

[1] https://kseo.github.io/posts/2017-01-09-continuation-passing...

[2] https://github.com/eclipse/openj9/blob/01be53f659a8190959c16...



I think he means having a different thread for each case, or a subset of cases, instead of having any explicit switch statement.


Looks like what emulators are using.



