Honest question; I don't know Ruby's semantics well. But, as someone who has worked on many JITs in the past, how is it that in these results three different JITs failed to get more than a 2x performance improvement? Normally a JIT is a 10-20x improvement in performance, just from the simple fact of removing the interpreter dispatch loop. What am I missing?
YARV (Ruby's VM) is already direct-threaded (using computed gotos), so there's no dispatch loop to eliminate. YARV is a stack-based virtual machine, and the machine code that YJIT generates writes temporary values to the VM stack. In other words, it always spills temporaries to memory. We're actively working on keeping things in registers rather than spilling.
Ruby programs tend to be extremely polymorphic. It's not uncommon to see call sites with hundreds of different classes (and now that we've implemented object shapes, hundreds of object shapes). YJIT is not currently splitting or inlining, so we unfortunately encounter megamorphic sites more frequently than we'd like.
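To make "megamorphic" concrete, here's a contrived Ruby sketch (made-up classes, not taken from a real app) of a single call site dispatching over many receiver classes:

    # Five stand-in classes for the hundreds a real app might route
    # through one call site.
    handlers = 5.times.map do |i|
      Class.new do
        define_method(:handle) { "variant #{i}" }
      end
    end

    events = Array.new(1_000) { handlers.sample.new }

    events.each do |event|
      # This single call site sees five receiver classes. A polymorphic
      # inline cache needs an entry per class, and a site that sees
      # hundreds of classes overflows any fixed-size cache.
      event.handle
    end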
I'm sure there's more stuff but I hope this helps!
I've seen different people mean different things by this. Do you mean the IR is a list of bytecode handler addresses, and the end of every handler is a load+indirect jump? Or is there also a dispatch table? In my experience, duplicating the dispatch sequence (i.e. having no dispatch "loop") is worth 10-40%, and eliminating the dispatch table on top of that a bit more.
CPUs work hard to predict indirect branches these days, but the BTB is only so big. Getting rid of any indirect call or jump, regardless of whether it goes through a dispatch table, is a big win, perhaps 2-3x, because CPUs have enormous reorder buffers now and can really load a ton of code if branch prediction is good, which it won't be for any large program with pervasive indirect jumps.
> it always spills temporaries to memory. We're actively working on keeping things in registers rather than spilling.
In my experience that can be a 2x-4x performance win.
> It's not uncommon to see call sites with hundreds of different classes
Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?
> I've seen different people mean different things by this. Do you mean the IR is a list of bytecode handler addresses, and the end of every handler is a load+indirect jump? Or is there also a dispatch table? In my experience, duplicating the dispatch sequence (i.e. having no dispatch "loop") is worth 10-40%, and eliminating the dispatch table on top of that a bit more.
It's the former. Each bytecode instruction is the handler address, and every handler does a load + jump. There's no dispatch table (there are compilation options that let you use one, but I doubt anybody does, since you'd have to specifically opt in when you compile Ruby).
> Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?
In one of our production applications, the most popular inline cache sees over 300 different classes and ~600 shapes (this is only for instance variable reads, I haven't measured method calls yet but suspect it's similar).
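To illustrate why the shape count can outrun the class count (a minimal sketch, assuming the object-shapes model where the order of instance variable assignment matters):

    class Point
      def initialize(flip)
        if flip
          @y = 1   # assigning @y first produces one shape-transition chain
          @x = 0
        else
          @x = 0   # assigning @x first produces a different chain
          @y = 1
        end
      end
    end

    a = Point.new(false)
    b = Point.new(true)
    # a and b share one class, but an inline cache for an @x read now has
    # to handle two distinct shapes.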
The VM only has a monomorphic cache (YJIT generates polymorphic caches), and neither the VM nor the JIT does inlining right now.
Thanks for the replies. I could keep picking your brain, but maybe it's more efficient for me to read some documentation. Are there some design docs or FAQs or summaries of the execution strategies that you can point me to? Thanks.
> In my experience that can be a 2x-4x performance win.
What's the state of the art in register allocation? I see that the Android Runtime uses SSA form to allocate registers in linear time [0]. Are other language runtimes pushing the boundaries further and in different ways?
Author of Ludicrous JIT here (one of the earliest Ruby JITs).
It is easy to get a 10-20x speedup, if you limit yourself to compiling a subset of Ruby. When I first wrote Ludicrous JIT, I saw huge gains, but as I implemented more of the language, performance improvements over MRI (and later YARV) became more modest. Off the top of my head:
Implicit promotion from Fixnum to Bignum adds run-time type checking and overflow checking for integer math. This significantly cuts into performance on math-heavy benchmarks.
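A quick demonstration (Fixnum and Bignum were unified into Integer at the language level in Ruby 2.4, but the internal representation split remains):

    small = (2**62) - 1   # largest value a 64-bit CRuby keeps as a
                          # tagged immediate ("Fixnum")
    big   = small + 1     # one more, and it's heap-allocated ("Bignum")
    puts small.class      # => Integer
    puts big.class        # => Integer
    # Because the switch is silent, every compiled add/multiply needs an
    # overflow check plus a slow path for the big-integer case.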
Ruby does not store Fixnums and Floats in their native representation. This means they must be converted from Ruby's internal representation when doing math.
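In CRuby specifically you can even observe the tagging from Ruby code: a small Integer n is stored immediately as (n << 1) | 1, which leaks through object_id on 64-bit builds (an implementation detail, not a language guarantee):

    puts 0.object_id   # => 1
    puts 1.object_id   # => 3
    puts 5.object_id   # => 11  (2 * 5 + 1)
    # Native math must strip the tag bit, operate, re-tag the result, and
    # check for overflow, on every single operation.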
Anything that uses eval prevents local variables from being optimized away, stored in registers, etc. The call to eval may lie outside the method being compiled. For example, a method may pass a block to another method. The other method may convert that block to a Proc, which it can use as a binding when calling eval (I don't know if Rails still uses this idiom, but this was one of the reasons why performance gains when running Rails on JRuby were more modest than otherwise might be expected).
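A minimal sketch of that hazard (hypothetical method names): the callee turns the block into a Proc and evals against its binding, so a compiler can't prove the caller's local is dead or keep it in a register:

    def callee(&blk)
      # Reaches back into the caller's scope through the block's binding.
      eval("secret", blk.binding)
    end

    def caller_method
      secret = 42   # never mentioned again in this method...
      callee { }    # ...yet callee can still observe it
    end

    puts caller_method   # => 42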
Any program that uses set_trace_func can get a binding for any method invoked while the trace func is active. A Ruby implementation that supports set_trace_func is severely limited in what optimizations it can make. IIRC JRuby disables set_trace_func by default.
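For example (method and variable names here are made up), once a trace func is installed, every method's locals become observable from outside:

    set_trace_func proc { |event, file, line, id, binding, classname|
      # The hook receives a Binding for each event, exposing whatever
      # locals exist at that point in the traced code.
      p binding.local_variables if event == "line" && binding
    }

    def work
      total = 0   # can't safely live in a register or be optimized away
      total += 1
    end
    work
    set_trace_func(nil)   # disable tracing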
Exceptions in Ruby are implemented using setjmp/longjmp (or at least they used to be; I haven't done low-level Ruby programming in a long time). Other languages can use zero-cost exceptions without breaking backward compatibility.
In CL, classes can be redefined at runtime, and the changes affect already-instantiated instances! This is just an example, not the only dynamic feature, of course.
Common Lisp has SBCL which generates very fast AOT native code despite the dynamic nature of the language, so I am not sure that being dynamic is a great excuse for being slow.
My impression is that Ruby is more dynamic than pretty much everything else. I think this is true in terms of language features, but also in terms of the style that code is written in practice.
My greatest practical frustration from this is the difficulty of trying to claw back performance when it becomes important. I'd like more ways to say "from this point on, none of X, Y, or Z will change", and get some performance guarantees in return. For example, we have some code that dynamically generates a whole lot of classes from protobuf definitions. It's bad enough that it takes nearly a minute just to load and run all that code, but even after that I'm paying the cost of the assumption that any of those definitions might change at any moment. So I have awful load times and awful runtime performance.
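For flavor, a hedged sketch of that load-time pattern (the descriptor data and class names are invented, not from the actual protobuf gem):

    descriptors = { "User" => [:id, :name], "Order" => [:id, :total] }

    descriptors.each do |msg_name, fields|
      # Each generated class is ordinary, fully mutable metaprogramming
      # output; the VM has no way to know it is now "done" changing.
      klass = Class.new { attr_accessor(*fields) }
      Object.const_set(msg_name, klass)
    end

    u = User.new
    u.name = "example"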
I guess what I'm asking is: do you see a future where there is more explicit control afforded to people who want to pick their own tradeoffs without resorting to writing everything performance-sensitive in extensions written in C/Rust/whatever?
> I guess what I'm asking is: do you see a future where there is more explicit control afforded to people who want to pick their own tradeoffs without resorting to writing everything performance-sensitive in extensions written in C/Rust/whatever?
In Ruby: probably not. In general: yes. Julia is probably the closest we have to this today with its gradual typing. I would like to see more of this (and I suspect we will at some point).
JSVMs will optimize top-level script variables to assumed-const and then inline that constant into accesses that are known, through scope resolution, to bind to those globals, deoptimizing the code if the global is ever modified. Is Ruby dynamically scoped in a way that makes this infeasible?
Not as far as I can tell. It is very dynamic, but so is JavaScript (and Self).
I think the problem is different: Ruby is severely underspecified, and the only way to get good-enough compatibility is to piggyback on the official implementation with its interpreter, GC, libraries, C interface, build tools, package management, etc. I also think (this is just a hypothesis) that most programs spend almost all of their time in library code implemented in C, precisely because Ruby is so slow. That means a JIT has to reimplement lots of library code in a way that makes it JITable/inlinable, which is a lot of work, and it's hard to keep exactly compatible.
It is hard to do that in a way that removes all the (probably very obvious) inefficiencies.
Python has the same problem, plus it used to have JIT-hostile leadership. Ruby is quite friendly towards JITs.
One potential option would be to rewrite those bits in Crystal-lang. The languages are often code-compatible, and it doesn't sound like that task has a lot of external dependencies.
C2 was a clean rewrite by the Rice folks and C1 was, afaik, part of the Animorphic acquisition, which was written from scratch, though by Lars Bak and co who did indeed work on Smalltalk before. But AFAICT all they reused was the assembler.
Sure, I also don't mean that the code was taken 1:1, rather that there are a couple of languages that are as dynamic, with relatively good performance on dynamic compiler implementations.
They didn't profile against TruffleRuby, which does indeed get massive speedups of the type you're expecting. It's just really hard to JIT-compile Ruby to the level you'd expect having worked on V8, and GraalVM is the only VM that can do it. However, the Ruby community seems to want to have their own JIT written in C more than they want performance.
> However, the Ruby community seems to want to have their own JIT written in C more than they want performance.
YJIT is written in Rust, not C, but it's also not just a matter of wanting to write our own JIT for fun. There are a number of caveats with TruffleRuby which make a production deployment difficult:
1. The memory overhead is very large; it can be as much as 1-2GB, IIRC.
2. The warm-up/compilation time is much too long (it can take minutes for large applications). In practice this can mean that latency numbers go way up when you spin up your app. In the case of a server application, that can translate into lots of requests timing out.
3. It doesn't have 100% CRuby compatibility, so your code may not run out of the box.
There's a reason why you don't see that many TruffleRuby (or TrufflePython, TruffleJS, etc.) deployments in the wild. Peak execution speed after a lengthy warm-up is not the only metric that matters.
W.R.T. memory usage, that's true, but I think they've been making big improvements there lately with things like node inlining, so it may not be true in the near future.
W.R.T. 2, what are you comparing against here? Is the TruffleRuby interpreter that much slower than the CRuby interpreter, even once you use the native-image version? Because it seems like once it starts compiling the hotspots, the program must get faster than a purely interpreted version. Whilst it may take minutes to reach peak performance for that engine, how long does it take to reach the same performance as YJIT?
W.R.T. 3, yes, but is it easier to fix that or to develop a new JIT from scratch? What is your solution to the C extensions problem for example? My understanding is that this is a major limit to accelerating Python and Ruby without something like Sulong and its ability to inline across language boundaries.