First, the fastest interpreters are never switch- or "jump threading"-based (i.e. computed goto). They are either based on passing the next pointer from one opcode to the next (e.g. perl5) or on an opcode array (Lua). Those two cache much better and avoid the dispatch overhead of a jump table entirely. That's not even talking about asm-tuned interpreter dispatch, which counts as the best of the best.
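For illustration only, a minimal C sketch of the two styles (hypothetical names, not the actual perl5 or Lua code):

/* (a) Next-pointer dispatch: each op knows its handler and its successor,
   so the loop is just "run the op, follow the link". */
typedef struct op Op;
struct op { Op *(*fn)(Op *); Op *next; };

static Op *op_nop(Op *o) { /* ... do the op's work ... */ return o->next; }

static void run_next_pointer(Op *o) {
    while (o)
        o = o->fn(o);       /* no jump table, no bounds check */
}

/* (b) Opcode-array dispatch: instructions are pre-decoded into a flat
   array of one-word structs; the pc is just a pointer into it. */
typedef struct { unsigned char op, a, b, c; } Ins;

static void run_opcode_array(const Ins *pc) {
    for (;;) {
        Ins i = *pc++;      /* one word fetched per instruction */
        switch (i.op) {     /* the dispatch mechanism itself can be anything */
        case 0: return;     /* e.g. RETURN */
        case 1: /* ... */ break;
        }
    }
}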
Second, the famous Ertl paper cited here mostly suggested using smaller opcodes to improve icache behaviour. Lua and LuaJIT do this with one-word ops and blow all other interpreters away.
Using the computed goto extension only helps in bypassing the out-of-bounds check at the start of the switch dispatch, and gains the famous 2-3%, and less on more modern CPUs. When comparing the variants, these differences should actually be explained.
Summary: this research result is not useful for state-of-the-art interpreters, only for conventional slow interpreters. And for those, the result should be ignored entirely, as much better interpreter speedup techniques exist that are not noted in this paper.
Since you seem to know a lot about this, which popular interpreters would you say are "state of the art" and which are "conventional slow"? It seems that lua and luajit (and perhaps perl5?) are in the former category. Is everything else in the latter?
tl;dr for those who don't want to read the paper in detail:
* Interpreters are conceptually a big switch(bytecode) statement in an infinite loop. They use various tricks to make it easier for the processor to predict the target of the next indirect branch (a minimal sketch follows this list).
* This previously had a dramatic impact on performance, because branch mispredictions are very expensive and it was difficult for the processor to predict them accurately.
* Recent Intel architectures have improved branch prediction, so the aforementioned tricks no longer have a very big impact (on Haswell x86 processors specifically).
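A minimal sketch of what those tricks look like in C, with made-up opcodes (not from any real interpreter): the switch version funnels every instruction through one shared indirect branch (plus the bounds check the compiler emits for the switch), while the computed-goto version gives each handler its own indirect jump, which is the extra context older predictors needed.

/* Plain switch dispatch: one shared indirect branch per instruction. */
static void run_switch(const unsigned char *pc, long *sp) {
    for (;;) {
        switch (*pc++) {
        case 0: return;                                  /* HALT  */
        case 1: *sp++ = 1; break;                        /* PUSH1 */
        case 2: sp -= 1; sp[-1] += sp[0]; break;         /* ADD   */
        }
    }
}

/* Computed-goto dispatch (GCC "labels as values"): every handler ends
   with its own indirect jump, and the switch bounds check disappears. */
static void run_goto(const unsigned char *pc, long *sp) {
    static void *dispatch[] = { &&do_halt, &&do_push1, &&do_add };
    goto *dispatch[*pc++];
do_halt:  return;
do_push1: *sp++ = 1;                 goto *dispatch[*pc++];
do_add:   sp -= 1; sp[-1] += sp[0];  goto *dispatch[*pc++];
}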
"Erlang does (that) for building on Windows. They use MSVC for most of the build, and then GCC for one file to make use of the labels-as-values extension. The resulting object code is then hacked to be made compatible with the MSVC linker."
It obviously mattered enough for the Erlang developers to use GCC even for just that critical piece of C code. It doesn't mean that it has to be used regularly in all code (not every switch is executed the way the main interpreter loop is).
I'd be interested to see benchmark numbers for LuaJIT, which makes it extremely easy to compile with either replicated dispatch (each bytecode instruction footer has a separate dispatch "jmp" instruction) or common dispatch (each bytecode instruction jumps to a common dispatch "jmp" instruction): http://repo.or.cz/w/luajit-2.0.git/blob/a5b1c4d98eeb97a95077...
// Instruction footer.
.if 1
// Replicated dispatch. Less unpredictable branches, but higher I-Cache use.
.define ins_next, ins_NEXT
.define ins_next_, ins_NEXT
.else
// Common dispatch. Lower I-Cache use, only one (very) unpredictable branch.
// Affects only certain kinds of benchmarks (and only with -j off).
// Around 10%-30% slower on Core2, a lot more slower on P4.
.macro ins_next
jmp ->ins_next
.endmacro
.macro ins_next_
->ins_next:
ins_NEXT
.endmacro
.endif
If my coworker came to me and showed me results where he measured that the "Labels as Values" implementation is not needed because "on his Haswell CPU the speedup is only 3%", I'd just ask: "can it be guaranteed that the code you propose to revert to the plain switch will run only on that CPU?"
If not, and the switch is in a performance-sensitive place, I'd keep the "Labels as Values" implementation.
There are many more CPUs in the world than "the latest Intel x86 CPU." Kudos for the developers of Haswell, but don't blindly trust the Inria researchers that tell you "Don’t Trust Folklore."
I also don't like that this research was "partially supported" by a European grant that is described thus: "ERC Advanced Grants allow exceptional established research leaders of any nationality and any age to pursue ground-breaking, high-risk projects that open new directions in their respective research fields or other domains." (1)
Is it ground-breaking, high-risk research to discover that the algorithm in the latest Intel processor matches the one other researchers (which I respect much more) discovered in 2006, a variant of which won several "branch prediction championships"? I'd say no.
At least they honestly write about the fact that the given algorithm is already known as highly efficient.
> the one other researchers (which I respect much more) discovered in 2006
You seem to be blissfully unaware that they're in fact the same researchers: André Seznec is a co-author of the paper.
This paper is valuable in pointing out that ITTAGE branch prediction performance is a very good predictor of Haswell performance. Because the Haswell algorithm is secret, that should be very helpful to developers who still have to care about branch prediction.
Interpreter developers shouldn't ignore pre-Haswell chips just yet, but if you were to, say, develop a new language, your design decisions should be guided by where the puck will be rather than where it used to be.
> You seem to be blissfully unaware that they're in fact the same researchers: André Seznec is a co-author of the paper.
He's the third author of the current paper but the first author of the 2006 one. It's a case of legitimizing "research" which only confirms exactly what Intel recently implemented in a given CPU generation, even if the grant is for other goals ("ground-breaking, high-risk projects"). Or, to be clearer, I respect the deeds, not the persons.
I fully understand Seznec's need to be able to claim that the algorithm is really used by Intel, and that this is why he co-authored the latest paper. But is confirming that Intel used ITTAGE "ground-breaking, high-risk"? No. And does that mean we should never use goto labels in tight loops? No.
The grant is for the DAL project [1], which presumably fits the "ground-breaking, high-risk" label. In such a project, some tasks will be high-risk, some tasks will be low-risk. Quantifying the branch predictor of current processors may be a comparatively easy task, but that doesn't mean it's trivial, useless or outside the scope of the project, which is to improve sequential performance of microarchitectures. Knowing that real-world processors are performing just as well as previous academic research is helpful, because it suggests that there is no difficulty or unrealistic assumption that prevented manufacturers from doing so. Conveying that knowledge to the compiler and interpreter community is important too.
While I somewhat agree with you, the real answer is: assume nothing and profile. Always. But there's a catch. Things are getting a bit out of hand; there are just too many architectures and configurations.
I think modern consumer-oriented code should run well on at least Intel Nehalem - Skylake, AMD K10 - K12, ARM Cortex A7/A8/A9/A15/A53/A57/A72, and Qualcomm Krait/Kryo.
Focus on 64-bit, but some attention should still be paid to 32-bit performance.
The number of actual targets is even worse, because cache and memory architectures vary so much. At least all of those have 64-byte cache lines.
In high-performance code, there's often a choice between balancing memory and compute load: which memory access patterns and layouts are you going to use, and where do you want your bottlenecks? You can often reduce bandwidth requirements by computing more, and vice versa.
Take Sandy Bridge for example:
While L1 and L2 cache sizes are fixed on Sandy Bridge, L3 cache varies between 1 and 20 MB.
Clock speeds vary from 1.0 to 3.6 GHz.
There are 1-4 memory channels; I believe they're usually arranged round-robin at 64-byte granularity. Say your access pattern is to touch every second 64-byte cache line vs. every line. On 1- and 3-channel systems, both perform about the same. When skipping every other cache line on a 2-channel system, effective memory bandwidth is equal to that of a 1-channel system, and a 4-channel system behaves like a 2-channel one. The code might be bandwidth-bottlenecked on a 1-2 channel system and compute-bound on a 3-4 channel system.
Then you're going to hit a DRAM page miss every... 1-8 kB? System dependent, of course.
To make the best choice for Sandy Bridge based systems, among other things, you need to know the number of memory channels and their arrangement, the amount of L3 cache, and the clock speed. On some Sandy Bridge CPUs you have a lot of time to do complicated data packing and unpacking, just trying to keep memory bandwidth usage down.
That was just one architecture. Now add all the other platforms into the mix. There are quite a few different instruction set extensions, and if you don't use them properly, you can lose an order of magnitude of performance.
I think we'll need runtime code generation in the future. On system JIT/AOT also for traditionally compiled languages, such as C and C++.
They did profile, but what exactly? Their "don't trust folklore" ("just 3% speedup") conclusion is valid only for one specific architecture and only for the most recent CPUs. As you say
> modern consumer-oriented code should run well on at least Intel Nehalem - Skylake, AMD K10 - K12, ARM Cortex A7/A8/A9/A15/A53/A57/A72, and Qualcomm Krait/Kryo.
If I understand correctly, they themselves measured a 30% speedup just one generation earlier on Intel, and they didn't measure any ARM or AMD at all.
It looks to me like the "don't trust folklore" claim was possible only by ignoring all the processors you name.
And you are right, they also didn't show that they are aware of the differences within one generation (does the cheapest processor of the same generation behave the same?).
> Their "don't trust folklore" ("just 3% speedup") conclusion is valid only for one specific architecture and only for the most recent CPUs.
For now. Five years from now, you can be sure most new architectures will have predictors that are just as good. So, while it might still be good to reach for the low-hanging fruit (jump threading), heavier approaches such as super-instruction replication are probably no longer a good idea.
Writing the code in a way that doesn't depend on the most advanced hardware solution in existence is the better choice, unless you target only that specific hardware or the solution is of the "use once, then throw away" kind.
That would be true if we were speaking of using this optimisation instead of that optimisation.
We're not. Here, they're saying that on this particular architecture, a number of optimisations are effectively useless. They're saying that naive code is almost as fast. They're saying that some "optimisations" have even become counter-productive.
The lesson I get from this? Unless I know a fair bit about the target platforms, the naive approach is better, because any performance model I have in mind might be invalidated anyway.
They actually benchmark the Python interpreter. They show that the gain from using goto labels is low only on Haswell. For me, that's not an argument to never use goto labels, and especially not to remove them from the Python implementation, which certainly has to run on many CPUs other than Haswell.
Edit: therefore I don't agree with your "unless I know a fair bit about the target platforms, the naive approach is better." I'd agree, however, with "if the speed (or the battery use) actually doesn't matter, the naive approach is better." Sure.
Good thing they don't advocate not using goto labels, nor removing them from the Python implementation. Heck, they're not even saying we should stop using goto labels on new projects.
Besides, goto labels are really low-hanging fruit that hardly complicate your interpreter and have no negative impact. They discuss heavier optimisations, some of which are slower on Haswell. Those might not be worth their while any more.
Also, "only on Haswell" won't apply for long. I give AMD 2 years to keep up with Intel's mighty branch prediction, if they haven't done so already. Mobile platform may be different given the energy requirements, though. I don't know.
So do you still write code for the 386 and 486 generations? There is a point where previous CPUs drop off as a concern. All bets are off if you do cutting-edge CUDA work: then you only care about the GPU you are running on today, because the one you were using yesterday is already outdated anyway.
Would I use goto labels instead of the plain switch if I were writing an interpreter in C (that's the topic of the article!) for a language that I know will be executed on ARM, AMD, Atom, you name it? The answer is: you bet I would. The fact that one particular CPU generation doesn't get much speedup from that construct doesn't mean I want to make all the others significantly slower. It's not an excuse, and I'm not that lazy under these circumstances. If I didn't care that much about performance, I wouldn't even use C.
My point was that eventually the technique will fall out of use if newer generations don't benefit from it. So say today X does no good on brand new hardware; we still do X because not all hardware is new, but the value of doing X will definitely diminish over time.
What is really annoying is when X is done to support an out-of-date CPU like a 386, to the detriment of CPUs that are actually currently in use. It does happen...
There's one caveat: advanced branch prediction is probably not free. It probably costs a bit of silicon, even some energy. It may or may not actually matter (I'm no hardware designer), but if it does, we may have to make some hard choices, like a few more cores vs badass branch prediction.
It may not matter on x86 specifically, where out-of-order machinery dominates anyway. But it might matter for something like the Mill CPU architecture, for which energy efficiency and not wasting silicon are very important.
I wouldn't be surprised if in this particular case it's even more about the patents and the possibility of having an advantage over the competition than about the silicon. Does anybody know if there is a patent? If there is, who holds it? How much more would a CPU that implemented that method cost?
If someone identifies an optimum (for performance, efficiency), all competitors are forced to go there to stay competitive. We already have more cores than we know what to do with, and better single-core performance is always appreciated, so badass branch prediction probably wins out IF they find that it is useful at all (these days, meaningful gains on a single core are difficult to eke out).
Is the "brand new hardware" for you your desktop, which draws hundreds of watts, or your mobile phone, which has to survive the whole day without recharging its 5 Wh battery but still let you surf the web with all the stuff you take as given? Making optimizations that work well with lower-wattage CPUs is a good thing, unless you limit yourself only to non-battery devices.
Mobile architectures are typically only a tock away from desktop architectures, which aren't really guzzling energy with these optimizations (the deep Pentium 4 pipelines are behind us!). So as technology marches forward, there isn't some mobile version that is frozen in time just because of energy efficiency (indeed, good branch prediction also saves energy). There are no 386 or 486 CPUs to care about now, while P5s are limited to that Phi HPC product.
The fact that the performance of an implementation varies significantly depending on the actual CPU could already be deduced from some of the Ertl papers.
Moreover, the fact that one uses a bytecode interpreter specifically, and not a JIT, is generally a sign that the interpreter is meant to be used on multiple platforms. "Optimizing" in this context simply means (almost) nothing.
The obvious conclusion is to not care too much about raw performance and focus instead on ease of interfacing with libraries, and to make it easy to add primitives/instructions to your VM. With the idea that if your interpreter is too slow, you just rewrite the critical parts in native code.
> you just rewrite the critical parts in native code.
Rewrite critical parts of interpreter in native code? My point was there's no optimal native code anymore.
Pure interpreters are falling out of fashion anyway. For JITted systems, there's a significant cost for calling native code, at least until JITs actually inline natively called code, possibly even across dynamic library (.so, .dylib, .dll, etc.) calls.
We need a lightweight, thin, profile-guided JIT/AOT engine. No standard library, memory-management agnostic. Something that can target different architectures, memory & cache configurations, and instruction set extensions that may have been unknown at design time. A compilation target for some C/C++/Rust/etc. and a runtime that takes care of the architecture details.
> Rewrite critical parts of interpreter in native code? My point was there's no optimal native code anymore.
No, I mean rewrite the critical parts of the "interpreted" program as interpreter primitives/instructions: the best way to eliminate interpreter overhead is to merge N primitives into one.
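A minimal sketch of the merging idea with made-up opcodes (illustration only, not from any particular VM): if profiling shows a hot LOAD/ADDI/STORE sequence, you add one primitive that does all three, so the sequence pays for one dispatch instead of three.

enum { OP_HALT, OP_LOAD, OP_ADDI, OP_STORE, OP_LOAD_ADDI_STORE };

static void run(const unsigned char *pc, long *mem) {
    long acc = 0;
    for (;;) {
        switch (*pc++) {
        case OP_HALT:  return;
        case OP_LOAD:  acc = mem[*pc++];  break;
        case OP_ADDI:  acc += *pc++;      break;
        case OP_STORE: mem[*pc++] = acc;  break;
        /* Merged primitive for the hot "LOAD slot; ADDI imm; STORE slot"
           sequence: same work, one dispatch instead of three. */
        case OP_LOAD_ADDI_STORE: {
            unsigned char slot = *pc++, imm = *pc++;
            mem[slot] += imm;
            break;
        }
        }
    }
}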
But I realise we have different perspectives. You seem to consider interpreter VMs as off-the-shelf components, whereas I consider VMs custom components because I work mainly with embedded systems. In the embedded world, JIT is often simply not an option, either because one is not available for your particular target or because the resource constraints don't allow it. But you still sometimes want to trade resources for ease of development and/or flexibility.
Right, I also work in embedded (without an OS), so I'm familiar with the limitations. However, there are situations where it's tempting to write a very primitive JIT, or at least glue pieces of code together in a buffer and execute it, to get good inner-loop performance by keeping it to as few instructions as possible.
So far C has been enough. Achieving a low (preferably zero) defect rate using C is challenging enough.
Usually FPGAs can take care of the performance-critical parts. They do increase total cost, though.
> Is it ground-breaking, high-risk research to discover that the algorithm in the latest Intel processor matches the one other researchers (which I respect much more) discovered in 2006, a variant of which won several "branch prediction championships"? I'd say no.
That appears to be a pretty reasonable criticism of the process by which the grant was awarded.
My somewhat unfair characterisation of this research from skim reading the abstract and conclusion is: "we point out that the understanding in the academic literature is lagging behind the understanding already built into systems that are sold commercially and are in operational use".
edit:
There's something slightly humorous in considering the grant-allocation process itself as some kind of optimisation problem:
Problem (GA) : How do we best allocate resources to promote high-risk, ground-breaking research?
...and then observing that grant resources have been allocated to perform research on another piece of optimisation that already seems to have happened.
I wonder if one could obtain an ERC Advanced Grant to study the effectiveness of the current process by which ERC Advanced Grants are awarded, and suggest improvements? It is plausible that such a study could be very valuable and yet not have much political support.
There are a number of reasons this is not openly available. Just a few:
- It may lock the programmer into a specific CPU
- The CPU cache/pipeline is less generic than you might think, which makes setting a standard way to change it from within a program problematic.
- Multiple programs may be running at the same time. Having a program dictate cache behaviour would affect all running applications.
- Arbitrary changes to behaviour would have performance implications when the change is made. In a multiple-application situation that could mean flip-flopping behaviour.
Generally, though, the benefit of manual control wouldn't be worth the cost/risk/loss of platform support for most applications, as CPU behaviour is very good and it would probably be cheaper to just upgrade to a more expensive CPU than to spend the time developing improvements to the algorithm.
In the rare case where you do need specific behaviour and it is worthwhile, it would be worth speaking to the vendors rather than doing it yourself.
In addition to what others have said (mainly, it would be specific to a particular version of the CPU, which is probably the biggest thing -- imagine if no x86 binary older than ~1 year could run on your chip), one more point:
The programmer doesn't necessarily have better knowledge than the CPU, because the programmer can't see or respond to runtime behavior. There are a lot of cases where behavior is input-data-dependent, or simply too complicated to reason about analytically a priori. The big wins in computer architecture over the past two decades have all been in mechanisms that adapt dynamically in some way: dynamic instruction scheduling (out-of-order), branch prediction based on history/context, all sorts of fancy cache eviction/replacement heuristics, etc.
Itanium tried "VLIW + compiler smarts" and the takeaway, I think, was that it was just too hard to build good enough static analysis to beat a conventional out-of-order CPU.
The problem with OoO is that it's way too expensive (in terms of power and area) in many cases. You won't ever see OoO in GPUs and DSPs, and it's unlikely in microcontrollers. So VLIW and the other "stupid core, smart compiler" approaches are still legitimate and will always remain valuable.
Yup, in some domains it definitely still makes sense. GPUs work well for highly-data-parallel applications (they're essentially vector machines, modulo branch divergence) and VLIW-style DSPs work because the code is numerical and easy to schedule at compile time.
I've worked mostly in the "need performance as high as possible for general-purpose code" domain, so I may be biased!
Ever hear of this processor called MIPS? The design behind MIPS (Microprocessor without Interlocked Pipeline Stages) was originally based on the notion that every time the hardware would need to introduce a bubble, the CPU would instead continue executing instructions--giving rise to load delay slots and branch delay slots.
The resulting findings from practice were that these slots were hard to fill (i.e., mostly nops) and that, instead of making hardware simpler, they made hardware much more complex whenever you decided to adjust microarchitectural details.
There is some limited support for poking the cache, though (prefetching, flushes, bypass, load streaming), but all of those techniques tend to be very coarse-grained.
They have been produced, but x86's dominance has always been about fungibility and idiot-proofness while maintaining acceptable performance. It's never been the fastest possible or cheapest possible architecture; it's the architecture whose success has been built upon hiding all the complicated bits behind the curtains, so that software people can worry about software and not hardware.
There actually are some performance upsides to that too. It allows the chip architects to radically change the underpinnings of the chip with impunity, doing whatever works best on the current process node and using the latest advancements from computing research.
Another reason might be security: there are already known attacks that use the CPU cache and/or instruction pipeline to break the boundaries of virtualization. With today's virtualizable processors, it becomes more and more important that users cannot access too many of the processor's implementation details.
There are architectures that require manual pipeline control. MIPS, for example, or any VLIW architecture (including many of the modern GPUs). Cache is also often programmable (see scratch memory in Sparc, local memory in GPUs, etc.)
The article is missing one important difference between Python's primitive use of jump threading and the OCaml bytecode interpreter, which pre-compiles the bytecode into threaded code. The difference in performance between these two is huge.
> the OCaml bytecode interpreter […] pre-compiles the bytecode into threaded code
Wait a minute, how is that even possible? Or rather, what do you mean?
Here's my layman's view of things: the bytecode is basically a string of opcodes. The interpreter somehow has to look up the opcode before executing it. And it can be threaded, or use an ordinary switch, or perform some other bizarre optimization I don't understand.
But what does it mean for the bytecode itself to be compiled into a threaded form?
> Wait a minute, how is that even possible? Or rather, what do you mean?
It means that the stream of bytecodes (32-bit numbers in OCaml) is replaced with the corresponding label addresses (on 64-bit hosts, addresses minus a fixed offset) once. After that, execution is trivial. This is called "indirect threaded code" [1] and is used a lot in Forth implementations, for example.
See the definition of the "Next" macro in interp.c [2] for details.
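For the curious, here is a minimal GNU C sketch of that pre-translation step with made-up opcodes (illustration only, not OCaml's actual interp.c): the numeric opcodes are rewritten into handler addresses once, and from then on dispatch is a single indirect goto through the code stream.

#include <stddef.h>
#include <stdint.h>

enum { OP_HALT, OP_INCR, OP_COUNT };

long run(intptr_t *code, size_t len) {
    /* Handler addresses, indexed by numeric opcode (GCC labels as values). */
    static void *labels[OP_COUNT] = { &&do_halt, &&do_incr };
    long acc = 0;

    /* One-time pass: replace each numeric opcode with its handler address. */
    for (size_t i = 0; i < len; i++)
        code[i] = (intptr_t)labels[code[i]];

    /* From here on there is no per-instruction table lookup, just an
       indirect goto through the (now threaded) code stream. */
    intptr_t *pc = code;
    goto *(void *)*pc++;

do_incr:
    acc++;
    goto *(void *)*pc++;
do_halt:
    return acc;
}

Running this on a stream like { OP_INCR, OP_INCR, OP_HALT } returns 2; the translation happens once per code block, so it must not be applied to an already-translated stream.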
Which means that instead of having the compiler construct a jump table under the hood, I do this myself and get the benefit of jumping from several locations instead of just one. But I still look up that table. Now for the indirect threading you speak of:
If I got that correctly, instead of accessing the jump table at some random place, I only access the label table in a much more linear fashion (during the one-time translation), saving one indirection and some memory accesses in the process, which should relieve some pressure on the L1 cache.
Exactly, you got it right. One indirection less, and no jump table in the cache. And CPUs are very well optimised for this kind of indirect jump, because of vtables.
There is also one step further: emit the actual jump instructions, making direct threaded code. It is a bit more complicated and not very portable, but if you want to squeeze out all the cycles you can, it's still an easy and fast option before resorting to a full-scale compiler.
I think direct threading is usually implemented just by reducing the indirection another level,
e.g.
goto *lp
rather than
goto **lp
You could implement an interpreter with generated JMP instructions, which might be called direct threading, but at least in the Forth world, where most jumps are to nested user-defined subroutines, subroutine threading using JSR instructions is typically used.