- Fetches 16-32 bytes
- Finds instruction boundaries (because of variable-length encoding)
- Branch predicts any branches in the fetch group
- Decodes 3-4 instructions in the fetch group
- Issues 3-4 decoded operations to the rest of the pipeline
This whole process has to happen every cycle (though it's pipelined over 3-5 cycles in the front end), and the performance of this process gates the rest of the pipeline. As a result, many of the micro-architectural "gotchas" in modern x86 processors tend to be in the fetch/decode phase.
You are overgeneralizing. Being front-end bound is true for some code but not for others.
We see our gotchas typically in terms of not being able to issue instructions due to data dependencies and filling up the RS with things that can't make progress (informally, a big tree of "ORs" where one source for everything has missed L1 and held everyone else up).
Other people will be hopelessly bound on cache or memory bandwidth, or floating point multiply, or whatever. Or disk access :-)
If you're curious about whether this applies to your own code, you don't have to speculate - just bust out the Optimization Guide (http://www.intel.com/content/www/us/en/architecture-and-tech...) and go figure out where your bottlenecks are by looking at counters. VTune can do some of this stuff for you.
I'm speaking from an architectural design standpoint. Exceeding the ability of the OOO engine to cover cache misses isn't a "gotcha"; it's a design limit of the hardware. Being able to predict only three branches in a 16-byte fetch group, even though more than 3 branches can fit in a 16-byte stretch of code, is a gotcha. Being able to execute only 3 SSE operations per cycle is a design limit. Not being able to hit 3 SSE operations per cycle depending on which registers you use and how many prefix bytes they require to encode is a gotcha.
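To make that last gotcha concrete, here's a rough sketch (throwaway function, encodings shown for 64-bit mode, not anyone's real code):

    void prefix_demo(void)
    {
        /* Same operation, different registers: the second form has to
           reach xmm8/xmm9, so it needs a REX prefix and encodes one
           byte longer (roughly 0f 58 c1 vs 45 0f 58 c1).  More prefix
           bytes per instruction means fewer instructions fit in a
           16-byte fetch group.  Purely an encoding illustration. */
        __asm__ volatile ("addps %%xmm1, %%xmm0" ::: "xmm0");
        __asm__ volatile ("addps %%xmm9, %%xmm8" ::: "xmm8");
    }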
Modern Intels also cache decoded uops to alleviate this. So if your (say) main loop is less than about 1500 uops, decoding performance becomes less relevant. This is particularly important for SIMD instructions, which are often quite large (> 5 bytes) and thus can easily bottleneck on the decoding step.
Things like this are why I got out of writing ASM for anything back in the 1990s. It's gotten to the point where a tight loop might still be doable by hand in a performant way, but anything bigger is far too complicated to make perform well everywhere. Along with that, you can't make it portable to other architectures.
> Along with that, you can't make it portable to other architectures.
Not only that, but it seems from other stuff I've read that you'd have a hard time making x86 code performant on more than one microarchitecture or family thereof, which means re-writing the same code for different manufacturers and different generations from the same manufacturer.
It seems that if you accept the façade that says "x86 is x86" you're better off writing in C or even Haskell, but if you peel the façade back to write fully performant code you're doomed to either keep re-writing it or doomed to see it languish as processor technology moves on.
I'm immediately reminded of the early RISC design concept of "moving the microcode into the compiler"; having extremely simple hardware that explicitly exposed design details like pipeline length in the form of delay slots becoming ISA features and saying that compilers would handle all the resulting complexity much like how microcode handled it in the System/360 and similar.
It turns out architectural branch delay slots were a mistake, but compilers have gotten better at turning obvious source code into performant machine code unless you need so much performance you have to dig into architectural details anyway.
It's unavoidably complicated. And now we're moving towards multi-core CPUs, which bring to the fore another area which is unavoidably complicated.
This is spot on. We use a lot of intrinsics, as we want to get at some of the underlying features of the architecture, but we almost never write asm. It becomes obsolete almost immediately and is often far worse than the compiler's first cut.
We had a former performance guru who once put a huge #ifdef I_DONT_CARE_ABOUT_PERFORMANCE around the "naive" C version of a loop and had his favourite x86 asm version on the #else branch of the ifdef. Needless to say, much hilarity resulted when defining this macro improved performance. I suspect that the naive version may have been a bad idea at one stage (Pentium 4, perhaps) but became a better version from Core 2 onwards...
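For anyone who hasn't seen the pattern, a hypothetical reconstruction (the macro name is from the story above; the function names are made up):

    #include <stddef.h>
    #include <stdint.h>

    #ifdef I_DONT_CARE_ABOUT_PERFORMANCE
    /* The "naive" portable C version. */
    static uint32_t checksum(const uint8_t *p, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += p[i];
        return sum;
    }
    #else
    /* The hand-tuned asm version, presumably great on one long-dead core. */
    uint32_t checksum_asm(const uint8_t *p, size_t n);
    #define checksum checksum_asm
    #endif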
Occasionally I'll get on a wild tear and think "Oh, I can do better than the compiler" after it generates some ridiculous-looking instruction sequence that seems to not be using enough registers (or whatever). I then find that hand-writing and hand-scheduling everything in accordance with what looks really good to me actually works worse than the insane-looking code from the compiler. Put it down to stuff like store-to-load forwarding, dynamic scheduling, register renaming, etc.
> Occasionally I'll get on a wild tear and think "Oh, I can do better than the compiler" after it generates some ridiculous-looking instruction sequence that seems to not be using enough registers (or whatever).
Compilers are very good at some things, like instruction scheduling and register allocation, but only poor to mediocre at others, like instruction selection. In addition, compilers are not allowed to do some optimizations, like reordering calls to functions in other translation units or doing certain arithmetic optimizations on floating-point numbers.
So if you have a piece of performance-critical code, you can still be smarter than the compiler alone by cooperating with it and writing compiler-friendly, close-to-the-metal code. Inline assembly code is not good for the compiler; it can't really optimize it at all. But by using CPU- and compiler-specific intrinsics to get at your hardware's capabilities (SIMD, special CPU instructions), you can get your code to run close to the speed of light without compromising readability or maintainability.
Yes, compilers these days are smart, but thinking that you can't do better than them is a lazy man's fallacy.
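As a minimal sketch of the intrinsics approach (sum4 is a made-up name, SSE only, and n is assumed to be a multiple of 4):

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Sum n floats four at a time: still readable C, but the compiler
       is told exactly which vector instructions we want. */
    static float sum4(const float *a, size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }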
"Inline assembly code is not good for the compiler, it can't really optimize it at all."
Actually it can, if you have a way to specify register usage (gcc does) and the programmer gives the right specification. Unfortunately, that almost never seems to happen in practice.
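For example, with GCC's extended asm you can tell the compiler exactly which registers the snippet needs, and it will allocate and schedule around it (a rotate is just a convenient toy here):

    #include <stdint.h>

    static inline uint64_t rotl64(uint64_t x, unsigned c)
    {
        /* "+r" lets GCC pick any register for x (read and written);
           "c" forces the count into ecx/cl, which rol requires. */
        __asm__ ("rolq %%cl, %0" : "+r"(x) : "c"(c) : "cc");
        return x;
    }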
Truth. I only use asm for stuff I can't do in C, like access architecture-specific registers or use instructions for which there are no C operators. Anything else strikes me as a bit of an exercise in machismo at the expense of sound engineering.
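A typical example of that kind of asm use: reading the time-stamp counter, for which plain C has no operator (sketch, GCC-style inline asm assumed):

    #include <stdint.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        /* rdtsc returns the counter split across edx:eax. */
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }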
As someone who has had to port code that made use of assembly (x86 to arm), I appreciate the fact that they also made a portable C implementation that could easily be switched on.
I think delay slots and similar tricks have been abandoned not because they failed to enable high performance, but for pragmatic and business reasons: the details end up changing every 2-3 years, so you need to re-compile all your code, and it takes 2-3 years for the compiler to get good and stable.
I've never worked on compilers, or anything near the assembly level, but wouldn't changing details like delay slots be minor changes to the compiler that could be made with fairly minimal effort?
Also, wouldn't it be possible to ship the code in an intermediate form that closely resembles the hardware, but allows the OS to finish compiling it with the correct details?
> Also, wouldn't it be possible to ship the code in an intermediate form that closely resembles the hardware, but allows the OS to finish compiling it with the correct details?
That's what you can do with LLVM or Java byte code.
LLVM is far from platform independent, though, and encodes a lot of detail into its bitcode (struct alignment, architecture specific types, to name but two variations).
> the details end up changing every 2-3 years, so you need to re-compile all your code
I could be wrong, but I think that was part of the original plan: Machine code would be thrown away along with the hardware, only the (implicitly C) sources would be saved, and you'd rebuild the world with your shiny new compiler revision to make use of a shiny new computer. Implicit in this model is the idea that compilers are smart and fast and can take advantage of minor hardware differences.
Compare this to the System/360 philosophy, where microcode is meant to 'paper over' all differences between different models of the same generation and even different generations of the same family so machine code is saved forever and constantly reused. (This way of doing things was introduced with the System/360, as a matter of fact.) Implicit in this model is the idea that compilers are slow, stupid, and need a high-level machine language where microcode takes advantage of low-level machine details.
A half-step between these worlds is bytecode, which can either be run in an interpreter or compiled to machine code over and over again. The AS/400, also from IBM, takes the latter approach: Compilers generate bytecode, which is compiled down to machine code and saved to disk when the program is first run and whenever the bytecode is newer than the machine code on disk; when upgrading, only the bytecode is saved, and the compilation to machine code happens all over again. IBM was able to transition its customers from CISC to RISC AS/400 hardware in this fashion.
As you said, the world didn't work like the RISC model, and we now have hardware designed on the System/360 model along with compilers even better than the ones RISC systems had designed for them. Getting acceptable performance out of C code has never been easier, but going the last mile to get the absolute most means making increasingly fine distinctions between types of hardware that all try hard to look exactly the same to software.
> It seems that if you accept the façade that says "x86 is x86" you're better off writing in C
Sure... but also remember that most C code ends up having to be compiled for the lowest common denominator. All your x86 Linux distro binaries for instance will be compiled for 'generic' x86-64, i.e. something compatible with the 10 year old AMD64 3000+.
Ok, you'll say we can just turn on arch-specific compiler options (-msse4.2, -march=core2, etc.), and this is true, but there's still less incentive for compiler programmers to produce wonderful transforms and optimisations utilising SSE 4.2 and AVX2, or a very specific target pipeline/behaviour, than there is for generic optimisations that can be applied across platforms or processor families. In fact a lot of SSE code generated by compilers is fairly rudimentary even with these options.
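If you want to see this for yourself, a toy loop is enough (hypothetical file name and command lines; actual codegen varies a lot by compiler and version):

    /* saxpy.c -- compare the assembly from, say:
         gcc -O3 -S saxpy.c                        (generic x86-64)
         gcc -O3 -march=core2 -msse4.2 -S saxpy.c  (arch-specific)   */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }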
Part of the 'problem' really is we have this model where x86 CPUs are being engineered with additional complexity to run old code faster, even though a lot of that old code generally doesn't need that performance. You only have to look at some of the other architectures to see some interesting approaches that x86 has ignored.
Perhaps we need to be more of the mindset that recompiling code when moving from AMD64 to Core2 is as obvious as recompiling it when moving from ARM to x86.
It's fortunate, since it let Intel encode the new XACQUIRE/XRELEASE of the Transactional Synchronization Extensions' Hardware Lock Elision using these short ancient prefixes:
"Hardware Lock Elision adds two new instruction prefixes XACQUIRE and XRELEASE. These two prefixes reuse the opcodes of the existing REPNE/REPE prefixes (F2H/F3H). On processors that do not support TSX, REPNE/REPE prefixes are ignored on instructions for which the XACQUIRE/XRELEASE are valid, thus enabling backward compatibility."
And REPNE RET, among others, similarly has an interpretation in Intel® Memory Protection Extensions (lawyers made me write that) to help make the new pointer bounds checking ABI transparent & interoperable with other code.
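As a rough sketch of how that backward compatibility gets used in practice (this assumes a GCC built with HLE support, i.e. -mhle and the __ATOMIC_HLE_* bits; the names are made up):

    #include <immintrin.h>   /* for _mm_pause */

    /* Spinlock whose lock/unlock get the xacquire/xrelease (F2/F3)
       prefixes; on pre-TSX parts they decode as REPNE/REPE and are
       simply ignored. */
    static int lock_var;

    static void hle_lock(void)
    {
        while (__atomic_exchange_n(&lock_var, 1,
                                   __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
            while (__atomic_load_n(&lock_var, __ATOMIC_RELAXED))
                _mm_pause();   /* spin without hammering the lock */
    }

    static void hle_unlock(void)
    {
        __atomic_store_n(&lock_var, 0,
                         __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
    }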
While we're at it, why did Ahmed Bougacha register an entire domain for this topic?!?! (srsly, I'm wondering if he's just super-passionate about this topic and had the resources to spare? Is this the beginning of a much larger set of questions related to the topic, etc.? )
Both! I have plenty of drafts in the area. Right now I’m working on binary analysis / translation / decompilation using LLVM (I’m actually gearing up for an RFC/patch on the mailing list): http://dagger.repzret.org.
If you’re curious, I gave a talk at the LLVM Euro conference about the kind of problems you’d have, and some interesting use cases.
We’ve been quiet ever since, I’ll get back to writing about the meat once we have something out there, and more feedback from the community!
This was really interesting, but one thing it didn't fully explain was why the fact that ret imm16 was fastpathed would make it a worse choice for this optimization than repz ret. Excuse my noobiness, but can anybody explain that to me?
Right, because ARM cores exhibit completely predictable behavior with, say, assignments to IP or instruction predicate bits?
All architectures are insane. MIPS, probably the epitome of a minimal/clean architecture, still has a vestigial 1980's pipeline stage (the branch delay slot) baked right into the ISA such that it can never be removed.
Is the x86 ISA a big mess? Yeah. But ARM is hardly better (anyone remember Jazelle?) Basically, if you don't want to look at the assembly stick to your compiler. That's what it's there for. But don't pretend that this stuff isn't present everywhere -- real engineering involves tradeoffs.
As someone who's fought for months with the ARM Cortex-A9 cache prefetch behaviour and the behaviour of the various memory attributes (not intuitive and only partially documented), I wholeheartedly agree with you.
Those are bits of hardware which are supposed to run at around GHz speeds; they might look clean enough on the surface, but when you dig deep enough, "hic sunt dracones".
x86 is great. You'll absolutely see kludges on RISC architectures (e.g. anything having to do with loading a 32-bit or 64-bit constant into a register). Then you have branch delay slots on MIPS, instruction alignment issues on certain PPC processors (G5), etc.
One of the nice things about x86 is that while the encoding is a little wonky, the semantics are great. No imprecise exceptions, no branch delay slots, strong memory ordering, etc. The last one in particular is absolutely wonderful in this day and age of multicore processors (scumbag RISC: makes you use memory fences everywhere; makes memory fences take 300 cycles).
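A concrete way to see that difference (plain C11 atomics, nothing exotic): the release store below compiles to an ordinary mov on x86, while weakly ordered ISAs need an explicit barrier to give the same guarantee.

    #include <stdatomic.h>

    atomic_int ready;
    int payload;

    void publish(int v)
    {
        payload = v;
        /* x86: a plain store. Weaker ISAs: store plus fence/special store. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consume(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;   /* spin until the producer has published */
        return payload;
    }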
This is an issue with AMD's branch predictor, not with the x86 ISA. The code would work fine with a single "ret" instead; however, people found using "rep ret" made it nanoseconds faster. It just shows how you can find optimizations in the most unexpected of places.