
Things like this are why I got out of writing ASM for anything back in the 1990s. It's gotten to the point where a tight loop might still be doable by hand in a performant way, but anything else is far too complicated to make perform well across the board. Along with that, you can't make it portable to other architectures.



> Along with that, you can't make it portable to other architectures.

Not only that, but it seems from other stuff I've read that you'd have a hard time making x86 code performant on more than one microarchitecture or family thereof, which means re-writing the same code for different manufacturers and different generations from the same manufacturer.

It seems that if you accept the façade that says "x86 is x86" you're better off writing in C or even Haskell, but if you peel the façade back to write fully performant code you're doomed either to keep rewriting it or to see it languish as processor technology moves on.

I'm immediately reminded of the early RISC design concept of "moving the microcode into the compiler": extremely simple hardware that explicitly exposed design details like pipeline length, with delay slots becoming ISA features, on the theory that compilers would handle all the resulting complexity much as microcode handled it in the System/360 and similar machines.

It turns out architectural branch delay slots were a mistake, but compilers have gotten better at turning obvious source code into performant machine code unless you need so much performance you have to dig into architectural details anyway.

It's unavoidably complicated. And now we're moving towards multi-core CPUs, which bring to the fore another area which is unavoidably complicated.


This is spot on. We use a lot of intrinsics, as we want to get at some of the underlying features of the architecture, but we almost never write asm. It becomes obsolete almost immediately and is often far worse than the compiler's first cut.

We had a former performance guru who once put a huge #ifdef I_DONT_CARE_ABOUT_PERFORMANCE around the "naive" C version of a loop and had his favourite x86 asm version on the #else branch of the ifdef. Needless to say, much hilarity resulted when defining this macro improved performance. I suspect that the naive version may have been a bad idea at one stage (Pentium 4, perhaps) but became a better version from Core 2 onwards...
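For anyone who hasn't seen the pattern, the guard presumably looked something like the sketch below. The macro name is from the story; the function and loop body are made up for illustration, and the real #else branch held the hand-written x86 asm (repeated here in C only so the sketch compiles everywhere):

    #include <stddef.h>

    void add_arrays(float *dst, const float *src, size_t n) {
    #ifdef I_DONT_CARE_ABOUT_PERFORMANCE
        /* the "naive" portable C loop */
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    #else
        /* in the anecdote this branch was the guru's favourite
           hand-scheduled x86 asm, not reproduced here */
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    #endif
    }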

Occasionally I'll get on a wild tear and think "Oh, I can do better than the compiler" after it generates some ridiculous-looking instruction sequence that seems to not be using enough registers (or whatever). I then find that hand-writing and hand-scheduling everything in accordance with what looks really good to me actually works worse than the insane-looking code from the compiler. Put it down to stuff like store-to-load forwarding, dynamic scheduling, register renaming, etc.


> Occasionally I'll get on a wild tear and think "Oh, I can do better than the compiler" after it generates some ridiculous-looking instruction sequence that seems to not be using enough registers (or whatever).

Compilers are very good at some things, like instruction scheduling and register allocation, but only poor to mediocre at others, like instruction selection. In addition, compilers are not allowed to do some optimizations, like changing the order of function calls to other translation units or doing arithmetic optimizations with floating-point numbers.

So if you have a piece of performance-critical code, you can still be smarter than the compiler alone by cooperating with it and writing compiler-friendly, close-to-the-metal code. Inline assembly code is not good for the compiler, it can't really optimize it at all. But by using CPU- and compiler-specific intrinsics to exploit your hardware's capabilities (SIMD, special CPU instructions) you can get your code to run close to the speed of light without compromising readability or maintainability.
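As a hedged illustration of the intrinsics-instead-of-inline-asm point (names and the SSE choice are mine, and it assumes an x86 target; note this is also one of those floating-point reorderings the compiler isn't allowed to do by itself):

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum n floats, n assumed to be a multiple of 4 for brevity.
       Plain C plus SSE intrinsics: the compiler still does register
       allocation, scheduling and unrolling around the intrinsics,
       which it cannot do inside an inline-asm block. */
    float sum_floats(const float *x, size_t n) {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(&x[i]));

        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }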

Yes, compilers these days are smart but thinking that you can't do better than that is a lazy man's fallacy.


"Inline assembly code is not good for the compiler, it can't really optimize it at all."

Actually it can, if you have a way to specify register usage (gcc does) and the programmer gives the right specification. Unfortunately, that almost never seems to happen in practice.
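A minimal sketch of what "specifying register usage" means with GCC's extended asm; the operation is trivial on purpose, the point is the constraints, not the instruction:

    /* "+r"(x) tells GCC that x is read and written and may live in any
       general-purpose register, so the compiler can still allocate
       registers and schedule surrounding code instead of treating the
       fragment as an opaque black box. */
    static inline unsigned add_one(unsigned x) {
        __asm__("addl $1, %0"      /* AT&T syntax, x86 assumed */
                : "+r"(x));
        return x;
    }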


> Actually it can

Yeah, I found this out the hard way. Turns out it's really ridiculously difficult to get GCC to put the rdtsc instruction where you want it :)

But you get the point: the compiler can optimize better if you use C and intrinsics than if you use inline asm.
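For reference, a common GCC-style rdtsc wrapper looks roughly like this; even with correct constraints and volatile, the compiler and the CPU are both free to move surrounding work across it, which is exactly why pinning it down is so hard (people add cpuid/lfence or use rdtscp to serialize):

    #include <stdint.h>

    static inline uint64_t read_tsc(void) {
        uint32_t lo, hi;
        /* volatile keeps GCC from removing or merging the asm, but it
           does not stop reordering of unrelated surrounding code */
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }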


Truth. I only use asm for stuff I can't do in C, like access architecture-specific registers or use instructions for which there are no C operators. Anything else strikes me as a bit of an exercise in machismo at the expense of sound engineering.
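cpuid is a good example of an instruction with no C operator; a minimal GCC/Clang wrapper might look like this (x86-64 assumed; 32-bit PIC code needs extra care because %ebx is reserved there, and GCC also ships <cpuid.h> for this):

    #include <stdint.h>

    static inline void cpuid(uint32_t leaf, uint32_t regs[4]) {
        __asm__ __volatile__("cpuid"
                             : "=a"(regs[0]), "=b"(regs[1]),
                               "=c"(regs[2]), "=d"(regs[3])
                             : "a"(leaf), "c"(0));
    }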


As someone who has had to port code that made use of assembly (x86 to ARM), I appreciate the fact that they also made a portable C implementation that could easily be switched on.


> a former performance guru

I know a lot of folks who have not reacted well to compilers becoming smarter than they are. =)


I think delay slots and similar tricks have been abandoned not because they failed to enable high performance, but for pragmatic and business reasons: the details end up changing every 2-3 years, so you need to re-compile all your code, and it takes 2-3 years for the compiler to get good and stable.


I've never worked on compilers, or anything near the assembly level, but wouldn't changing details like delay slots be minor changes to the compiler that could be made with fairly minimal effort?

Also, wouldn't it be possible to ship the code in an intermediate form that closely resembles the hardware, but allows the OS to finish compiling it with the correct details?


> Also, wouldn't it be possible to ship the code in an intermediate form that closely resembles the hardware, but allows the OS to finish compiling it with the correct details?

That's what you can do with LLVM bitcode or Java bytecode.


LLVM is far from platform-independent, though, and encodes a lot of detail into its bitcode (struct alignment and architecture-specific types, to name but two).


> the details end up changing every 2-3 years, so you need to re-compile all your code

I could be wrong, but I think that was part of the original plan: Machine code would be thrown away along with the hardware, only the (implicitly C) sources would be saved, and you'd rebuild the world with your shiny new compiler revision to make use of a shiny new computer. Implicit in this model is the idea that compilers are smart and fast and can take advantage of minor hardware differences.

Compare this to the System/360 philosophy, where microcode is meant to 'paper over' all differences between different models of the same generation and even different generations of the same family so machine code is saved forever and constantly reused. (This way of doing things was introduced with the System/360, as a matter of fact.) Implicit in this model is the idea that compilers are slow, stupid, and need a high-level machine language where microcode takes advantage of low-level machine details.

A half-step between these worlds is bytecode, which can either be run in an interpreter or compiled to machine code over and over again. The AS/400, also from IBM, takes the latter approach: Compilers generate bytecode, which is compiled down to machine code and saved to disk when the program is first run and whenever the bytecode is newer than the machine code on disk; when upgrading, only the bytecode is saved, and the compilation to machine code happens all over again. IBM was able to transition its customers from CISC to RISC AS/400 hardware in this fashion.

As you said, the world didn't work like the RISC model, and we now have hardware designed on the System/360 model along with compilers even better than the ones RISC systems had designed for them. Getting acceptable performance out of C code has never been easier, but going the last mile to get the absolute most means making increasingly fine distinctions between types of hardware that all try hard to look exactly the same to software.


> It seems that if you accept the façade that says "x86 is x86" you're better off writing in C

Sure... but also remember that most C code ends up having to be compiled for the lowest common denominator. All your x86 Linux distro binaries, for instance, will be compiled for 'generic' x86-64, i.e. something compatible with the 10-year-old Athlon 64 3000+.

Ok, well you say... we can just turn on arch-specific compiler options: -msse4.2, -march=core2, etc., and this is true, but there's still less incentive for compiler programmers to produce wonderful transforms and optimisations utilising SSE 4.2 and AVX2, or a very specific target pipeline/behaviour, than there is for generic optimisations that can be applied across platforms or processor families. In fact, a lot of the SSE code generated by compilers is fairly rudimentary even with these options.
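To make that concrete, a loop like the one below is what you'd hope -O3 -march=core2 (or -msse4.2/-mavx2) turns into good SSE/AVX code; in practice the output is often just the rudimentary kind mentioned above, which is why people drop to intrinsics for the hot loops (function name and signature are just for illustration):

    #include <stddef.h>

    void saxpy(float * restrict y, const float * restrict x,
               float a, size_t n) {
        /* candidate for auto-vectorization when arch flags are enabled */
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }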

Part of the 'problem' really is we have this model where x86 CPUs are being engineered with additional complexity to run old code faster, even though a lot of that old code generally doesn't need that performance. You only have to look at some of the other architectures to see some interesting approaches that x86 has ignored.

Perhaps we need to be more of the mindset that recompiling code when moving from AMD64 to Core2 is as obvious as recompiling it when moving from ARM to x86.



