Please show us proof - I have yet to see any non-toy program or routine with complex control flow, written in assembly, that beats a modern optimizing compiler.
You can optimize a small routine. Anything that's not a toy rapidly becomes impossible to optimize by hand.
Just the register allocation itself can be a gigantic pain to do by hand, even with the limited register set of x86, let alone RISCs with much larger register files.
> Just the register allocation itself can be a gigantic pain to do by hand, even with the limited register set of x86,
Actually, that's one of the main advantages of using Asm --- compilers are relatively horrible at register allocation, because they don't take a "holistic" view of the program, and end up shuffling values between registers, or between registers and memory, far more often. This is why handwritten Asm has a unique "texture" to it.
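To make that shuffling concrete, here is a minimal C sketch of the kind of register pressure at issue; the function, its name, and its numbers are my own illustration, not anything from this thread:

    #include <stddef.h>

    /* Illustrative only: sixteen running sums are live across every
       iteration. Together with the pointer and the counter, that
       exceeds the 16 general-purpose registers of x86-64, so a
       compiler will typically spill a few sums to the stack and
       reload them on each pass; that is the shuffling described
       above. A hand-coder can often restructure the work (say, two
       passes of eight sums each) so everything stays in registers. */
    void column_sums(const long *x, size_t rows, long out[16])
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0,
             s4 = 0, s5 = 0, s6 = 0, s7 = 0,
             s8 = 0, s9 = 0, s10 = 0, s11 = 0,
             s12 = 0, s13 = 0, s14 = 0, s15 = 0;
        for (size_t i = 0; i < rows; i++) {
            const long *row = x + 16 * i;
            s0  += row[0];  s1  += row[1];  s2  += row[2];  s3  += row[3];
            s4  += row[4];  s5  += row[5];  s6  += row[6];  s7  += row[7];
            s8  += row[8];  s9  += row[9];  s10 += row[10]; s11 += row[11];
            s12 += row[12]; s13 += row[13]; s14 += row[14]; s15 += row[15];
        }
        out[0]  = s0;  out[1]  = s1;  out[2]  = s2;  out[3]  = s3;
        out[4]  = s4;  out[5]  = s5;  out[6]  = s6;  out[7]  = s7;
        out[8]  = s8;  out[9]  = s9;  out[10] = s10; out[11] = s11;
        out[12] = s12; out[13] = s13; out[14] = s14; out[15] = s15;
    }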
> let alone RISCs with much larger register files
There's much less room for optimisation with handwritten Asm on a RISC: on x86 the opportunities are precisely those the compiler can't easily see, whereas a very uniform and boring RISC leaves few such blind spots.
You say those things. Please show me an example of a real-world, general-purpose, non-toy-sized program or routine that is hand-optimized and shown to perform better than an optimizing compiler.
I realize you believe what you are saying, which makes me feel bad about downvoting and moving on, so I'll try a more complete response.
First, you are setting up a classic "no true Scotsman". By accepting that "toy" routines can be optimized, you leave open the response that (by your definition) anything that can be optimized must be a "toy". Such arguments usually aren't very satisfying to participate in.
But perhaps we can start from there. Assume we have a complex program that relies on a numerical computation that takes 90% of the runtime. While this isn't the norm for a modern GUI program used at home, I'll assert from authority and experience that such programs do exist within the realm of scientific computing.
I'll then go on to assert from authority and experience that you can often optimize such routines with vectorized assembly to run 2x as fast on a particular modern processor. In theory, a compiler could do the same. In practice, they don't --- or at least, not unless you are willing to write performance-fragile and non-portable code that's actually more fragile than clear assembly.
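For readers who haven't written this kind of code, here is a hedged sketch of what "vectorized but non-portable" looks like in C with AVX intrinsics; the function name and the alignment and multiple-of-4 assumptions are mine, and whether a given compiler turns this into the ideal instruction sequence is exactly the fragility in question:

    #include <immintrin.h>
    #include <stddef.h>

    /* Sketch: double-precision dot product with AVX intrinsics.
       Assumes n is a multiple of 4 and both arrays are 32-byte
       aligned, purely to keep the example short. */
    double dot_avx(const double *a, const double *b, size_t n)
    {
        __m256d acc = _mm256_setzero_pd();
        for (size_t i = 0; i < n; i += 4) {
            __m256d va = _mm256_load_pd(a + i);
            __m256d vb = _mm256_load_pd(b + i);
            acc = _mm256_add_pd(acc, _mm256_mul_pd(va, vb));
        }
        double t[4];
        _mm256_storeu_pd(t, acc);
        return t[0] + t[1] + t[2] + t[3];
    }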
As for a specific example, how about this one: http://chemfp.com. By using an approach that was possible with assembly but not with C, I was able to give about a 40% overall boost in speed to this program when used for a particular task. I optimized it in assembly, it's not a toy, and it's faster as a result.
It's also true that I was able to get about a 30% boost while sticking with compiler vector intrinsics and a particular version of a particular compiler, but the performance was so fragile to version changes that it seemed safer to go with inline assembly. And the only reason I was able to get this boost was that I already knew the assembly I was trying to generate.
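For context, chemfp searches binary chemical fingerprints by Tanimoto similarity, which boils down to population counts of ANDed and ORed words. The thread doesn't show the actual kernel or the trick behind the 40%, so the following is only my guess at the shape of the hot loop, with invented names:

    #include <nmmintrin.h>  /* POPCNT intrinsic; compile with -msse4.2 */
    #include <stdint.h>
    #include <stddef.h>

    /* Tanimoto(a, b) = popcount(a AND b) / popcount(a OR b), summed
       over the words of two equal-length bit fingerprints. This is
       the textbook formulation, not chemfp's tuned code. */
    double tanimoto(const uint64_t *fp1, const uint64_t *fp2, size_t nwords)
    {
        uint64_t both = 0, either = 0;
        for (size_t i = 0; i < nwords; i++) {
            both   += (uint64_t)_mm_popcnt_u64(fp1[i] & fp2[i]);
            either += (uint64_t)_mm_popcnt_u64(fp1[i] | fp2[i]);
        }
        return either ? (double)both / (double)either : 0.0;
    }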
I think instead the right lessons are:
1) It's often possible for an experienced programmer to make things run significantly faster on a specific generation of processors by optimizing in assembly.
2) Blind translation to assembly usually won't achieve much of an improvement, and might even hurt. It requires deeper knowledge of the internals of how processors work, roughly comparable to the level of understanding a compiler designer might need.
3) Outside of special cases where extreme performance is necessary, it's rarely worth pursuing this sort of optimization in one-off programs. Usually, the effort is better spent in libraries and compilers that will allow greater leverage.
GEMM is a fairly special case, and even there the claim is unproven as far as I can see. The OpenBLAS kernels are certainly bigger than they need to be, but it's not clear to me they satisfy the request (especially the "holistic" treatment of registers), and "OpenBLAS" isn't a performance measurement.
Since I'm interested in this, I'd like to see a reproducible case with an analysis of the performance difference. BLAS developers have made statements about GCC optimization failures (at least around vectorization) that don't match my experience with any recent GCC. BLIS' generic C kernel already runs at ~60% (I forget the exact number) of the speed of the Haswell assembler code for large DGEMM without any tuning attempt, perhaps with GCC extensions. (I didn't check whether the blocking is right for Haswell, or pursue an analysis with MAQAO or something.)
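For anyone unfamiliar with the term, here is a rough plain-C sketch of the micro-kernel shape being discussed; the MR/NR values and the simplified interface are my choices, and BLIS' real micro-kernels also take alpha, beta, and stride arguments and are tuned per core:

    #include <stddef.h>

    /* A "generic C kernel" in the BLIS sense: accumulate one small
       MR x NR block of C from packed panels of A and B. */
    enum { MR = 4, NR = 4 };  /* illustrative block sizes */

    void dgemm_ukernel(size_t k, const double *a, const double *b,
                       double *c, size_t ldc)
    {
        double acc[MR][NR] = {{0.0}};
        for (size_t p = 0; p < k; p++)          /* depth loop */
            for (size_t i = 0; i < MR; i++)     /* packed A: MR per step */
                for (size_t j = 0; j < NR; j++) /* packed B: NR per step */
                    acc[i][j] += a[p * MR + i] * b[p * NR + j];
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                c[i * ldc + j] += acc[i][j];
    }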