For the limited scope of rectangular loop nests without loop-carried dependencies, my Julia library LoopVectorization.jl is able to vectorize loops and often achieve significantly better performance than Clang, GCC, or the Intel compilers; see the different examples here: https://chriselrod.github.io/LoopVectorization.jl/latest/exa...
However, that's only one component of vectorization. It's still relying on LLVM to do lots of clean up, among other things.
I feel like there's a lot of reverence surrounding compilers and their capabilities, but at least when it comes to SIMD, it's not hard to find examples where a human can do much better, or can at least come up with algorithms that will do better for that and related problems.
I think saying compilers produce better asm than a programmer is like saying statistical libraries like `GLM.jl` are better at doing statistics than statisticians. True in that you should let the routines crunch the numbers. The skill is in deciding which routines/analysis to run, how to organize them, and -- when necessary -- when to develop your own.
I don't know what to make of results without any idea how to reproduce them.
For what it's worth, BLIS' generic C DGEMM code gave around 60% of the hand-coded kernel's performance on Haswell with GCC (8, from memory) on large matrices, without any real effort to speed it up. Obviously that's not a simple triply-nested loop, since just vectorizing is useless for large dimensions, and the generic code does get auto-vectorized well.
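For reference, the "simple triply-nested loop" being contrasted with the BLIS kernels is just the textbook form; a minimal sketch in plain C (function name and layout are mine, not BLIS code):

    #include <stddef.h>

    /* Textbook dgemm: C += A * B, row-major, no blocking, no packing.
       This is the form auto-vectorizers handle easily, but it falls off
       a cliff once the matrices no longer fit in cache. */
    void dgemm_naive(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < n; ++i)
            for (size_t k = 0; k < n; ++k)      /* i-k-j order keeps the   */
                for (size_t j = 0; j < n; ++j)  /* inner loop unit-stride. */
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }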
There are results from Polly/LLVM showing it doing well against tuned GEMM. However, Polly apparently pattern-matches the loop nest and replaces it with a GotoBLAS-like implementation, and I don't know whether it's now in clang/flang. I also don't know how well the straight Pluto optimization does relative to BLAS, or whether that's what the original Polly uses; it's a pity GCC's version isn't useful.
I should include more detailed instructions.
Dependencies include Eigen, gcc, clang, and (harder to come by) the Intel compilers, as well as a recent version of Julia.
There are other examples. Even for code as simple as a dot product, clang shows some pretty bad performance characteristics, while gcc needed a few infrequently used compiler options to do well there.
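For concreteness, the dot product kernel in question is nothing exotic; a plain C sketch (the exact gcc options needed aren't listed above, so the comment only gestures at the usual suspects):

    #include <stddef.h>

    /* A plain dot product. Reassociation-enabling flags such as
       -ffast-math are the usual suspects for getting a reduction
       like this vectorized well. */
    double dot(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }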
Image filtering / convolutions without compile-time-known kernel sizes are another example that showed a dramatic speedup.
When talking about optimized gemm: LoopVectorization currently only does register tiling, so its matmul performance will continue to degrade beyond roughly 200x200 on my desktop. An optimized BLAS implementation would still need to take care of packing arrays to fit into cache, multithreading, etc.
Maybe I should add Pluto.
I did test Polly before, but it took several minutes to compile the simple C file, and the only benchmark that seemed to improve was the gemm without any transposes. And it still didn't perform particularly well there -- worse than the Intel compilers, IIRC.
Have there been attempts to measure modern compilers' ability to optimize large programs against "perfect" machine code? The few times I've poked through demo scene binaries, I've felt a sense of awe at the complexity described by just a few instructions, but demos aren't starting from a specification and trying to find a compressed representation; they're trying to create small binaries that expand into something complex and interesting. Clearly this problem is NP-complete as hell (and most software doesn't need to be optimized instruction-by-instruction), but it would be incredibly neat to have an estimate of _how much worse_ our compilers might be.
I'm reminded of the old story about someone who tried to evolve an FPGA configuration to perform signal processing, and the final generation exploited imperfections in the chip to use fewer gates.
That would indeed be very neat. I think it's also impossible to do in a reasonable way, because how well a compiler does is probably dependent in a huge way on the particular program. There is a huge class of program transformations that are technically correct but far too specific to implement in a general-purpose compiler, and I think how well a compiler does is mostly dependent on how many of those transformations happen to apply to your program at hand.
For example: suppose you have a program that renders a 3D scene from a hardcoded scene description. A compiler might do a lot of stuff to that program, but what it will certainly not do is notice that the whole hardcoded scene can be replaced with a small representation in the form of a distance estimation function, if the rendering algorithm is also changed from raytracing to distance estimation.
A human in a demo scene situation will certainly perform (or at least try to perform) that transformation, if only because distance estimation is well known to often provide compact representations of certain, a priori very complex, scenes.
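As a hypothetical sketch of what a distance estimator buys you: an entire sphere collapses to one line, and sphere tracing then marches a ray by the returned distance, which is always a safe step (the scene, numbers, and names here are illustrative only):

    #include <math.h>

    /* Distance estimator for a sphere of radius r centred at (0, 0, 5). */
    static float sphere_de(float px, float py, float pz, float r)
    {
        float dz = pz - 5.0f;
        return sqrtf(px*px + py*py + dz*dz) - r;
    }

    /* Sphere tracing: march a ray from the origin along the unit direction
       (dx, dy, dz), stepping by the estimated distance each iteration. */
    static float march(float dx, float dy, float dz, float r)
    {
        float t = 0.0f;
        for (int i = 0; i < 128; ++i) {
            float d = sphere_de(t*dx, t*dy, t*dz, r);
            if (d < 1e-4f) return t;      /* close enough: hit */
            t += d;                       /* nothing is closer than d */
            if (t > 100.0f) break;        /* ray escaped the scene */
        }
        return -1.0f;                     /* miss */
    }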
Also, nitpick:
> Clearly this problem is NP-complete as hell
I think it's almost certainly not even in NP; that would require a polynomial-time algorithm that, given a certificate you get to choose, verifies that a particular program really is optimal. I think that's still impossible (but I'd gladly be proven wrong). If I'm correct, it's not NP-complete, it's NP-hard.
> how well a compiler does is mostly dependent on how many of those transformations happen to apply to your program at hand
Right, there's an infinite number of distinct useful code optimizations, there's a cost to checking if any given optimization can be applied, and some optimizations have _massive_ costs for very rare and specific savings, so any given compiler is making a practical decision to include some optimizations and omit others. There was some discussion in the Rust community about LLVM having a period where they weren't tracking regressions in compile times - so "the perfect optimizing compiler" isn't really a coherent goal. But I still wonder how much faster Optimal Chromium would be. Just another interesting number I'll never know, I suppose.
> I think it's almost certainly not even in NP
Yeah, that was sloppy of me, I think this is undecidable by Rice's theorem. Nice catch!
The field you're looking for is superoptimization. Last I checked most of the work focused on loop-free snippets so that program length could serve as a proxy for other performance metrics. STOKE shows up frequently in those discussions, but people have tried all kinds of things ranging from genetic algorithms to neural networks to SMT solvers.
To actually answer your question, compilers are pretty good, and so are expert human assemblers, but they both miss things even for short snippets.
And the flip side is that STOKE and friends can find new, faster ways to compile code... only for short snippets :).
That said, compilers and superoptimizers can go together well. One of my favorite examples is using program synthesis to generate peephole optimization rules for a compiler[1]. The synthesis step takes a long time and only works for short snippets, but if you can turn the results of that into rules a compiler's optimization system can use, you only have to pay the synthesis cost once and the compiler can apply those rules to speed up large programs.
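To make "peephole rule" concrete, here is a toy sketch of one such rule being applied over a tiny made-up IR; the rule (multiply by a power of two becomes a shift) is a textbook strength reduction, not one taken from the cited work:

    /* A toy IR: one opcode, a destination register, an immediate operand. */
    typedef enum { OP_MUL_IMM, OP_SHL_IMM } op_t;
    typedef struct { op_t op; int dst; int imm; } insn_t;

    /* One peephole rule: x * 2^k  ->  x << k.  A real compiler carries
       many such rules and matches them over windows of instructions,
       not one instruction at a time. */
    static void peephole(insn_t *code, int n)
    {
        for (int i = 0; i < n; ++i) {
            int m = code[i].imm;
            if (code[i].op == OP_MUL_IMM && m > 0 && (m & (m - 1)) == 0) {
                int k = 0;
                while ((1 << k) < m) ++k;   /* k = log2(m) */
                code[i].op  = OP_SHL_IMM;
                code[i].imm = k;
            }
        }
    }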
The evolved circuit used fewer gates, but it only worked on the device under test, and when they removed gates that were apparently doing nothing, it failed to perform the task. ML cheating at its best!
"cheating" isn't, when it wasn't even taught what the rules were in the first place!
I wonder if there's a similar analogy with how a young kid, or someone completely unfamiliar with something, can come up with an amazing solution that more experienced people would never think of, because the latter have "hardened" their thinking into accustomed patterns and don't go beyond them.
It boils down to various kinds of normalization - as in functional programming's normal form. If you can get a normal form of an expression (or canonical form), you can fuse several seemingly different expressions into one and share its computation in several places.
As an example from outside functional programming, consider instruction set architectures.
With 3-address RISC instructions and 32 registers, "x := y - z" can be encoded in 32^3 different ways. It is not practical to treat all of them as separate things and share them.
The situation is less severe in 2-address ISAs and even less severe in 1-address (accumulator-based) ISAs (the 8086 is a blend of 1- and 2-address).
For zero-address ISAs it is even better. There is only one way to compute y - z: issue "-", which takes two arguments from the stack and pushes the result back. The situations that lead to that subtraction are far less diverse than in the 3-address RISC encoding. And here comes the gist: you can factor identical sequences of opcodes into a subroutine (a "word" in Forth parlance) and replace those sequences with a call to it.
By programming in Forth you are doing LZ compression on programs.
If you can't compress your program more, you have a perfect Forth program, and I am not quite joking here.
The same applies, albeit less dramatically and visibly, to other things in programs. It is possible to recognize algorithmic kernels in programs (by matching optimized data and control flow against a pattern) and replace them with different implementations: less computationally intensive, parallelized, parametrized, etc. There has been such work in the optimization literature.
Again, in kernel matching, optimization goes first, because optimization makes the program more canonical, so to speak, and matching on canonical forms is easier.
"Optimal" compilation is actually harder than NP-complete. Because program equivalence is undecidable, finding a canonical "optimal" representation for a program can't be done in general.
What about superoptimization?
For instance, Stoke (http://stoke.stanford.edu/) or Souper (https://github.com/google/souper). They will not find all optimizations of course, and program equivalence is undecidable indeed, but they are a good shot at optimal compilation, I would say.
Just because a problem is undecidable that doesn't mean that you can't in fact solve it for a large set of interesting instances (maybe all instances you care about).
It means the idea isn't necessarily wholly impractical. That's not nothing.
I imagine superoptimisers scale extremely poorly, but that's a different question. I imagine they're best suited to bit-twiddling problems rather than, say, figuring out the best assembly code to implement a parallel Timsort.
I have to admit ignorance here though, I know almost nothing about superoptimisers.
It will be hard to find agreement on what perfect machine code is. Smallest binary? Fastest? Fewest instructions? Lowest power consumption? Able to run equally well across more CPU generations? Able to be understood and debugged by more people? Using CISC silicon, which has more functionality baked in, versus a RISC design? Etc., etc.
At the level the article looks at, it is not even worth doing better, since the potential gain is minuscule compared to the available disk and memory, and this has been the case at least since the late 80s. That does not mean it is always better to let the compiler do all the optimization.
There are still cases where clever use of SIMD instructions can pay off, especially in video processing, deep learning engines, and numerical code. Even in those cases it is hard to beat the compiler: you have to know about branch prediction, about instruction latency, and about the fact that modern processors execute many instructions in parallel (superscalar execution). You also need to know about the cache hierarchy, but that's the case even when you let the compiler generate the code. Writing highly optimized code like that is not something every programmer will do, but most will use code written by others that is partly optimized assembly. ffmpeg is one example.
Within Ardour, a cross-platform DAW, we introduced hand-written SSE-based asm for metering, which is the most expensive computational operation in the application (without plugins): compare every single (floating point) audio sample that passes through to some recent max and min, and if larger or smaller, store it as the new max or min. The savings were dramatic, though dependent on the size of the buffers that get processed. With larger buffers, adding this asm reduced DSP load by 30%.
This was done 15 years ago, with a few minor updates about 6 years ago. Recently, we've revisited it to add support for AVX on Linux, which was missing.
It turns out that gcc can now do more or less as good a job as the hand-crafted asm. The hand-crafted version understood (mis)alignment better than the gcc of 15 years ago, but at this point gcc's inner loop is one instruction shorter than the hand-crafted version, although its setup before reaching the inner loop is several instructions longer.
There is still some human skill required to understand the best way to suggest to gcc what is going on, but it increasingly seems as though it's likely best to let the compiler do the actual code generation, even for SIMD.
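For readers who haven't seen it, the metering kernel is conceptually just this (a C sketch of the idea, not Ardour's actual source):

    #include <stddef.h>

    /* Running peak detection over an audio buffer: the loop that was
       originally hand-written in SSE and that current gcc now
       auto-vectorizes about as well. */
    void find_peaks(const float *buf, size_t n, float *min, float *max)
    {
        float lo = *min, hi = *max;
        for (size_t i = 0; i < n; ++i) {
            if (buf[i] < lo) lo = buf[i];
            if (buf[i] > hi) hi = buf[i];
        }
        *min = lo;
        *max = hi;
    }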
Compilers get better every year. Hand-tuned assembly only gets better if someone pays attention to it.
So even if you can beat the compiler today, you might not be able to beat it tomorrow.
Therefore it is often useful to have two switchable implementations: high-level code that is basically straightforward, and hand-tuned assembly. You can then compare the two with every new compiler release.
Not only that, but your micro-architecture-specific hand-tuning probably isn't good for tomorrow's micro-architecture, and definitely not for a new architecture. You still need to craft kernels in cases like BLAS, but you really want to minimize them, e.g. BLIS in the case of BLAS.
Except if the algorithm is changed. On https://codegolf.stackexchange.com there are regularly solutions that are very short, but by necessity also employ a suboptimal algorithm, and are thus slower (sometimes very much slower) than the optimal program in terms of performance.
If you want to see the assembly code a compiler produces from high level languages, Compiler Explorer is awesome for that (https://gcc.godbolt.org/)
You type your C/C++ code in the left pane, and it generates assembly in the right pane as you type. Color coding shows which C/C++ code corresponds to which assembly, and if you hover over an assembly instruction such as mov, a tooltip appears telling you what it does.
It makes exploring what the compiler is doing easy and a lot of fun.
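For instance, pasting a snippet like this (hypothetical example) and switching between -O0 and -O3 -march=native shows the loop go from scalar code to vectorized code:

    /* A small loop to paste into Compiler Explorer: watch the generated
       assembly change as you flip optimization levels and -march flags. */
    void scale(float *a, int n, float s)
    {
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }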
There will probably always be particular code transformations (algorithmic or target-machine-specific) that speed up your code and that your general purpose compilers will not find. (And that's good, because otherwise the compile time would be horrible!)
The good thing about compiler optimizations is not that they produce actual "optimal" binaries, but that they allow you to not care too much about performance when writing code (e.g. using abstractions that will be inlined, etc.) and still get a decent binary.
If you then require more performance (which will probably not be the case for most applications) you can still search for bottlenecks and start optimizing by hand there.
It's not that people do better than compilers; it's that compilers often do badly. That's why checking the disassembly in tight spots is an integral procedure in our team. Compilers and people together, by definition, do no worse than compilers alone.
Also, this is an odd choice of program to compare on. Compilers often do a suboptimal job of superscalar scheduling, inlining, or simplifying control flow. In this example there isn't enough room to go wrong, either for a compiler or for a programmer.
Isn't the first thing to do to look at what the compiler can tell you (-fopt-info for GCC)? That can be helpful without deep knowledge of the target, or knowing its assembler. There are also dumps from stages before assembly generation which may be helpful. Is it deep knowledge of the whole system that makes going straight to the assembly best?
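For example, something along these lines (the loop is illustrative; -fopt-info-vec and -fopt-info-vec-missed are the relevant GCC variants):

    /* Compile with, say,
           gcc -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c saxpy.c
       and GCC reports which loops it vectorized and why others were skipped. */
    void saxpy(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }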
Yes, GCC intermediates are quite nice. Lispy and readable. And indeed, you can take a look at every stage. But LLVM intermediate code is basically just another assembly language. And as for MSVC, I'm not sure if it's even accessible. So it's not always an option.
Either way, the disassembly is our end result, so unless you're looking for something in particular, it makes sense to browse it first just to check that the compiler got your intent right.
#include <numerous_caveats.h>. We are not checking that everything got written, let alone handling a partial write. Systems have some limit on how many elements you can have in the iov array passed to writev.
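A sketch of what handling those caveats might look like, assuming POSIX writev and IOV_MAX and keeping error handling minimal:

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <limits.h>
    #include <errno.h>

    /* Write all of iov, retrying after partial writes and respecting IOV_MAX.
       A sketch of the caveats mentioned above, not production code. */
    int writev_all(int fd, struct iovec *iov, int iovcnt)
    {
        while (iovcnt > 0) {
            int batch = iovcnt < IOV_MAX ? iovcnt : IOV_MAX;
            ssize_t n = writev(fd, iov, batch);
            if (n < 0) {
                if (errno == EINTR) continue;
                return -1;
            }
            /* Skip over fully written buffers, then trim a partial one. */
            while (iovcnt > 0 && (size_t)n >= iov->iov_len) {
                n -= (ssize_t)iov->iov_len;
                ++iov;
                --iovcnt;
            }
            if (iovcnt > 0 && n > 0) {
                iov->iov_base = (char *)iov->iov_base + n;
                iov->iov_len -= (size_t)n;
            }
        }
        return 0;
    }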
I find that manual assembly is mostly useful when I know something about the data or desired operation that is not possible to communicate to the compiler.
I feel that if you're one of the rare people who know how and when crafted assembly works better, it's almost always also a better use of your time to work on the compiler, rather than whatever you are working on.
Even years ago, when the compilers were not as good, this is exactly what most people did: compile it, look at the asm, sometimes spot better bits here and there, and replace that function with an asm one. If you are in a space where that perf really matters, you do this more often.
Even more difficult: for how long? Compilers get better, and CPUs gain new instructions one might miss in the manual code. If you have your tight critical loop in assembler, you have to re-verify it regularly.
In cases of parallel processing where code executing in the critical path is a significant bottleneck, hand-rolled SIMD can sometimes outperform what gcc or clang spits out by a few clock cycles/iteration. But even then, the sanest way to do that is to let the compiler do most of the legwork and then incrementally tweak its output.
IME gcc is extraordinarily bad at anything but the most basic of auto-vectorizations, but its output from SIMD intrinsics is decent -- not usually quite as good as hand-rolled assembly, but much faster to write (for me anyway, YMMV) and still within 2x performance.
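To give a rough feel for the intrinsics route, a minimal AVX sketch (assumes <immintrin.h>; alignment handling and unrolling are left out):

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum of an array with AVX intrinsics: the compiler still allocates
       registers and schedules, but the vectorization strategy is explicit. */
    float sum_avx(const float *x, size_t n)
    {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

        /* Horizontal reduction of the 8 lanes, then the scalar tail. */
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float s = 0.0f;
        for (int k = 0; k < 8; ++k) s += lanes[k];
        for (; i < n; ++i) s += x[i];
        return s;
    }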
It's definitely the case that some of the ELF header fields can be repurposed to store additional data; the following articles all address that optimization:
There are also a few instructions in the article that are size-suboptimal (e.g. "movl $4, %eax" is 5 bytes (b804000000), while the equivalent "xor %eax, %eax; movb $4, %al" is only 4 bytes (31c0b004)).
I tried to make it as small as possible, but I think there is indeed more to be saved by putting code in the ELF header; I'm not very familiar with the ELF format. This is as small as I could get it without doing that, using just yasm, ld, and strip.
It depends on how much information you give to the compiler. A modern GCC can do magic like AVX vectorization and OpenMP parallelization if you give your function the right restrict and pragma annotations.
Something like #pragma simd may actually be worse than letting GCC do its thing. It may not use FMA, just AVX, for instance. You may also want (a subset of) -ffast-math for NEON, reductions, etc., if your code is OK with it.
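A small example of the kind of annotations meant here, assuming the OpenMP simd pragma (compile with -fopenmp or -fopenmp-simd); the function itself is illustrative:

    /* restrict promises no aliasing; the pragma asks for vectorization of
       the reduction.  With -O3 -march=native (and -ffast-math if your code
       can tolerate it) GCC will typically emit FMA here. */
    double dot_annotated(const double *restrict x, const double *restrict y, int n)
    {
        double s = 0.0;
        #pragma omp simd reduction(+:s)
        for (int i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }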