I don't know what to make of results without any idea how to reproduce them.
For what it's worth, BLIS' generic C DGEMM code achieved around 60% of the hand-coded kernel's performance on Haswell with GCC (version 8, from memory) on large matrices, without any real effort to speed it up. Obviously that's not a simple triply-nested loop, since just vectorizing is useless at large dimensions, but the generic code does get auto-vectorized well.
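To be clear about terms, by "simple triply-nested loop" I mean something like this sketch (in Julia, names mine, not BLIS's code):

```julia
# Naive triple-loop matmul; a sketch of the baseline, not BLIS's code.
function naive_mul!(C, A, B)
    @inbounds for j in axes(C, 2), i in axes(C, 1)
        s = zero(eltype(C))
        @simd for k in axes(A, 2)
            s += A[i, k] * B[k, j]  # the compiler can vectorize this reduction...
        end
        C[i, j] = s  # ...but at large sizes memory traffic dominates anyway
    end
    return C
end
```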
There are results from Polly/LLVM showing it doing well against tuned GEMM. However, Polly apparently pattern-matches the loop nest and replaces it with a Goto-BLAS-like implementation, and I don't know whether it's now in clang/flang. I also don't know how well the straight Pluto optimization does relative to BLAS, or whether that's what the original Polly used; it's a pity GCC's version (Graphite) isn't useful.
I should include more detailed instructions.
Dependencies include: Eigen, gcc, clang, and (harder to set up) the Intel compilers, as well as a recent version of Julia.
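The Julia side would be something like the sketch below; the package names are my guess at what the benchmarks pull in, so treat it as hypothetical:

```julia
# Hypothetical setup sketch for the Julia side of the benchmarks.
using Pkg
Pkg.add(["LoopVectorization", "BenchmarkTools"])
# Eigen, gcc, clang, and the Intel compilers must be installed separately,
# e.g. via the system package manager (the Intel compilers by hand).
```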
There are other examples. Even for code as simple as dot products, clang shows some pretty bad performance characteristics, while gcc needs a few infrequently used compiler options to do well there.
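For concreteness, the kernel in question is essentially the following (a sketch, not necessarily the benchmark's exact code). The C analogue is a floating-point reduction, which gcc won't vectorize without options that permit reassociation, e.g. -ffast-math; I assume those are the kind of infrequently used flags involved:

```julia
# Simple dot product; a sketch of the benchmarked pattern.
function mydot(a, b)
    s = zero(promote_type(eltype(a), eltype(b)))
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]  # @simd permits reassociating this reduction
    end
    return s
end
```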
Image filtering / convolutions with kernel sizes that aren't known at compile time are another example that showed a dramatic speedup.
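As a sketch of that pattern (a "valid" correlation-style filter whose kernel size is only known at runtime; the names are mine, not the benchmark's):

```julia
# 2D filter with a runtime-sized kernel; out should have size
# size(img) .- size(kern) .+ 1. Because the kern trip counts aren't
# compile-time constants, the compiler can't fully unroll these loops.
function filter2d!(out, img, kern)
    @inbounds for j in axes(out, 2), i in axes(out, 1)
        s = zero(eltype(out))
        for l in axes(kern, 2)
            @simd for k in axes(kern, 1)
                s += img[i + k - 1, j + l - 1] * kern[k, l]
            end
        end
        out[i, j] = s
    end
    return out
end
```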
On the subject of optimized GEMM: LoopVectorization currently only does register tiling, so its matmul performance will keep degrading beyond 200x200 or so on my desktop. An optimized BLAS implementation would additionally need to handle packing arrays to fit into cache, multithreading, etc.
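To illustrate the distinction, one level of cache blocking over such a kernel might look like the sketch below; a real BLAS would also pack the A/B panels into contiguous buffers and thread the outer loops (the block size and names are mine):

```julia
# One level of cache blocking; a minimal sketch, not an optimized BLAS.
function blocked_mul!(C, A, B; blk = 96)
    fill!(C, zero(eltype(C)))
    M, N = size(C)
    K = size(A, 2)
    for jb in 1:blk:N, kb in 1:blk:K, ib in 1:blk:M  # keep blocks cache-resident
        for j in jb:min(jb + blk - 1, N), k in kb:min(kb + blk - 1, K)
            Bkj = B[k, j]
            @inbounds @simd for i in ib:min(ib + blk - 1, M)
                C[i, j] += A[i, k] * Bkj  # the register-level part LoopVectorization handles
            end
        end
    end
    return C
end
```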
Maybe I should add Pluto.
I did test Polly before, but it took several minutes to compile the simple C file, and the only benchmark that seemed to improve was the GEMM without any transposes. Even there it didn't perform particularly well, worse than the Intel compilers, IIRC.
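For anyone who wants to try reproducing that, Polly is enabled from clang roughly like this (the flags are from the Polly docs; the file name is just a placeholder):

```
clang -O3 -mllvm -polly gemm.c -o gemm
```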