I don't know what to make of results without any idea how to reproduce them.
For what it's worth, BLIS' generic C DGEMM code achieved around 60% of the hand-coded kernel's performance on Haswell with GCC (version 8, from memory) on large matrices, without any real effort to speed it up. Obviously that's not a simple triply-nested loop, since just vectorizing is useless at large dimensions, but the generic code does get auto-vectorized well.
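To be clear about terms, by "simple triply-nested loop" I mean something like this sketch (in Julia, names mine, not BLIS's code):

```julia
# Naive triple-loop matmul; a sketch of the baseline, not BLIS's code.
function naive_mul!(C, A, B)
    @inbounds for j in axes(C, 2), i in axes(C, 1)
        s = zero(eltype(C))
        @simd for k in axes(A, 2)
            s += A[i, k] * B[k, j]  # the compiler can vectorize this reduction...
        end
        C[i, j] = s  # ...but at large sizes memory traffic dominates anyway
    end
    return C
end
```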
There are results from Polly/LLVM showing it doing well against tuned GEMM. However, Polly apparently pattern-matches the loop nest and replaces it with a Goto-BLAS-like implementation, and I don't know whether it's now in clang/flang. I also don't know how well the straight Pluto optimization does relative to BLAS, or whether that's what the original Polly used; it's a pity GCC's version (Graphite) isn't useful.
I should include more detailed instructions.
Dependencies include: Eigen, gcc, clang, and (harder to set up) the Intel compilers, as well as a recent version of Julia.
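The Julia side would be something like the sketch below; the package names are my guess at what the benchmarks pull in, so treat it as hypothetical:

```julia
# Hypothetical setup sketch for the Julia side of the benchmarks.
using Pkg
Pkg.add(["LoopVectorization", "BenchmarkTools"])
# Eigen, gcc, clang, and the Intel compilers must be installed separately,
# e.g. via the system package manager (the Intel compilers by hand).
```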
There are other examples. Even for code as simple as dot products, clang shows some pretty bad performance characteristics, while gcc needs a few infrequently used compiler options to do well there.
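For concreteness, the kernel in question is essentially the following (a sketch, not necessarily the benchmark's exact code). The C analogue is a floating-point reduction, which gcc won't vectorize without options that permit reassociation, e.g. -ffast-math; I assume those are the kind of infrequently used flags involved:

```julia
# Simple dot product; a sketch of the benchmarked pattern.
function mydot(a, b)
    s = zero(promote_type(eltype(a), eltype(b)))
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]  # @simd permits reassociating this reduction
    end
    return s
end
```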
Image filtering / convolutions with kernel sizes that aren't known at compile time are another example that showed a dramatic speedup.
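As a sketch of that pattern (a "valid" correlation-style filter whose kernel size is only known at runtime; the names are mine, not the benchmark's):

```julia
# 2D filter with a runtime-sized kernel; out should have size
# size(img) .- size(kern) .+ 1. Because the kern trip counts aren't
# compile-time constants, the compiler can't fully unroll these loops.
function filter2d!(out, img, kern)
    @inbounds for j in axes(out, 2), i in axes(out, 1)
        s = zero(eltype(out))
        for l in axes(kern, 2)
            @simd for k in axes(kern, 1)
                s += img[i + k - 1, j + l - 1] * kern[k, l]
            end
        end
        out[i, j] = s
    end
    return out
end
```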
On the subject of optimized GEMM: LoopVectorization currently only does register tiling, so its matmul performance will keep degrading beyond 200x200 or so on my desktop. An optimized BLAS implementation would additionally need to handle packing arrays to fit into cache, multithreading, etc.
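To illustrate the distinction, one level of cache blocking over such a kernel might look like the sketch below; a real BLAS would also pack the A/B panels into contiguous buffers and thread the outer loops (the block size and names are mine):

```julia
# One level of cache blocking; a minimal sketch, not an optimized BLAS.
function blocked_mul!(C, A, B; blk = 96)
    fill!(C, zero(eltype(C)))
    M, N = size(C)
    K = size(A, 2)
    for jb in 1:blk:N, kb in 1:blk:K, ib in 1:blk:M  # keep blocks cache-resident
        for j in jb:min(jb + blk - 1, N), k in kb:min(kb + blk - 1, K)
            Bkj = B[k, j]
            @inbounds @simd for i in ib:min(ib + blk - 1, M)
                C[i, j] += A[i, k] * Bkj  # the register-level part LoopVectorization handles
            end
        end
    end
    return C
end
```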
Maybe I should add Pluto.
I did test Polly before, but it took several minutes to compile the simple C file, and the only benchmark that seemed to improve was the GEMM without any transposes. Even there it didn't perform particularly well, worse than the Intel compilers, IIRC.
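For anyone who wants to try reproducing that, Polly is enabled from clang roughly like this (the flags are from the Polly docs; the file name is just a placeholder):

```
clang -O3 -mllvm -polly gemm.c -o gemm
```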