Do you know how much faster llama.cpp would go on something like an Intel Core i9 (has AVX2 but not AVX512) when it's compiled using `cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx`? Are we talking like 10% faster inference, or 100%?
Right now I'm reasonably certain llamafile is doing about the best job it can be doing on Intel/AMD, supporting SSSE3-only, AVX-only, and AVX2+F16C+FMA microprocessors at runtime. In fact, there's even an issue with the upstream llama.cpp project where they want to get rid of all the external BLAS dependencies. llama.cpp authors claim their quantization trick has actually enabled them to outdistance things like cuBLAS and I'd assume MKL too, which at best, can only operate on f32 and f16. https://github.com/ggerganov/ggml/issues/293
My concern with MKL is also that, judging by the llama.cpp README's brief mention of using it, adding support sounds like it'd entail a lot more than just dynamically linking a couple GEMM functions from libmkl.so/dll/dylib. It sounds like we'd have to go all in on some environment shell script and intel compiler. I also remember MKL being a huge pain on the TensorFlow team, since it's about as proprietary as it gets.
Last year we got 10x performance improvements on pytorch stable diffusion although there was more to it than just using MKL.
Not sure how well this works for LLM. But the hardware is much, much faster than people think - even before using the ML accelators that some new CPUs have - but the software support seems to be lacking.
The intersection between people who use emacs for coding, and those who own a mac studio ultra must be miniscule.
Intel MKL + some minor tweaking gets you really excellent LLM performance on a standard PC, and that's without using the GPU.