Apple could dramatically improve performance if they just assigned one Metal engineer to llama.cpp. Like, just to finish up flash attention and the quantized KV cache, and optimize the Metal kernels.
I wouldn't be surprised if they could double performance.
I know Apple is pushing MLX, and MLC-LLM is fast too, but in practice most Mac users (I think) are using llama.cpp based stacks.