Apple could dramatically improve performance if they just assigned one Metal engineer to llama.cpp. Like, just to finish up flash attention and the quantized KV cache, and optimize the Metal kernels.
I wouldn't be surprised if they could double performance.
I know Apple is pushing MLX, and MLC-LLM is fast too, but in practice most Mac users (I think) are using llama.cpp based stacks.