I just got NPU-based LLM inference working locally on Snapdragon X Elite with small (3B and 8B) models, but it’s not quite production-ready yet. I know all the llama.cpp wrappers claim to have it on their roadmap, but the fact of the matter is that they have no clue how to implement it.
> M1 MacBook was 30 times faster at generating tokens.
Apples and oranges (pardon the pun). llama.cpp (and in turn LMStudio) use Metal GPU acceleration on Apple Silicon, while they currently only do CPU inference on Snapdragon.
It’s possible to use the Adreno GPU for LLM inference (I demoed this at the Snapdragon Summit), which performs better.