What matters for Qwen models, and most/all local MoE models (ie. where the performance is limited) is memory bandwidth. This goes for small models too. Here's the top Apple chips by memory bandwidth (and to steal from clickbait: Apple definitely does not want you to think too closely about this):
M3 Ultra — 819 GB/s
M2 Ultra — 800 GB/s
M1 Ultra — 800 GB/s
M5 Max (40-core GPU) — 610 GB/s
M4 Max (16-core CPU / 40-core GPU) — 546 GB/s
M4 Max (14-core CPU / 32-core GPU) — 410 GB/s
M2 Max — 400 GB/s
M3 Max (16-core CPU / 40-core GPU) — 400 GB/s
M1 Max — 400 GB/s
Or, just counting portable/macbook chips: M5 max (top model, 64/128G) M4 max (top model, 64/128G), M1 max (64G). Everything else is slower for local LLM inference.
TLDR: An M1 max chip is faster than all M5 chips, with the sole exception of the 40-GPU-core M5 max, the top model, only available in 64 and 128G versions. An M5 pro, any M5 pro (or any M* pro, or M3/M2 max chip) will be slower than an M1 max on LLM inference, and any Ultra chip, even the M1 Ultra, will be faster than any max chip, including the M5 max (though you may want the M2 ultra for bfloat16 support, maybe. It doesn't matter much for quantized models)
But it doesn't matter if you have 1000GB/s memory bandwidth if you only have 32GB of vram. Well, maybe for some applications it works out (image generation?), but its not seriously competing with an ultra with 128 GB of unified memory or even a max with 64 GB if unified memory.
> but its not seriously competing with an ultra with 128 GB of unified memory or even a max with 64 GB if unified memory.
No one is arguing that either, this sub-thread is quite literally about the memory bandwidth. Of course there are more things to care about in real-life applications of all this stuff, again, no one is claiming otherwise. My reply was adding additional context to the "What matters [...] is memory bandwidth" parent comment, nothing more, hence the added context of what other consumer hardware does in memory bandwidth.
If we are talking about Apple silicon, where we can configure the memory separately from the bandwidth (and the memory costs the same for each processor), we can say something like "its all about bandwidth". If we switch to GPUs where that is no longer true, NVIDIA won't let you buy an 5090 with more 32GB of VRAM, then...we aren't comparing apples to apples anymore.
There is also prompt processing that's compute-bound, and for agentic workflows it can matter more than tg, especially if the model is not of "thinking" type.
M3 Ultra — 819 GB/s
M2 Ultra — 800 GB/s
M1 Ultra — 800 GB/s
M5 Max (40-core GPU) — 610 GB/s
M4 Max (16-core CPU / 40-core GPU) — 546 GB/s
M4 Max (14-core CPU / 32-core GPU) — 410 GB/s
M2 Max — 400 GB/s
M3 Max (16-core CPU / 40-core GPU) — 400 GB/s
M1 Max — 400 GB/s
Or, just counting portable/macbook chips: M5 max (top model, 64/128G) M4 max (top model, 64/128G), M1 max (64G). Everything else is slower for local LLM inference.
TLDR: An M1 max chip is faster than all M5 chips, with the sole exception of the 40-GPU-core M5 max, the top model, only available in 64 and 128G versions. An M5 pro, any M5 pro (or any M* pro, or M3/M2 max chip) will be slower than an M1 max on LLM inference, and any Ultra chip, even the M1 Ultra, will be faster than any max chip, including the M5 max (though you may want the M2 ultra for bfloat16 support, maybe. It doesn't matter much for quantized models)