It doesn't beat an RTX 4090 when it comes to actual LLM inference speed. I bought a Mac Studio for local inference because it was the most convenient way to get something fast enough and with enough RAM to run even 155b models. It's great for that, but ultimately it's not magic: NVIDIA hardware still offers more FLOPS and faster RAM.
> It doesn't beat RTX 4090 when it comes to actual LLM inference speed
Sure, whisper.cpp is not an LLM. The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.
I wonder if with https://github.com/tinygrad/open-gpu-kernel-modules (the 4090 P2P patches) it might become a lot faster to split a too-large model across multiple 4090s and still outperform ASi (at least until someone at Apple does an MLX LLM).
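The naive version of that split is already doable even without the P2P patches; here's a rough, untested sketch using transformers + accelerate (the model ID and prompt are placeholders), where device_map="auto" just shards layers across whatever GPUs are visible. The P2P patches would presumably mostly help the GPU-to-GPU traffic between the shards.

    # Rough sketch, untested: shard one too-large model across all visible GPUs.
    # Assumes transformers + accelerate are installed; the model ID is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-org/some-120b-model"  # placeholder, pick your own
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",  # accelerate spreads layers across every GPU it can see
    )
    inputs = tok("Hello", return_tensors="pt").to("cuda:0")  # inputs go to the first shard
    out = model.generate(**inputs, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))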
PSA for everyone still being misled by hand-wavy Apple M1 marketing charts[1] implying total dominance of M-series wondersilicon obsoleting all Intel/NVIDIA PCs:
There is benchmark data showing that an Apple M2 Ultra is 47% slower than a Xeon W9 and 60% slower than an RTX 4090, and even 0.35% and 2% slower than an i9-13900K and an RTX 4060 Ti, in the Geekbench 5 multi-threaded and OpenCL compute tests respectively.
Apple Silicon Macs are NOT faster than competing desktop computers, nor, for that matter, was the M1 massively faster than an NVIDIA 3070 (the desktop card, which is roughly 2x faster than the laptop variant the M1 was actually compared against). They just offer up to 128GB of shared RAM/VRAM in slim desktops and laptops, which is handy for LLMs, and that's it.
Please stop taking Apple marketing materials at full face value or above. Thank you.
> The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.
Common LLM runners can split model layers between VRAM and system RAM; a PC rig with a 4090 can do inference on models larger than 24GB.
I don't know where the crossover point is between having the whole thing in Apple Silicon unified memory vs. splitting layers on a PC between a 4090 and system RAM, but it's definitely not "more than 24GB and the 4090 does nothing".
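With llama.cpp-style runners the split is a single knob; rough sketch with llama-cpp-python (the path and layer count are made up, tune to taste):

    # Rough sketch: put as many layers in the 4090's 24GB as fit, run the rest from system RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/some-70b.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=40,  # layers offloaded to VRAM; the remaining layers stay on the CPU
        n_ctx=4096,
    )
    print(llm("Q: Name three colors. A:", max_tokens=32)["choices"][0]["text"])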
> Common LLM runners can split model layers between VRAM and system RAM; a PC rig with a 4090 can do inference on models larger than 24GB.
Sure, and ASi can do inference on models larger than its unified memory if you account for streaming the weights from the SSD on demand. That doesn't mean it's going to be as fast as keeping the whole thing in RAM, although ASi SSDs are probably not particularly bad as far as SSDs go.
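That streaming mode is basically just mmap: llama.cpp maps the weight file and lets the OS page it in from the SSD on demand. Rough sketch with llama-cpp-python (the path is a placeholder):

    # Rough sketch: mmap the weights so the OS pages them in from the SSD on demand,
    # which lets a model bigger than free RAM run at the cost of disk reads during inference.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/bigger-than-ram.Q4_K_M.gguf",  # placeholder path
        use_mmap=True,    # map the file instead of copying it all into RAM (the default)
        use_mlock=False,  # don't pin pages, so the OS can evict and re-read them as needed
        n_ctx=2048,
    )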
"Slightly slower" in this case is more like 10x. I have an M3 Max with 128GB RAM: the 4090 trashes it on anything under 24GB, and the M3 Max trashes the 4090 on anything above 24GB, but on those larger models it's still about 10x slower than the 4090 is on <24GB models.