The previous time this article was submitted, I did some calculations based on the charts and found[1] that for the NVIDIA 40- and 50-series GPUs, the results are almost entirely explained by memory bandwidth:
Each of the cards except the 5090 gets almost exactly 0.1 tokens/s per GB/s of memory bandwidth.
My understanding is that the Macs have soldered memory, which allows for much higher memory bandwidth. The M4 Max has ~400-550 GB/s max depending on configuration[2], while EPYCs seem to have more like 250 GB/s max[3].
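As a rough illustration of that rule of thumb, here is a minimal Python sketch that multiplies a bandwidth figure by the ~0.1 tokens/s per GB/s ratio. The function name and the labels are made up for this example, and applying the ratio to anything other than the NVIDIA cards it was observed on is an assumption, not a measurement (the 5090 already did not fit it).

```python
# Back-of-the-envelope estimate only: assumes token generation is purely
# memory-bandwidth bound and that the ~0.1 tokens/s per GB/s ratio seen on
# the NVIDIA 40/50-series charts carries over to other hardware.
TOKENS_PER_SEC_PER_GBPS = 0.1  # observed ratio from the comment, not a constant of nature


def estimate_tokens_per_second(bandwidth_gb_s, ratio=TOKENS_PER_SEC_PER_GBPS):
    """Return a rough tokens/s estimate for a given memory bandwidth in GB/s."""
    if bandwidth_gb_s < 0:
        raise ValueError("bandwidth must be non-negative")
    return bandwidth_gb_s * ratio


if __name__ == "__main__":
    # Bandwidth figures are the approximate ones quoted in the comment above.
    examples = {
        "Mac, lower config (~400 GB/s)": 400,
        "Mac, higher config (~550 GB/s)": 550,
        "EPYC (~250 GB/s, as quoted)": 250,
    }
    for label, bandwidth in examples.items():
        print(f"{label}: ~{estimate_tokens_per_second(bandwidth):.0f} tokens/s")
```

If the ratio held, that would put the quoted bandwidths at roughly 40, 55, and 25 tokens/s respectively, for the same model and quantization as in the charts.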
Ah shoot, that's what one gets for being in a hurry and on the phone. Saw the date of the article and mention of the EPYC 9004, but forgot that it's the 9005 that's the new series and missed the details.
Thanks for the correction.
edit: found a llama.cpp issue discussing performance bottlenecks on modern dual-socket EPYC here[1]. It also includes single-socket benchmarks and some optimizations. Just thought it was interesting.
[1]: https://news.ycombinator.com/item?id=42847284
[2]: https://support.apple.com/en-us/121553
[3]: https://www.servethehome.com/here-is-why-you-should-fully-po...