The M3 Ultra has 819 GB/s, and a single EPYC CPU with 12 channels of DDR5-4800 has 460 GB/s. As far as I know, llama.cpp and friends don't scale across multiple sockets, so you can't use a dual-socket Turin system to match the M3 Ultra.
Also, 32 GB DDR5 RDIMMs are ~$200, so that's ~$5K for 24 of them right there. Then you need 2 CPUs at ~$1K each for the cheapest, plus a motherboard for another $1K. So for $8K (more, given that you still need a case, power supply, and cooling!), you get a system with about half the memory bandwidth, much higher power consumption, and a much larger footprint.
Partial correction: an EPYC CPU with 12 channels of DDR5-6000 has 576 GB/s, i.e. 6000 MT/s x 768 bits. That is 70% of the Apple memory bandwidth, but with possibly much more memory (768 GB in your example).
You do not need 2 CPUs. If, however, you use 2 CPUs, the memory bandwidth doubles to 1152 GB/s, exceeding Apple's by 40%. The cost of the memory would be about the same, by using 16 GB modules, but the motherboard would be more expensive and the second CPU would add to the price.
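For reference, here is the arithmetic behind the figures quoted in this thread; these are theoretical peaks derived from the transfer rate, not measured throughput:

```c
/* Peak DRAM bandwidth = transfer rate x bytes per transfer x channels.
 * All figures are the theoretical peaks discussed above, not benchmarks. */
#include <stdio.h>

int main(void) {
    const double bytes_per_transfer = 8.0; /* one 64-bit DDR5 channel */
    const int channels = 12;

    double ddr5_4800 = 4800e6 * bytes_per_transfer * channels / 1e9;
    double ddr5_6000 = 6000e6 * bytes_per_transfer * channels / 1e9;

    printf("DDR5-4800, 1 socket:  %.0f GB/s\n", ddr5_4800);      /* ~461 */
    printf("DDR5-6000, 1 socket:  %.0f GB/s\n", ddr5_6000);      /*  576 */
    printf("DDR5-6000, 2 sockets: %.0f GB/s\n", 2 * ddr5_6000);  /* 1152 */
    printf("M3 Ultra (published): 819 GB/s\n");
    return 0;
}
```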
Perhaps this is no longer true, but I also know that with 2x 4090s you don't get higher tokens per second than with 1x 4090 in llama.cpp, just more memory capacity.
(All of this only applies to llama.cpp; I have no experience with other software and how memory bandwidth may scale across sockets.)
The memory bandwidth does double, but to exploit it the program must be written and executed with careful memory placement, taking NUMA into account, so that the cores access mostly memory attached to the closest memory controller and not memory attached to the other socket.
With a badly organized program, performance can be limited not by the memory bandwidth, which is exactly double on a dual-socket system, but by the transfers over the inter-socket links.
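As a minimal sketch of what careful memory placement can look like, here is the idea with libnuma (link with -lnuma); the buffer size and node number are just placeholders, not anything llama.cpp actually does:

```c
/* Sketch: pin a worker to one socket and allocate its slice of the model
 * weights on that same socket's memory, so reads stay on the local memory
 * controller instead of crossing the inter-socket link.
 * Error handling is mostly omitted for brevity. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    const size_t bytes = 1UL << 30;  /* placeholder: 1 GiB of weights */
    const int node = 0;              /* placeholder: this worker's socket */

    numa_run_on_node(node);                          /* keep threads local */
    void *weights = numa_alloc_onnode(bytes, node);  /* backed by local DRAM */
    if (!weights) return 1;

    /* ... run this worker's share of the inference against `weights` ... */

    numa_free(weights, bytes);
    return 0;
}
```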
Moreover, your link is about older Intel Xeon Sapphire Rapids CPUs, which have inferior memory interfaces and more quirks in memory optimization.
There are also reports about the scaling of llama.cpp and DeepSeek on some dual-socket AMD systems. While it was rather tricky, after many experiments they obtained almost double the speed on two sockets, especially on AMD Turin.
However, if you look at the actual benchmark data, that must be much lower than what is really possible. Their AMD Turin test system (named P1 there) had only two thirds of the memory channels populated, so performance limited by memory bandwidth could be increased by 50%, and it used 16-core CPUs, so performance limited by computation could be increased around 10 times.
CPUs typically do not have enough compute. You'll be bottlenecked by compute before bandwidth if the model is large enough.
Time to first token, context length, and tokens/s are significantly worse on CPUs for larger models, even when the bandwidth is the same.
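One way to see the two limits together: for single-stream decoding, every generated token has to stream the (active) weights once, so bandwidth alone sets a ceiling of roughly bandwidth / model size; the point above is that CPUs often can't even reach that ceiling because compute saturates first. A rough sketch, where the bandwidth and model size are illustrative assumptions:

```c
/* Rough bandwidth-only ceiling on single-stream decode speed:
 * tok/s <= bandwidth / bytes read per token (about the model size for a
 * dense model). Numbers below are assumptions for illustration, not
 * measurements. */
#include <stdio.h>

int main(void) {
    double bw_gb_s  = 576.0; /* 12-channel DDR5-6000 socket, from above */
    double model_gb = 40.0;  /* e.g. a ~70B-parameter model at 4-bit */

    printf("bandwidth ceiling: ~%.0f tok/s\n", bw_gb_s / model_gb); /* ~14 */
    return 0;
}
```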
One big server CPU can have a computational capability similar to a mid-range desktop NVIDIA GPU.
When used for ML/AI applications, a consumer GPU has much better performance per dollar.
Nevertheless, when you want much more memory than fits in a desktop GPU, a dual-socket server can have higher memory bandwidth than most desktop GPUs, i.e. more than an RTX 4090, and a computational capability that for FP32 could exceed an RTX 4080, though it would be slower for low-precision data where the NVIDIA tensor cores can be used.
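For a rough sense of where the FP32 claim comes from: peak throughput is cores x FMA pipes x FP32 lanes x 2 (multiply+add) x clock. The core count and all-core clock below are illustrative assumptions, not a specific SKU:

```c
/* Peak FP32 estimate: flops/cycle/core = pipes x lanes x 2 (FMA).
 * Core count and all-core clock are assumptions for illustration. */
#include <stdio.h>

int main(void) {
    const int cores        = 192;  /* assumed high-core-count server part */
    const int fma_pipes    = 2;    /* assumed two 512-bit FMA units/core */
    const int fp32_lanes   = 16;   /* FP32 lanes in a 512-bit vector */
    const double clock_ghz = 2.5;  /* assumed all-core clock */

    double tflops = (double)cores * fma_pipes * fp32_lanes * 2 * clock_ghz / 1e3;
    printf("1 socket:  ~%.0f FP32 TFLOP/s\n", tflops);      /* ~31 */
    printf("2 sockets: ~%.0f FP32 TFLOP/s\n", 2 * tflops);  /* ~61 */
    printf("RTX 4080 (published peak): ~49 FP32 TFLOP/s\n");
    return 0;
}
```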
True, but I compared the FP32 throughput used in graphics computations, because for that the information is easily available.
Both CPUs (with the BF16 instructions, and with the VNNI instructions for INT8 inference) and GPUs have a higher throughput for lower-precision data types than for FP32, but the exact acceleration factors are hard to find.
The Intel server CPUs have the advantage over AMD of also having the AMX matrix instructions, which are intended to compete with the NVIDIA tensor cores for inference applications. However, Intel CPUs with enough cores to be competitive with GPUs are much more expensive.
The bandwidth difference likely doesn't make a difference, though. Benchmarks of Apple Silicon show that compute bottlenecks well before bandwidth runs out, even when fully loading all CPU cores, the GPU, etc.