I've replicated them myself with my own code, so I'm pretty confident. It doesn'...

kllrnohj · on Dec 1, 2020

> Pretty stunning to have latencies that low on a low end $700 mac mini that embarrasses machines that costs 10x that much. Even high end Epyc machines (200 watt TDP) with 8 x 64 bit memory channels have to try hard to get a cacheline every 10ns.

Eh? That's not how memory latency works. The cheaper consumer chips with "non-spec" RAM and without ECC are regularly better here than the enterprise stuff. This isn't something that scales with price.

sliken · on Dec 1, 2020

Sure, ECC and in particular registered memory does increase the latency a bit. But servers are designed for throughput and have multiple memory channels to better feed the large amount of cores involved, up to 64 cores for the new AMD epyc chips. The amazing thing is that the Apple M1 can fetch random cachelines almost as fast as a current AMD Epyc.

kllrnohj · on Dec 1, 2020

You're confusing throughput & latency here. More channels increases throughput, but doesn't improve latency.

The M1's memory bandwidth is ~68GB/s, which is of course a tiny fraction of AMD Epyc's ~200GB/s per socket.

Epyc's latency isn't even competitive with AMD's own consumer parts, so I'm really not sure why you're surprised that Epyc's latency is also worse than the M1's?

sliken · on Dec 1, 2020

I'm not surprised the latency on the M1 is better than Epyc, but it's near half of any other consumer part, like say the AMD Rzyen 5950x. When accessed in a TLB friendly way (not TLB thrashing) the M1 manages 30ns which is excellent.

Even more impressively is that the random cacheline throughput is also excellent. So if all 8 cores have a cache miss the M1 memory system is very good at keeping multiple pending requests in flight to achieve surprisingly good throughput. Granted this isn't pure latency, so I call it throughput. Getting a random cacheline per 12ns is quite good, especially for a cheap low power system. Normally getting more than 2 memory channels on a desktop requires something exotic like an AMD threadripper.