
So how are the cores interconnected? ARM server designs love to throw out tens to hundreds of cores, which sounds good on paper but isn't all that useful for lots of workloads when they come with a shit interconnect or lots of NUMA nodes.

Interconnect is key; see, e.g., the improvements between Zen 2 and Zen 3, especially for the `4 < cores <= 8` models, which stem mostly from having 8 cores per die with a direct, fast interconnect instead of two dies with only 4 cores each.

An anecdote: I got my hands on a box with a Cavium ThunderX2 (96 logical cores) about two years ago, and the kernel compiled slower than on my high-end, but non-overclocked and common, i7-based workstation from ~2015. That is certainly not a statistically relevant comparison, and there may be some workloads that do well on those CPUs, but it's neither single-threaded nor massively parallel work, maybe something in between...




This is a serious problem. On-die interconnect requires insane bandwidth and lots of caches. We're talking TB/s to keep the cores fed. This is a significant challenge (source: I worked on Intel Xeon Phi), and at the time I worked on it in 2007 (pre-name-change) there was not a lot of empirical data, just lots of simulations, guard-bands, and finger crossing.


What are guard-bands?


At the time the validation models were not very mature, so they moved everything up to higher metal layers with big-ass drivers and designed for the worst case. I'm sure Intel has a much better handle on it now (they ain't dummies!), but back then the orders were "don't let the ring(s) throttle the cores".


This is a dumb question, but as you've said, cores per die has a strong influence on how performant a system is, and that's because of the processor interconnects. My question is: what are the processors saying to each other?


It's not a dumb question; it's actually the defining characteristic of performance for large systems.

In a shared-memory multiprocessor, all cores and all memory are logically connected along a massive shared bus, and both inter-core communication and regular core-to-main-memory communication take up slots on the bus. In practice, of course, it isn't physically a single bus but a more complex topology (usually meshes, I think) with protocols that make it behave like a bus for cache-coherency purposes.

One of the readily apparent consequences of the interconnect is NUMA, non-uniform memory access. Historically (my information here is a few years out of date, so I don't know how much this is still true today), this has been a bigger issue for AMD than Intel, as AMD had wildly divergent memory latencies (2-3× between worst and best case) even for a 4-core processor. So if the OS decided to resume execution of a sleeping program on a different core, your performance might suddenly drop by ½ or ⅓.
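A minimal, Linux-specific sketch (my own illustration, not something from this thread) of the usual mitigation: pin the thread to one core so the scheduler can't migrate it to a core on a different NUMA node, away from the memory it first touched. The choice of core 0 is arbitrary.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* pin to core 0 (arbitrary choice) */

        /* Pin the calling thread; memory it touches afterwards is normally
           allocated on that core's NUMA node ("first touch" policy). */
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0) {
            fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return 1;
        }

        printf("pinned to CPU %d\n", sched_getcpu());
        return 0;
    }

Build with `cc -O2 -pthread`; tools like numactl/taskset do the same thing from the outside.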

For some benchmarks (such as classic SPEC, or make -j), multicore performance is approximated by running the same binary on every core, with no sharing between the different copies. The interconnect here matters only insofar as the cross-chatter on it may impede performance. But HPC applications tend to be much more sensitive to the actual interconnect, since reading and writing other processors' memory is more common (transferring halo data, or having global data structures that you need to read from and occasionally write to).
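To make the "halo data" point concrete, here is a small sketch (my own, not from the thread) of a 1-D relaxation where each thread owns a slice of a shared array and must read its neighbours' boundary cells every step; on a multi-socket machine those boundary reads are exactly the traffic that crosses the interconnect. Plain C11 with pthreads; sizes and thread count are arbitrary.

    #include <pthread.h>
    #include <stdio.h>

    #define N        (1 << 20)   /* interior grid points */
    #define NTHREADS 4
    #define STEPS    100

    static double grid[N + 2], next[N + 2];   /* +2 for fixed boundary cells */
    static pthread_barrier_t barrier;

    struct slice { int lo, hi; };             /* [lo, hi) owned by one thread */

    static void *worker(void *arg)
    {
        struct slice *s = arg;
        for (int step = 0; step < STEPS; step++) {
            /* grid[s->lo - 1] and grid[s->hi] belong to neighbouring
               threads -- this is the "halo" traffic. */
            for (int i = s->lo; i < s->hi; i++)
                next[i] = 0.5 * (grid[i - 1] + grid[i + 1]);
            pthread_barrier_wait(&barrier);   /* all reads done before swap */
            for (int i = s->lo; i < s->hi; i++)
                grid[i] = next[i];
            pthread_barrier_wait(&barrier);   /* swap done before next step */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct slice slices[NTHREADS];
        grid[0] = 1.0; grid[N + 1] = 0.0;     /* fixed boundary conditions */

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        int chunk = N / NTHREADS;
        for (int t = 0; t < NTHREADS; t++) {
            slices[t].lo = 1 + t * chunk;
            slices[t].hi = (t == NTHREADS - 1) ? N + 1 : 1 + (t + 1) * chunk;
            pthread_create(&tid[t], NULL, worker, &slices[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("grid[1] after %d steps: %g\n", STEPS, grid[1]);
        return 0;
    }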


It's not so much "processors" talking to each other as caches.

The basic theory is called "MESI": Modified / Exclusive / Shared / Invalid. When a core reads a cache line and no one else has it, that core is the Exclusive owner. But if multiple caches hold the cache line, it is a Shared line instead. If a remote core modifies the line, your copy becomes Invalid, while the writer's copy is placed into the Modified state (the cache is correct, RAM is stale).

MESI isn't how things are done in practice: additional messages and states are piled on top in proprietary ways. But MESI will get you to the "basic textbook" concept at least.

EDIT: Oh, and "snooping". Caches "snoop" on each other's communications, which helps keep the MESI state up to date. Now you can have directory-based snooping (kind of a centralized push), or rings, or whatever. Those details are proprietary and change from chip to chip. In practice, these details are almost never published, so your best bet to understanding is "Something MESI-like is going on..."
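A small sketch (my own, not from this thread) that makes that coherence traffic visible: two threads each increment their own counter, but when the two counters sit in the same cache line, every write invalidates the other core's copy and the line ping-pongs between caches (the classic "false sharing" effect); padding the counters onto separate lines removes the traffic. The 64-byte line size is an assumption.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 50000000UL

    struct shared_line { volatile unsigned long a, b; };   /* same cache line  */
    struct padded_line { volatile unsigned long a;
                         char pad[64];                      /* assumed line size */
                         volatile unsigned long b; };

    static _Alignas(64) struct shared_line s;
    static _Alignas(64) struct padded_line p;

    static void *bump(void *arg)          /* increment one counter ITERS times */
    {
        volatile unsigned long *c = arg;
        for (unsigned long i = 0; i < ITERS; i++)
            (*c)++;
        return NULL;
    }

    static double run(volatile unsigned long *x, volatile unsigned long *y)
    {
        pthread_t t1, t2;
        struct timespec t0, t3;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&t1, NULL, bump, (void *)x);
        pthread_create(&t2, NULL, bump, (void *)y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t3);
        return (t3.tv_sec - t0.tv_sec) + (t3.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("same cache line:      %.2f s\n", run(&s.a, &s.b));
        printf("separate cache lines: %.2f s\n", run(&p.a, &p.b));
        return 0;
    }

Build with `cc -O2 -pthread`; the gap between the two numbers is roughly the cost of the invalidation traffic described above.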


In the case of Zen/Zen2, some of the dies didn't have direct access to main memory but went through an adjacent die's memory controller and channel. This had a latency and performance cost, and required some extra work on the part of OS schedulers to make sure that tasks weren't scheduled on cores remote from their memory.

It's (relatively) easy to put lots of cores on one die; it's not so easy to keep them fed.


Modern CPUs like these are shared-memory multiprocessors. Programs on these CPUs are often designed to run on multiple cores at the same time to improve performance, which requires the cores to communicate in order to coordinate and share work. Since each core typically has a private cache, most of that communication involves cache-coherence protocols that ensure the cores have a consistent view of memory contents. Adding more cores increases the overall complexity of the system: the latency of these protocols grows as cores become more "distant" from one another across the processor interconnect. Minimizing that latency while scaling effectively to more cores is a difficult problem to solve within silicon and thermal budgets.
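As a rough illustration of that coherence latency (my own sketch, not the ping-pong test mentioned elsewhere in this thread), two threads can bounce a flag through a shared cache line with C11 atomics; dividing the wall time by the number of round trips estimates the core-to-core latency. Pinning the two threads to specific cores (e.g., with taskset) makes the number meaningful for a particular pair of cores.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ROUNDS 1000000

    static atomic_int turn = 0;          /* 0: ping's turn, 1: pong's turn */

    static void *ping(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ROUNDS; i++) {
            while (atomic_load_explicit(&turn, memory_order_acquire) != 0)
                ;                        /* spin until it's our turn */
            atomic_store_explicit(&turn, 1, memory_order_release);
        }
        return NULL;
    }

    static void *pong(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ROUNDS; i++) {
            while (atomic_load_explicit(&turn, memory_order_acquire) != 1)
                ;
            atomic_store_explicit(&turn, 0, memory_order_release);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, ping, NULL);
        pthread_create(&b, NULL, pong, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per round trip\n", ns / ROUNDS);
        return 0;
    }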


"knock knock" "who's there?"


ThunderX was a huge disappointment. ThunderX2 (which one may think is the successor of ThunderX, but is actually a completely different system that Cavium obtained by acquiring a different company that was also working on ARMv8 hardware) was a (not so huge) disappointment. Cavium tried to copy-paste lots of coprocessors and offload things from the CPU, but the overall system was not that great.

Early AMD SoftIron and Applied Micro boards (which had 8 cores, unlike the ThunderX's 48 or 96) were actually faster, which I always found interesting.

But Ampere's previous generation (before N1) is fast, much faster than the ThunderX2. Afaik, they built it on top of previous Applied Micro IP. So I'd expect N1 to be in a different league and not worth comparing to ThunderX2.


> But Ampere's previous generation (before N1) is fast, much faster than the ThunderX2.

I think you're getting ThunderX2 and ThunderX mixed up here.


In my experience, for tasks like everyday operation, kernel building, etc.: ThunderX < ThunderX2 < eMAG (Ampere's platform before N1).

I wouldn't say the two gaps are of the same magnitude, but both are definitely noticeable.


Another anecdote: in the 1990s, Sun servers had relatively "weak" SPARC CPUs; contemporary Intel chips would wipe the floor with them in integer performance tests. But they had superior interconnect and I/O channels, so, e.g., in 4-CPU configurations the SPARC-based servers showed much better application performance (say, running Oracle RDBMS) than the beefier Intel 4-CPU machines.

The connection fabric is utterly important, and expensive.


> slower than on my...

May have consumed far less power doing the job though. For server farms performance-per-watt can be far more important than absolute speed per unit, especially for massively parallel tasks.


The Cavium ThunderX2 is a 205 W TDP chip vs. a circa-2017 i7 (~105 W), so probably not.


> So how are the cores interconnected?

No one answered that, and from the core-to-core ping-pong test, there are some weird, possibly performance-crushing issues with the second processor.



