> you'd expect performance-portability between Graviton ARM and Ampere Altra I a...

dragontamer · on Jan 19, 2021

> Also, AFAIK on ARM the parts where CPUs integrate with the rest of the hardware are custom. The important thing for servers, disk and network I/O differs across ARM chips of the same ISA. Linux kernel abstracts it away i.e. stuff is likely to work, but I’m not so sure about performance portability.

Indeed. But Intel Xeon + Intel Ethernet integrates tightly and drops the Ethernet data directly into L3 cache (bypassing DRAM entirely).

As such, performance portability between x86 servers (in particular, Intel Xeon vs AMD EPYC) suffers from similar I/O issues. Even if you have AMD EPYC + Intel Ethernet, you lose the direct-to-L3 DMA and will see slightly weaker performance than Intel Xeon + Intel Ethernet.

The same goes for Intel Xeon + Optane optimizations, which do not exist on AMD EPYC + Optane. These I/O performance differences between platforms are already the status quo, and should be expected whenever you migrate between platforms. A degree of testing and tuning is always needed when changing platforms.

--------

>Still, are there many public clouds built of these Ampere Altra-s? Maybe we gonna have them widespread soon, but until then I wouldn’t want to build stuff that only runs on Amazon or my own servers with only a few on the market and not yet globally available on retail.

A fair point. Still, since Neoverse N1 is a premade core available to purchase from ARM, many different companies can buy it for themselves.

Current rumors suggest Microsoft and Oracle are simply planning to use Ampere Altra. But as with all other standard ARM cores, any company can buy the N1 design and make its own chip.

yaantc · on Jan 19, 2021

> > Also, AFAIK on ARM the parts where CPUs integrate with the rest of the hardware are custom. The important thing for servers, disk and network I/O differs across ARM chips of the same ISA. Linux kernel abstracts it away i.e. stuff is likely to work, but I’m not so sure about performance portability.

> Indeed. But Intel Xeon + Intel Ethernet integrates tightly and drops the Ethernet data directly into L3 cache (bypassing DRAM entirely).

This will be less of a problem on ARM servers, as direct access to the LLC from a hardware master is a standard feature of ARM's "DynamIQ Shared Unit" (DSU), the shared part of a cluster that provides the LLC and coherency support. Connect a hardware function to the DSU's ACP (Accelerator Coherency Port) and the hardware can control, for every write access, whether to "stash" data into the LLC or even into the L2 or L1 of a specific core. The hardware can also control whether or not a line is allocated on a miss. So any high-performance IP can benefit from it.
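To make the stashing idea concrete, here is a toy Python model of the behaviour described above. It is only a sketch: the names `StashTarget`, `Cluster`, `device_write` and `cpu_read` are made up for illustration, caches are modelled as unbounded sets of addresses, and nothing here reflects the DSU's real signal encoding or replacement policy.

```python
from enum import Enum


class StashTarget(Enum):
    """Hypothetical per-write hint chosen by the I/O device."""
    NONE = "none"  # no allocation: the write goes out to DRAM
    LLC = "llc"    # stash into the cluster's shared last-level cache
    L2 = "l2"      # stash into one core's private L2
    L1 = "l1"      # stash into one core's private L1


class Cluster:
    """Toy cluster: per-core L1/L2 plus one shared LLC (no capacity limits)."""

    def __init__(self, num_cores):
        self.l1 = {core: set() for core in range(num_cores)}
        self.l2 = {core: set() for core in range(num_cores)}
        self.llc = set()
        self.dram_writes = 0

    def device_write(self, addr, target, core=None):
        """A device write arriving on the coherent port with a stash hint."""
        if target is StashTarget.NONE:
            self.dram_writes += 1
        elif target is StashTarget.LLC:
            self.llc.add(addr)
        else:
            if core not in self.l1:
                raise ValueError("L1/L2 stashing needs a valid target core")
            level = self.l2 if target is StashTarget.L2 else self.l1
            level[core].add(addr)

    def cpu_read(self, core, addr):
        """Return where a read from `core` would find the line."""
        if addr in self.l1[core]:
            return "l1"
        if addr in self.l2[core]:
            return "l2"
        if addr in self.llc:
            return "llc"
        # Line sits in another core's private cache: served by a snoop.
        if any(addr in self.l1[c] or addr in self.l2[c] for c in self.l1):
            return "remote"
        return "dram"


cluster = Cluster(num_cores=4)
cluster.device_write(0x1000, StashTarget.LLC)         # e.g. a packet payload
cluster.device_write(0x2000, StashTarget.L2, core=1)  # e.g. a descriptor for core 1
print(cluster.cpu_read(0, 0x1000), cluster.cpu_read(1, 0x2000))
```

The point of the sketch is only that the device, not the CPU, picks the landing level per write, so a NIC could push descriptors close to the core that will process them while leaving bulk payload in the shared LLC.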

And if I understand correctly, the DSU is required with modern ARM cores. Since most vendors (besides Apple) tend to use ARM-designed cores now, you get this as part of the package.

More details here in the DSU tech manual: https://developer.arm.com/documentation/100453/0002/function...