Nvidia Unveils Grace: A High-Performance Arm CPU for Use in Big AI Systems (anandtech.com)
324 points by haakon on April 12, 2021 | 203 comments



Given that there are essentially no architectural details here other than bandwidth estimates, and the release timeline is in 2023, how exactly does this count as "unveiling"? Headline should read: "NVidia working on new arm chip due in two years", or something else much more bland.


Not quite. The CSCS supercomputing center in Switzerland has already started receiving the hardware (https://www.cscs.ch/science/computer-science-hpc/2021/cscs-d...). Perhaps we'll see some benchmarks. To wider HPC users it will only be available in 2023, as the article mentions.


The Alps system at CSCS will have racks with different processors, to be installed in phases. CSCS has taken delivery of the first racks with AMD EPYC processors, for non-GPU workloads. CSCS will be one of the first customers to get their hands on Grace Hopper, but they will have to wait until 2023.


Are there more sources for technical details about the new infrastructure? The interview linked above left me with more questions than answers.


I suspect that's more racks of storage, not racks of compute. Nothing to suggest it's compute.


As I understand it, it's compute, just not CPU compute; those CPUs are designed to be good enough for CUDA servers.


Hey Ian, I love reading your posts on Anandtech, you're a fantastic technical communicator.


Hopefully some architectural details are forthcoming then! But that is not what is in this article.


The CPU cores are probably not that interesting; it's going to be the GPU and interconnect stuff (pretty impressive if true) that's going to drive this?


It says they use Arm Neoverse cores so it is another processor like Fujitsu A64FX and Amazon Graviton 2.


As AMD has shown us, a lot can happen in 3 years.


Based on a future ARM Neoverse core, so basically nothing much to see here from a CPU perspective. What really stands out are those ridiculous numbers from its memory system and interconnect.

CPU: LPDDR5X with ECC at 500+ GB/s memory bandwidth. (Something Apple may dip into. R.I.P. for Macs with upgradable memory.)

GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a typo.

NVLink: 500GB/s

This will surely further solidify CUDA's dominance. Not entirely sure how Intel's Xe with oneAPI and AMD's ROCm are going to compete.


> GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a typo.

It's a good step forward but your average consumer GPU is already around a quarter to a third of that and a Radeon VII had 1000 GB/s two years ago.


I think what you’re missing here is the NVLink part. The fact that you can get a small cluster of these linked up like that for 400k, all wrapped in a box, makes HPC quite a bit more accessible. Even 5 years ago, if you wanted to run a regional sized weather model at reasonable resolution, you needed to have some serious funding (say, nation states or oil / insurance companies). Nowadays you could do it with some angel investment and get one of these Nvidia boxes and just program them like they’re one GPU.


Yep, we generally care about growing a few bandwidth #'s over current:

- GPU<>CPU/RAM

- GPU<>storage

- GPU<>network

(- GPU<>GPU bandwidth is already insane, as is GPU compute speed)

In the above, these matter for cases like logs, where there's ~infinite off-GPU data (S3, storage, ...), yet the current PCIe-through-the-CPU path is like a tiny straw clogging it all.

It's now ~easy to do stuff like regex search on GPU, so systems being redesigned to quickly shove 1TB through a python 1-liner is awesome.

To get a feel for where all this is in practice, I did a fun talk w/ the pavilion team for this year's GTC on building graphistry UIs & interactive dashboards on top of this: https://pavilion.io/nvidia/

Edit: A good search term here is 'GPU Direct Storage', which is explicitly about skipping the CPU bandwidth indirection & performance handcuffs. Tapping directly into the network or storage is super exciting for matching what the compute tier can do!


Critically it's CPU to GPU NVLink here, not the "boring" GPU to GPU NVLink that's common on Quadros. 500GB/s bandwidth between CPU & GPU massively changes when & how you can GPU accelerate things, that's a 10X difference over the status quo.


Also "CPU->CPU" NVLink is interesting. Though it was my understanding that NVLink is point-to-point, and it would require some massive switching system to be able to access any node in the cluster anywhere near that rate without some locality bias (i.e. nodes on the "first" downstream switch are faster to access and see less contention).


Fortunately NVSwitch already exists so they should be able to reuse that for CPUs.


I'm not missing it, I'm saying that the part other than NVLink isn't the important part!


The Nvidia A100 80GB already provides 2 TB/s mem BW today. Also using HBM2e.


The RTX 3090 also does 936 GB/s, which is very close to the Radeon VII, but with conventional GDDR.


The fact that they are using a Neoverse core licensed from ARM seems to imply that there won't be another generation for NVidia's Denver/Carmel microarchitectures. Somewhat of a shame, because those microarchitectures were unorthodox in some ways, and it would have been interesting to see where that line of evolution would have led.

I believe this leaves Apple, ARM, Fujitsu, and Marvell as the only companies currently designing and selling cores that implement the ARM instruction set. That may drop to 3 in the next generation, since it's not obvious that Marvell's ThunderX3 cores are really seeing enough traction to be worth the non-recurring engineering costs of a custom core. Are there any others?


Designing but not yet selling Qualcomm / Nuvia?


Yeah will be interesting to see if and when they bring a design to market.


The ThunderX3 team is mostly gone, it's been hollowed out.


I hope they make workstations. I want to see some competition for the eventual Apple Silicon Mac Pro.


I think Apple did Arm an unbelievable favor by absolutely trouncing all CPU competitors with the M1. By being so fast, Apple's chip attracts many new languages and compiler backends to Arm that want a piece of that sweet performance pie. Which means that other vendors will want to have Arm offerings, and not, e.g., RISC-V.

I have no idea what Apple's plans for the M1 chip are, but if they had manufacturing capacity, they could put oodles of these chips into datacenters and workstations the world over and basically eat the x86 high-performance market. The fact that the chip uses so little power (15W) means they can absolutely cram them into servers where CPUs can easily consume 180W. That means 10x the number of chips for the same power, and not all concentrated in one spot. A lot of very interesting server designs are now possible.


I think you are half right in the sense that people now know Intel architectures are not what they want/need. RISC-V chipsets will take a bit longer to mature but can in principle do the same kinds of things that Apple is doing with the M1 to keep energy usage low and throughput high. However, the key selling feature of RISC-V is reduced IP licensing needs (cost).

With Nvidia buying Arm and producing their own chipsets, that's no small advantage over companies that are not Nvidia (or Apple, who have a perpetual license already). If I were Intel, that's what I'd be looking at right now. Same for perhaps AMD. The clock is ticking on their x86-only strategy, and it takes time to develop new architectures, even if you license somebody else's instruction set.

A counter argument to this would be software compatibility. Most of the porting effort to make Linux, Windows, and macOS run on Arm already happened years ago. It's a mature software ecosystem. Software is actually the hardest part of shipping new hardware architectures. Without that, hardware has no value.

And a counter argument to that is that Apple is showing instruction set emulation actually works reasonably well: it is able to run x86 software at reasonable performance on the M1. So running natively matters less these days. If you look at QEMU, they have some interesting work going on around e.g. the emulated GPU, where the goal is not to emulate some existing GPU but to create a virtual-only GPU device called Virgil 3D that can run efficiently on just about anything that supports OpenGL. Don't expect to set fps records, of course. The argument here is that the software ecosystem is increasingly easy to adapt to new chip architectures, as a lot of stuff does not require access to bare metal. Google uses this strategy with Android: native compilation happens (mostly) just in time after you ship your app to the app store.


It's hard to imagine that until a few months ago it was very difficult to get a decent Arm desktop / laptop. I imagine lots of developers working now to fix outstanding Arm bugs / issues.


While I'm sure lots of projects have actual ARM-related bugs, there was a whole class of "we didn't expect this platform/arch combination" compilation bugs that have seen fixes lately. It's not that the code has bugs on ARM; a lot of OSS has been compiling on ARM for a decade (or more) thanks to Raspberry Pis, Chromebooks, and Android, but build scripts didn't understand "darwin/arm64". Back in December installing stuff on an M1 Mac via Homebrew was a pain, but it's gotten significantly easier over the past few months.

But a million (est) new general purpose ARM computers hitting the population certainly affects the prioritizing of ARM issues in a bug tracker.


When Itanium was newborn, HP enlisted my employer, Progeny, to help port applications to ia64 Linux.

Despite the fact that 64-bit Linux had been running successfully on DEC Alpha systems for years, we ran into no end of difficulty because pointers were truncated all over the place, which apparently hadn't mattered on Alpha systems.

It seems like it must have been an Endian issue, but after 20 years my memories are basically toast. I just know nearly every bug we found was pointer truncation.
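
For anyone who hasn't hit this class of bug, the shape of it was roughly the following (a minimal sketch, nothing from the actual port):

    #include <cstdio>
    #include <cstdlib>

    int main() {
        void* p = std::malloc(64);

        // Classic LP64 truncation bug: stashing a pointer in a 32-bit int
        // silently drops the upper 32 bits of the address.
        int handle = (int)(long long)p;

        // Converting back sign-extends the truncated value.
        char* back = (char*)(long long)handle;

        // If the heap happens to live in the low 4GB (or sign extension
        // happens to reproduce the original upper bits), back == p and the
        // bug goes unnoticed; if the addresses don't cooperate, `back` no
        // longer matches `p` and the program blows up on use.
        std::printf("%p -> %p\n", p, (void*)back);
        std::free(p);
    }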


Possibly that 32-bit values on the Alpha were automatically sign-extended to 64 bits? https://devblogs.microsoft.com/oldnewthing/20170807-00/?p=96...


Ah, fascinating. That seems a likely suspect.


> compiler backends to Arm that want a piece of that sweet performance pie

How many compilers didn't support ARM?


A lot of hobbyist ones, for example. But even for mainstream compilers, arm has been a second-class citizen where developers would not necessarily test on arm. E.g. I used to work on V8, and we had partners at ARM who would help support the 32- and 64-bit ports. While I often did go ahead and port my changes to arm, it wasn't always required, as they could sometimes do the heavy lifting and debugging for us. We didn't have arm hardware on our desks to test; V8 literally has its own CPU simulators built into it, just for running the generated code from its own JITs. We had good regression testing infrastructure, but there is nothing quite like having first-class, on-desk hardware to test with, preferably to develop directly on.


They are licensing ARM cores, which as of now cannot compete with Apple silicon.

While they are using some future ARM core, and I've read rumors that future designs might try to emulate what has made Apple's cores successful, we cannot say whether Apple's designs will stagnate or continue to improve at the current rate.

There is potential for competition from Qualcomm after their Nuvia acquisition though.


Maybe not in single-threaded performance, but Apple has no server-grade parts. Ampere, for example, is shipping an 80-core ARM N1 processor that puts out some truly impressive multithreaded performance. An M1 Mac is an entirely different market - making a fast 4+4 core laptop processor doesn't necessarily translate into making a fast 64+ core server processor.


To be honest it does though. You could take 10 M1 chips (40+40 cores, with around 30TFLOPS of GPU) put them into a server and even at full load you would be at 150W, which is about half of the high core count Xeons. Obviously not as simple as that, but the thermal fundamentals are right.

The 40 core Xeon also costs around 10k.

There are rumors that the new iMac will have a 20-core M1 (16+4). I imagine that will be faster than even the top-line $10k Xeon.

I have absolutely no doubt apple could put together a server based on the M1 which would wipe the floor with Intel if they wanted to. But I very much doubt they will since it is so far out of their core competencies these days.



> You could take 10 M1 chips (40+40 cores, with around 30TFLOPS of GPU) put them into a server

Not really, part of why the ARM chips are so good is that the memory bandwidth is so fast. With 40+40 cores you're going to have at least NUMA to contend with, which always hampers multithreaded performance.


If they could easily do that (a competitor for Xeon) they would do it; it's a huge and very stable market, and there is no reason to ignore it if you have such a big advantage.


Other than the fact that Apple ignores markets all the time that it is just not interested in.


What do you mean ARM cores can't compete with Apple silicon? "Apple silicon" are ARM cores.


He means cores made by ARM, not cores implementing the ARM ISA. Currently, the cores designed by ARM cannot touch the Apple M1.


Apple Silicon is compatible with the ARM instruction set but they are not "just ARM cores" in their internal design.


It seems weird to me to say that arm cores can't compete with apple silicon given that apple doesn't own fabs. They are using arm cores on TSMC silicon (exactly the same as this).


> They are using arm cores on TSMC silicon (exactly the same as this)

No, the Apple Silicon chips use the Arm _instruction set_, but they do not use Arm's core designs. Apple designs their cores in house, much like Qualcomm does with Snapdragon. Both of these companies have an architectural license which allows them to do this.


Qualcomm no longer makes their own cores - they just use ARM reference IP designs since the Kryo.

That will probably change with their Nuvia acquisition.


You probably mean less powerful than this, but they do: https://www.nvidia.com/en-us/deep-learning-ai/solutions/work....


Yes they make workstations, but they don't make ARM workstations. Yet. They already have ARM chips they could use for it, but they went with x86 instead despite the fact that they have to purchase the x86 chips from their direct competitor. Also, yes, less than $100k starting price would be nice.


It'd be interesting to know if NVidia are going for an ARMv9 core, in particular if they'll have a core with an SVE2 implementation.

It may be they don't want to detract from focus on the GPUs for vector computation so prefer a CPU without much vector muscle.

Also interesting that they're picking up an arm core rather than continuing with their own design. Something to do with the potential takeover (the merged company would only want to support so many micro-architectural lines)?


This has got me wondering whether an Nvidia owned Arm could limit SVE2 implementations so as not to compete with Nvidia's GPU. That would certainly be the case for Arm designed cores - not a desirable outcome.


I doubt it, it's not like the market for acceleration is stagnant and saturated and they need to steal some marketshare points from one side to help the other.

It's all greenfield and growing so far, they'll win more by having the very best products they can make on both sides.


You'd think. But it wouldn't be the first time a new product is hampered to not slightly theoretically cannibalize an existing product family.


They have said clearly that the core is licensed from ARM and is one of the future Neoverse models.

There was no information on whether it will have a good SVE2 implementation. On the contrary, they stressed only the integer performance and the high-speed memory interface.


Here's Anandtech's article on the previous Neoverse V1/N2 announcement: https://www.anandtech.com/show/16073/arm-announces-neoverse-... Arm weren't saying anything official, but Anandtech did a little digging and reckons the V1 is SVE (v1) on Armv8, and the N2 could be Armv9 with SVE2.

I'd suspect NVidia would be using the V1 here as it's the higher-performing core, but no way to be certain.


Neoverse V1 has SVE, Neoverse E or N do not.

"E" is efficiency, N is standard, V is high-speed. IIRC, N is the overall winner in performance/watt. Efficiency cores have the lowest clock speed (overall use the least amount of watts/power). V purposefully goes beyond the performance/watt curve for higher per-core compute capabilities



I think they will use SVE2 because I assume they'll need to perform vector reads/writes to NVLink connected peripherals to reach that 900GB/s GPU-to-CPU bandwidth metric they described.


So is ARM the future at this point? After seeing how well Apple's M1 performed against a traditional AMD/Intel CPU, it has me wondering. I used to think that ARM was really only suited for smaller devices.


The next decade is ARM's for the taking, but if Intel and AMD can make good cores then it's not anywhere close to a slam dunk.

One of the reasons the M1 is good is, pure and simple, that it has a pretty enormous transistor budget; it's not solely because it's ARM.


Being ARM has something to do with it. The x86 instruction decoder may be only about ~5% of the die, but it's 5% of the die that has to run all the time. Think about how warm your CPU gets when you run e.g. heavy FPU loads and then imagine that's happening all the time. You can see the power difference right there.

It's also very hard to achieve more than 4X parallelism (though I think Ice Lake got 6X at some additional cost) in decode, making instruction level parallelism harder. X86's hack to get around this is SMT/hyperthreading to keep the core fed with 2X instruction streams, but that adds a lot more complexity and is a security minefield.

Last but not least: ARM's looser default memory model allows for more read/write reordering and a simpler cache.

ARM has a distinct simplicity and low-overhead advantage over X86/X64.
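
To make the memory-model point concrete, here's the classic message-passing pattern as a minimal C++ sketch (illustrative only; whether you actually observe the reordering depends on the compiler, the core, and luck):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<int> flag{0};

    void producer() {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_relaxed);   // deliberately no release
    }

    void consumer() {
        if (flag.load(std::memory_order_relaxed) == 1) {   // deliberately no acquire
            int d = data.load(std::memory_order_relaxed);
            // On x86 the hardware keeps stores ordered with stores and loads
            // ordered with loads, so (compiler reordering aside) d is 42 here.
            // Arm's weaker model lets the hardware reorder either pair, so d
            // can legitimately be 0 unless you upgrade to release/acquire.
            if (d == 0) std::puts("observed reordering");
        }
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

Incidentally, that stronger default ordering is also what the M1's x86-emulation mode reportedly switches on for Rosetta.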


The x86 decoder is not running all the time; the uop cache and the LSD exist precisely to avoid this. With instructions fed from the decoders you can only sustain 4 instructions per cycle, while to get to 5 or 6 your instructions need to be coming from either the uop cache or the LSD. In the case of Zen 3, the cache can deliver 8 uops per cycle to the pipeline (but the overall throughput is limited elsewhere at 6)!

Furthermore, the high-performance ARM designs, starting with the Cortex-A77, started using the same trick---the 6-wide execution happens only when instructions are being fed from the decoded macro-op cache.


The decoder might not be running strictly all the time, but I would wager that for some applications at least it doesn't make much of a difference. For HPC or DSP or whatever, where you spend a lot of time in relatively dense loops, the uop cache is probably big enough to ease the strain on the decoder, but for sparser code (compilers come to mind: lots of function calls and memory-bound work) I wouldn't be surprised if it didn't help as much.

I have vTune installed so I guess I could investigate this if I dig out the right PMCs


I agree; compiler-type code will miss the cache most of the time. A simple test with clang++ compiling some nontrivial piece of C++:

                 0      lsd_uops                                                    
     1,092,318,746      idq_dsb_uops                                                  ( +-  0.49% )
     4,045,959,682      idq_mite_uops                                                 ( +-  0.06% )
The LSD is disabled in this chip (Skylake) due to errata, but we can see only 1/5th of the uops come from the uop cache. However, the more relevant experiment in terms of power is how many cycles the cache is active instead of the decoders:

                 0      lsd_cycles_active                                           
       378,993,057      idq_dsb_cycles                                                ( +-  0.18% )
     1,616,999,501      idq_mite_cycles                                               ( +-  0.07% )
The ratio is similar: the regular decoders are idle only around 1/5th of the time.

In comparison, gzipping a 20M file looks a lot better:

                 0      lsd_cycles_active                                           
     2,900,847,992      idq_dsb_cycles                                                ( +-  0.07% )
       407,705,985      idq_mite_cycles                                               ( +-  0.33% )


The LSD would have to be handling at least half the instruction stream for this to make a big dent, and it doesn't.

Forget Bitcoin mining... how many tons of CO2 are released annually decoding the X86 instruction set?


How can you run 8 instructions at the same time if you only have 16 general-purpose registers? You'd have to either be doing float ops or constantly spilling. So in integer code, how many of those instructions are just moving data between memory and registers (push/pop)?

I’d say ARM has a big advantage for instruction level parallelism with 32 registers.


Register renaming for a start, and this is about decoding not execution


Okay, fair. But the bigger subject is the inherent performance advantage of the architecture. You don't just want to decode many instructions per cycle, you also want to issue them. So decode width and issue width are related.

And it seems to me that ARM has an advantage here. If you want to execute 8 instructions in parallel, you gotta actually have 8 independent things that need to get executed. I guess you could have a giant out-of-order buffer, and include stack locations in your register renaming scheme, but it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent. Which is much easier if you have more registers - the compiler can then help the CPU keep all those execution units fed.


In practice, it appears that even though Apple is using the ARM instruction set, they are still relying on truly massive reorder buffers.


You seem to have several fairly fundamental misunderstandings about CPUs at a low level.

> include stack locations in your register renaming scheme

Registers aren't related to the stack. "The" stack is just RAM being accessed in a specific cache friendly pattern, with additional optimizations (if you use specific registers) from the hardware in the form of the stack engine. The compiler explicitly loads and stores to and from the registers named by the ISA. Register renaming has absolutely nothing to do with the stack.

When the CPU can tell that a later instruction doesn't depend on the previous value of a register, it's free to rename it. The result is that two independent registers get used even though only one was ever directly referenced. In reality, there are a _huge_ number of registers available on modern processors. Estimates place Skylake, Zen, and Cortex-X1 at 200+, with the M1 at 600+. The ISA just doesn't provide a way to access them directly. (If you want to read about this, the term to look up is reorder buffer.)

Also, there is a giant out of order buffer for stores waiting to be written back to L1. That buffer does indeed have to keep track of cache locations, which directly map to memory addresses, which sometimes happen to refer to stack locations. So in a sense, what you suggested already exists. (If you want to read about this, the term to look up is store buffer.)

> it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent

That would indeed make things simpler in some cases. However, many operations such as loading a value into a register (e.g. mov reg, [addr]) or zeroing it (e.g. xor eax, eax) explicitly break the dependency chain by definition. Cases where the CPU fails to properly account for this are documented as false dependencies.

> the compiler can then help the cpu keeping all those instruction units fed

The "compiler handles ordering" thing was tried with Itanium. It seems it didn't go so well.

The CPU is free to simultaneously load two different pieces of data into the "same" register and execute two independent instruction streams on that "single" register thanks to renaming. Speculative execution helps when the CPU can't be completely certain that there isn't a dependency.

For particularly complicated sequences, the compiler spilling due to running out of named registers could indeed pose an issue. However, the CPU is free to elide a store followed by a load if it determines that the address is the same. (If you want to read about this, terms to look up include store-to-load forwarding and load-hit-store.)


If you elide a store followed by a load, you can effectively treat memory as registers and include them in your renaming scheme.

I know Itanium didn't work - but that's because there the compiler is supposed to do all the reordering work. That's different from allowing the compiler to explicitly define that instructions are independent by having more registers.


The operations are somewhat different though. Store-to-load forwarding is more complicated and doesn't completely eliminate the operation, it just significantly reduces the cycle count when successful.

Although apparently Zen 2 changed this and can pull off zero latency. (https://www.agner.org/forum/viewtopic.php?t=41)

Some general background: (https://travisdowns.github.io/blog/2019/06/11/speed-limits.h...)


Let's just pick a simple example, for an inner loop:

    a = m[i+1] + b
    c = m[i+3] + c
    e = m[i+7] + d
assume you only have 3 registers, in a RISC-y architecture. Every statement becomes something like:

    r1 = *pb      // load b
    r2 = r0[1]    // m[i+1]
    r1 = r1 + r2  // a = m[i+1] + b
    *pa = r1      // store a
Since all registers are used, and all but two instructions are dependent, in the assembly the blocks have to follow one another. There's also spilling of the b, c, d variables; they have to be loaded back into registers (which could be elided). Assuming no reorder buffer, these instructions run in three cycles (the first two are independent) - even though the top-level statements are independent.

If you want to run all statements with 4 instructions at a time, you need to have a reorder buffer that covers the whole sequence (12 instructions). (Imagine if b, c, d get modified inside the inner loop and spilled to memory: you have to track memory locations in order to do register renaming.)

Now let's assume you have 6 registers. Now all variables fit in registers, and the compiler can easily interleave the code, giving a sequence of 3 or 4 independent instructions at a time. If you want to run 4 instructions at the same time, you need no reorder buffer.

This is a kind of specific example, but it shows that if you have more registers (i.e. ARM vs x86), the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer. Or, with the same size reorder buffer, it's easier to find more independent instructions and keep all the execution units fed. Or, when jumping to some code that's not in the pipeline or icache, it allows more instructions to run in parallel sooner, when only a small number of instructions are decoded and in the reorder buffer.


I really don't see what you're getting at here. Even limited to only three named registers I don't think the example you provided would pose an issue on x86. (I'm not very familiar with ARM but I don't think it would pose any issue there either.)

In practice, x86_64 works just fine for HPC number-crunching code. Outside of some serious number crunching, when are you going to have more live values than named registers, have instruction streams whose output depends on _all_ of those values (which is why they would be live), and also have those streams complete so quickly that you stall on the next set of loads? And you have absolutely no other useful work to do? Honestly I think you're being silly.

Historically, I understand that the 32 bit version of x86 did have scheduling challenges surrounding function calls. The 64 bit version of the ISA expanded the number of named registers and (as far as I understand things) it largely resolved the issue.

Also note that typical hardware can sustain a surprisingly large number of loads per clock. You just need to find something useful to do while you wait for the load to complete. In case you really can't there's also SMT. Really though, the PRF and ROB are only so large.

> If you want to run 4 instructions at the same time, you need no reorder buffer.

You always need a reorder buffer if you want to achieve good performance. Among other issues, the compiler can't predict the latency for each load in advance due to caching behavior depending on the runtime state of the full computer system. I previously mentioned Itanium. It's directly relevant here.

> Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.

No. You can't just rename registers any longer. A store to memory means the memory model for the ISA gets involved. Things become significantly more complicated. The store buffer exists specifically to deal with such issues efficiently on an OoO core. Seriously, go read about it. It's astoundingly complicated for any OoO core regardless of the ISA.

> the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer

Unless I have a serious misunderstanding (I don't design hardware, so I might) everything passes through the reorder buffer. Every instruction is speculative until all previous instructions have retired. (https://news.ycombinator.com/item?id=20165289)


> x86 instruction decoder may be only about ~5% of the die

What percent of the die is an ARM instruction decoder?


Much less. x86 instruction decoding is complicated by the fact that instructions are variable-width and are byte-aligned (i.e. any instruction can begin at any address). This makes decoding more than one instruction per clock cycle complicated -- I believe the silicon has to try decoding instructions at every possible offset within the decode buffer, then mask out the instructions which are actually inside another instruction.

ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.


I totally agree with the core of your argument (aarch64 decoding is inherently simpler and more power efficient than x86), but I'll throw out there that it's not quite as bad as you say on x86, as there are some nonobvious efficiencies (I've been writing a parallel x86 decoder).

What nearly everyone uses is a 16-byte buffer aligned to the program counter being fed into the first-stage decode. This first stage, yes, has to look at each byte offset as if it could be a new instruction, but doesn't have to do a full decode. It only finds instruction length information. From there you feed this length information in and do a full decode on the byte offsets that represent actual instruction boundaries. That's how you end up with x86 cores with '4-wide decode' despite needing to initially look at each byte.

Now for the efficiencies. Each length decoder for each byte offset isn't symmetric. Only the length decoder at offset 0 in the buffer has to handle everything, and the other length decoders can simply flag "I can't handle this", and the buffer won't be shifted down past where they were on the next cycle and the byte 0 decoder can fix up any goofiness. Because of this, they can

* be stripped of instructions that aren't really used much anymore, if that helps them

* can be stripped of weird cases like handling crazy usages of prefix bytes

* don't have to handle instructions bigger than their portion of the decode buffer. For instance, a length decoder starting at byte 12 can't handle more than a 4-byte instruction anyway, so that can simplify its logic considerably. That means the simpler length decoders end up feeding into the full-decoder selection higher up the stack, so some of the overhead cancels out in a nice way.

On top of that, I think that 5% includes pieces like the microcode ROMs. Modern ARM cores almost certainly have (albeit much smaller) microcode ROMs as well to handle the more complex state transitions.

Once again, totally agreed with your main point, but it's closer than what the general public consensus says.
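
To make the two-stage split concrete, here's a toy software sketch (nothing like the real hardware; toy_length_at stands in for the per-offset length decoders and uses a made-up length field instead of real x86 encoding):

    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Stage 1 stand-in: a per-offset length decoder. Real x86 length decoding
    // is far messier; this toy "encoding" just keeps length-1 in the low two
    // bits of the first byte so the sketch runs.
    int toy_length_at(const std::array<std::uint8_t, 16>& buf, int off) {
        return (buf[off] & 0x3) + 1;   // 1..4 bytes
    }

    // Stage 2: only offsets reachable by chaining lengths from offset 0 are
    // real instruction starts, and only those get handed to the full decoders.
    std::vector<int> instruction_starts(const std::array<std::uint8_t, 16>& buf) {
        int len[16];
        for (int off = 0; off < 16; ++off)       // in hardware: all 16 in parallel
            len[off] = toy_length_at(buf, off);

        std::vector<int> starts;
        int off = 0;
        while (off < 16 && off + len[off] <= 16) {
            starts.push_back(off);               // boundary -> full-decode slot
            off += len[off];
        }
        return starts;                           // the buffer shifts to `off` next cycle
    }

    int main() {
        std::array<std::uint8_t, 16> buf{};      // all zero bytes -> 16 one-byte "instructions"
        for (int s : instruction_starts(buf)) std::printf("%d ", s);
        std::printf("\n");
    }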


I wonder whether a modern byte-sized instruction encoding would sort of look like Unicode, where every byte is self synchronizing... I guess it can be even weaker than that, probably only every second or fourth byte needs to synchronize.


Honestly, I think modern (meaning wide, multiple instruction decoders, and designed today without back compat concerns) and byte-sized are sort of mutually exclusive. Most of those ISAs were designed around 8-bit data buses, and having simple ops only consume a single memory read cycle was pretty paramount to competitive performance. Without that constraint, there's probably better options.

IMO, you would either go towards bit-aligned instructions like the iAPX 432 or the Mill, or 16-bit-aligned variable-width instructions like the S/360 and m68k on the CISC side, and ARM Thumb and RV-C on the RISC side.

That being said, you're definitely thinking about it the right way. Modern I-stream-bandwidth-conscious ISAs absolutely (and perhaps unsurprisingly) look at the problem from a constrained, poor man's Huffman-encoding perspective, similar to how UTF-8 was conceived.


Interestingly, Thumb-2 was dropped when going from Arm32 to Arm64. Perhaps the encoding was getting really complicated, would've been even harder with 32 registers, and wouldn't have saved a lot of memory (if many instructions use 4 bytes anyway).

Maybe one could come up with an instruction encoding that encodes some number of instructions per cache line. Every time the CPU jumps to a new instruction (at cache line address + index), the whole cache line needs to be loaded into the icache anyway, and could get decoded then; internally they get represented in microcode anyway.


> x86 instruction decoding is complicated by the fact that instructions are variable-width and are byte-aligned (i.e. any instruction can begin at any address).

This is also not a good security property since it means you can hide secret instructions in a program by jumping into the middle of innocuous ones.

> ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.

Also, A64 doesn't have a Thumb equivalent, and supporting A32/T32 is optional.


This is why I said it's ARM's for the taking.

I'm not familiar with how ARM's memory model affects the cache design - source?


>…is pure and simple that it has a pretty enormous transistor budget

There's a lot of brute force, yes, but it's not the only reason. There are lots of smart design decisions as well.


"One of the reasons" I did say.


True, I misread it.


Yes, but those decisions optimize for the single user laptop case, not for e.g. servers.


It really comes down to how well they can emulate X86. People aren't going to give up access to 3 decades of Windows software.


I'm sure ARM has already taken over from x86 if you use a wider definition of personal computers. And a lot of people have already given up access to 3 decades of Windows software by using their phone or tablet as their main device.

Plus, most software from the last decade runs on some sort of VM or another (be it the JVM, the CLR, a JavaScript engine, or even LLVM).

Soon (in years), x86 will only be needed by professionals that are tied to really old software. And those particular needs will probably be satisfied by decent emulation.


> Soon (in years), x86 will only be needed by professionals that are tied to really old software.

There are also the PC & console gaming markets, which are not small and have not made any movements of any kind towards ARM so far.


Except of course the Nintendo Switch and the handheld devices like the 3DS. They are all ARM-based.


LLVM isn't actually a VM, it's a compiler IR with good marketing. LLVM programs are architecture specific, although of course ARM64 and x86-64 are pretty similar.


I've seen things like this a lot, and it's a bit confusing. If parts of the M1's performance are due to throwing compute at the problem, why hasn't Intel been doing that for years? What about ARM, or the M1, allowed this to happen?


Intel has. Many M1 design choices are fairly typical for desktop x86 chips, but unheard of with ARM.

For example, the M1 has 128-bit wide memory. This has been standard for decades on the desktop (dual channel), but unheard of in cellphones. The M1 also has similar amounts of cache to the new AMD and Intel chips, but that's several times more than the latest Snapdragon. Qualcomm also doesn't just design for the latest node. Most of their volume is on cheaper, less dense nodes.
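
Back-of-envelope on those numbers (assuming LPDDR4X-4266 on the M1, which is my assumption, not something stated above):

    #include <cstdio>

    int main() {
        // bandwidth = transfers/s * bus width in bytes
        double m1_gbs   = 4266e6 * (128.0 / 8) / 1e9;   // ~68 GB/s, 128-bit LPDDR4X-4266
        double ddr4_gbs = 3200e6 * (128.0 / 8) / 1e9;   // ~51 GB/s, desktop dual-channel DDR4-3200
        std::printf("M1 ~%.1f GB/s vs desktop DDR4-3200 ~%.1f GB/s\n", m1_gbs, ddr4_gbs);
    }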


So from this (and some other places), it kind of seems like ARM has been competitive for a long time, but for power and temperature savings it's been fighting with one hand behind its back. That's intriguing in its own right, but I'm still confused as to what the actual differences are. Like, the M1 runs as fast as current-gen x86 processors while running cooler. How?


> Like the M1 runs as fast as current gen x86 processors, while running cooler. How?

The M1 is one "node" ahead. Apple forked out the cash to get all their chips on TSMC's 5nm process. This is about 2 years of advancement over the 7nm process AMD pays TSMC for. Intel's latest 10nm node is similarly behind TSMC 5nm.

Semiconductors are tricky. Small performance gains take large increases in power. If you play with overclocking, you'll learn power increases quadratically or even cubically with clocks. The mere "2nm" shrink may seem inconsequential, but for these iso-performance comparisons (performance at constant thermals), it is key.

All this to say, you get what you pay for. Chips can get the same performance on TSMC's 5nm node while using 70% of the power of chips on the 7nm node.[1] Compared to TSMC's 10nm (similar to Intel's popular 14nm, still in production), 5nm chips can be expected to use ~45% of the power.

Hopefully that shed some light on the M1's biggest advantage for you.

[1] https://images.anandtech.com/doci/15219/wikichip_tsmc_logic_...


Buying the majority of TSMC's 5nm process output helped. It's a combination of good engineering, the most advanced process, and intel shitting themselves I would say.


Another reason is the memory bandwidth, something like 150% of the competition's, and I'm sure there are other simple wins along those lines.

The M1 isn't necessarily a win for Arm in general. Other manufacturers weren't competing before, and it's yet to be seen if they will.


It's the memory, stupid!


Specifically, the memory -latency-.

By going on-package there are almost certainly latency advantages in addition to the much-vaunted bandwidth gains.

That's going to pan out to better perf, and likely better power usage as well.


Is the M1's memory on-package? Or is it on-die, which people keep getting wrong?


The memory and CPU are on the same package, not die.


M1 memory latency isn't noticeably lower than x86.


150% compared to what?


The latest i9 and the latest Ryzen 9, ie the competition.


Intel Tiger Lake and AMD Renoir both support 128-bit LPDDR4X at 4266 MHz. Maybe you're thinking of the desktop chips that use conventional DDR4? The M1 isn't competitive with them.


Oh those are pretty new and I haven't seen any benchmarks with LPDDR in an equivalent laptop chip. Do you have a link to any?


It will come down entirely to who can sustain a good CPU core.

Currently Apple is the only company making performance-competitive ARM cores that can make a reasonable justification for an architecture switch.

Otherwise AMD's CPUs are still ahead of everyone else, including all other ARM CPU cores not made by Apple. And even Intel is still faster in places where performance matters more than power efficiency (eg, desktop & PC gaming)


Arm's Neoverse cores are doing pretty well in the datacenter space — on AWS, the Graviton2 instances are currently the best ones for lots of use cases. It's clear that core designs by Arm are really good. The problem currently is the lag between the design being done and various vendors' chips incorporating it.

upd: oh also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now


Graviton2 is sometimes competitive with Epyc, but also falls far behind in some tests (e.g., Java performance is a bloodbath). Overall, across the majority of tests, Neoverse consistently comes up short of Milan even when Neoverse is given a core-count advantage. And critically, the per-core performance of Graviton2 / Neoverse is worse, and per-core performance is what matters in the consumer space.

But it can't just be competitive; it needs to be significantly better in order for the consumer space to care. Nobody is going to run Windows on ARM just to get equivalent performance to Windows on x86, especially not when that means most apps will be worse. That's what's really impressive about the M1, and so far it's very unique to Apple's ARM CPUs.

> oh also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now

A64FX doesn't appear to be a particularly good CPU core, rather it's a SIMD powerhouse. It's the AVX-512 problem - when you can use it, it can be great. But you mostly can't, so it's mostly dead weight. Obviously in HPC space this is different scenario entirely, but that's not going to translate to consumer space at all (and it's not an ARM advantage, either - 512bit SIMD hit consumer space via x86 first with Intel's Rocket Lake).


Not sure why you're placing so much weight on Epyc outperforming Graviton but discounting designs / use cases where Arm is clearly now better. Plus it's clear that we are just at the beginning of a period where some firms with very deep pockets are starting to invest seriously in Arm on the server and the desktop.

If x64 ISA had major advantages over Arm then that would be significant, but I've not heard anyone make that case: instead it's a debate about how big the Arm advantage is.

Can x64 remain competitive in some segments? Probably, and inertia will work in its favour. I do think it's inevitable that we will see a major shift to Arm though.


Fujitsu flying under the radar while having the fastest cpu ever made haha


So then we think about what makes Apple's M1 so good. One hard-to-replicate factor is that they designed their hardware and software together; the ops which macOS uses often are heavily optimized on-chip.

But one factor that you can replicate is colocating memory, CPU, and GPU: the system-on-chip architecture. That's what Nvidia looks to be going after with Grace, and I'm sure they've learned lessons from their integrated designs, e.g. Jetson. Very excited to see how this plays out!


> One hard-to-replicate factor is that they designed their hardware and software together; the ops which macOS uses often are heavily optimized on-chip.

Not really, they are still just using the same ARM ISA as everyone else. The only hardware/software integration magic of the M1 so far seems to be the x86 memory model emulation mode, which others could definitely replicate.

> but one factor that you can replicate is colocating memory, CPU, and GPU, the system-on-chip architecture.

AMD introduced that in the x86 world back in 2013 with their Kaveri APU ( https://www.zdnet.com/article/a-closer-look-at-amds-heteroge... ), and it's been fairly typical since then for on-die integrated GPUs on all ISAs.


Amazon's ARM chips are performance-competitive as well; for many workloads you can expect at least similar performance per core at the same clock speed.


> I used to think that ARM was really only suited for smaller devices.

The current fastest supercomputer uses ARM.

https://en.wikipedia.org/wiki/Fugaku_(supercomputer)


Apple isn't entering the cloud market. Moreover, the M1 isn't a cloud CPU. The M1 SoC emphasizes low latency and performance per watt over throughput.


AWS has the Mac mini, and is expected to add the M1 mini into the mix [1]. I expect Apple to take lots of silicon design into data centers and edge computing. Over time I can see a lot of mobile apps running their backends on Apple silicon, with a full Apple cloud software stack to provide data management around security and privacy.

1. https://9to5mac.com/2021/02/02/m1-mac-mini-in-the-cloud/


That’s the Mini. That’s not the M1.


The Mini has an M1 cpu


No, the Mac Mini available on AWS does NOT have an M1. It has an Intel Core i7.

https://aws.amazon.com/ec2/instance-types/mac/


It has been said AWS will add M1.


The instruction set doesn't make a significant difference technically; the main things about ISAs are the monopolies (patents) tied to them, and software compatibility.


I'm interested in your thoughts on why this doesn't make a significant difference. From what I've read, the M1 has a lot of tricks up its sleeve that are next to impossible on X86. For example ARM instructions can be decoded in parallel.


Instruction decoding is more power-efficient on arm, but x86 has solved it as a perf bottleneck with the trace/uop caches and by doing some speculative work in the decoders. (Parallel decoding is also old hat and not an M1 or ARM-land invention; it's trivial with a RISC-style insn format.) What other tricks do you have in mind?

More broadly, as to why the ISA doesn't make a big difference: the major differences are at the microarchitecture level, since OoO processors have such flexible dataflow machinery in them that you can kind of view the frontend as compiler technology. x86 and ARM are decades-old ISAs that have seen many, many rounds of iteration in the form of added instructions and even backwards-incompatible reboots at the 64-bit transition points, so most hindrances have been fixed.

In the olden days ISAs were important because processors were orders of magnitude simpler, and instructions were processed as-is, very statically (to the point that microarchitectural artifacts like branch delay slots were enshrined in some ISAs). This meant that e.g. the complexity of individual instructions could be a bottleneck to how fast a chip could be clocked. Or in CISC land your ISA might have been so complex that the CPU was a microcoded implementation of the ISA and didn't have any hardwired fast instructions...


> So is ARM the future at this point?

The near future. A few years out, RISC-V is gonna change everything.


ARM is the present, RISC-V is the future and Intel is the past.

The magic of Apple's M1 comes from the engineers who worked on the CPU implementation and the TSMC process.

The architecture has some impact on performance, but I think it is simplicity and ease of implementation that factor most into how well it can perform (as per the RISC idea). In that sense Intel lags for small, fast and efficient processors because their legacy architecture pays a penalty in decoding and translation (into simpler ops) overhead. Eventually designs will abandon ARM for RISC-V for similar reasons, as well as financial ones.

Really, today it's a question of who has the best implementation of any given architecture.


> Today at GTC 2021 NVIDIA announces its first CPU

Wait, Nvidia's been making ARM CPUs for years now; most memorably Project Denver.


NVIDIA called it their first “data center CPU”. Our helpful reporter simplified it to the point of being flat out wrong. Not uncommon.


I expected more from a site called VideoCardz.


Arguably, most memorably, Tegra, the CPU/GPU which powers the Nintendo Switch.


That uses a licensed ARM Cortex design under the hood.


The Tegra line included Denver and Carmel cores. Tegra was the product line; the Switch chips have their own names.


Tangent: Apple should bring back the Xserve with their M1 line, or alternatively license the M1 core IP to another company to produce a differently-branded server-oriented chip. The performance of that thing is mind-blowing, and I don't see how this would compete with or harm their desktop and mobile business.


The cheapest available Epyc (7313P) has 16 cores and dual socket systems have up to 128 cores and 256 threads. Server workloads are massively parallel, so a 4+4 core M1 would be embarrassed and Apple wouldn't want to subject themselves to that comparison.

But another reason they won't do it is that TSMC has a finite amount of 5nm fab capacity. They can't make more of the chips than they already do.


I'm thinking of a 64-core M1. It would not be the laptop chip.


A 4+4 core M1 is 16 billion transistors. Some of that is the little cores, GPU, etc., but it's not clear to me it's practical to get, say, 8x larger. That would be 128 billion transistors. As a point of comparison, NVIDIA's RTX 3090 is 28B transistors, and that's a huge, expensive chip.


I am also hoping for a return of the Xserve once Apple makes high-core-count variants of Apple Silicon for the Mac Pro. This would have several benefits. First of all, it would greatly increase the production count of that variant; it could be too expensive to make such a chip just for the Mac Pro. In any case, it should be cheaper than an equivalent Intel CPU, as Apple would not have to pay for Intel's profits. And finally, just the power savings for the vast compute centers Apple operates should mean a lot of money saved too.


How much of that performance is the on-chip memory, and how usable/scalable is that? An Xserve that is limited to one CPU and can't have more RAM would be pretty mediocre.


Looks like NVidia broke up with IBM and POWER and made their own chip.

They have interconnects from Mellanox, GPUs and their own CPUs now.

I suspect the supercomputing lists will be dominated by NVidia now.


IBM have basically hollowed out their team, so I'd say it's IBM ditching the market more than anything... our centre would not now consider POWER even though we currently have nodes.


That is certainly the trend. AMD is bringing Frontier online later this year, which might be the only counter to this.


GPU-to-CPU interface >900GB/sec NVLink 4. What kind of interconnect is that? Is that even physically realistic?


Well, PCIe 6 x16 will do 128 GB/s. Of course, the real question is how many transactions per second you get. For PCIe 6 it's about 64 GT/s per lane.

Speaking in general terms, data rate and transaction rate don't necessarily match because a transaction might require the transmitter to wait for the receiver to check packet integrity and then issue acknowledgement to the transmitter before a new packet can be sent.

Yet another case, again speaking in general terms, would be having to insert wait states to deal with memory access or other processor-architecture issues.

Simple example: on an STM32 processor you cannot toggle I/O in software at anywhere close to the CPU clock rate due to architectural constraints (including the instruction set). On a processor running at 48 MHz you can only manage a max toggle rate of about 3 MHz (toggle rate = number of state transitions per second).
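
For reference, the toggle loop in question is roughly this (CMSIS-style register access, STM32F0 device header assumed; details vary by part):

    #include "stm32f0xx.h"   // assumed device header; provides GPIOA and the BSRR register

    // Fastest software toggle of PA5: BSRR's low half sets pins, the high half resets them.
    void toggle_forever(void) {
        for (;;) {
            GPIOA->BSRR = (1u << 5);          // set PA5
            GPIOA->BSRR = (1u << (5 + 16));   // reset PA5
        }
    }

Each store has to cross the bus to the GPIO peripheral and the loop adds a branch, which is why the observed toggle rate lands in the low MHz even with a 48 MHz core clock.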


> Speaking in general terms, data rate and transaction rate don't necessarily match because a transaction might require the transmitter to wait for the receiver to check packet integrity and then issue acknowledgement to the transmitter before a new packet can be sent.

PCIe has the optional "relaxed ordering" feature, allowing new packets to be sent before the ACK has been received for preceding ones. Not sure precisely how this works, or whether there is some TCP-like window-scaling algorithm in play or not..


Well, according to [1], NVIDIA lists NVLink 3.0 as being 50 Gb/s per lane per direction, and lists the total maximum bandwidth of NVSwitch for Ampere (using NVLink 3.0) as 900 GB/s each direction, so it doesn't seem completely out of reach.

[1] - https://en.wikipedia.org/wiki/NVLink


With 50Gb/s per lane, that would be 144 lanes to reach 900GB/s. Quite impressive.


Fascinatingly, NVIDIA's own docs [1] claim GPU<->GPU bandwidth on that device of 600 GB/s (though they claim total aggregate bandwidth of 9.6 TB/s). Which would be what, 96 and 1536 lanes, respectively? That's quite the pinout.

[1] - https://www.nvidia.com/en-us/data-center/nvlink/


Depends on how big you want to make it. If they're willing to go four inches, that'd do it with existing per-pin speeds from NVLink 3.


I like the sound of a non-Apple arm chip for workstations. Given my positive experience with the M1 I'd be perfectly happy never using x86 again after this market niche is filled.


Me too. But my decades-old Steam collection isn't looking forward to it. That's one advantage of cloud gaming: it won't matter what your desktop runs on.


I don't think this will be anywhere near as good as the M1, since they are using the ARM Neoverse cores.


Apple throws a lot of transistors at their 4 performance cores in the M1 to get the performance they do - it's not clear that approach would realistically scale to a workstation CPU with 16, 32, or more cores (at least not with current fab capabilities).


Grace, in contrast, is a much safer project for NVIDIA; they’re merely licensing Arm cores rather than building their own ...

NVIDIA is buying ARM.


Trying to buy Arm.

Multiple competition investigations permitting.


There are a lot of interconnects (CCIX, CXL, OpenCAPI, NVLink, GenZ) brewing. Nvidia going big is, hopefully, a move that will prompt some uptake from the other chip makers. A 900 GB/s link, more than main memory: big numbers there. Side note: I miss AMD being actively involved with interconnects. Infinity Fabric seems core to everything they are doing, but back in the HyperTransport days it was something known, that folks could build products for and interoperate with. Not many did, but it's still frustrating seeing AMD keep its cards so much closer to the chest.


lot of downvotes. anyone want to say any reason why they think this deserves a downvote? very unclear to me. do you all just not have the historical context? what's wrong here? give me some hints why you don't get what i'm saying here.


That's something weird I have noticed about HN: sometimes perfectly reasonable comments are downvoted to hell without any reply. At least in the good ol' Slashdot days you would get the reason why you got downvoted; now... nothing.


The whole combination of AI and the name gives "watched over by machines of loving grace" a whole new twist, eh?


Real business-class features we want to know about:

Will they auto-detect workloads and cripple performance (like the mining stuff recently)? Only work through special drivers with extra licensing fees depending on the name of the building they're in (data center vs office)?


Market segmentation is practiced by every chip company that you use. Intel: ECC. AMD: ROCm. Qualcomm: cost as a percentage of the phone price.


I still think Nvidia takes it further.


Every company does market segmentation: it makes sense to have customers that want a feature pay more for it.

Still, every company does it differently.

For example, both NVIDIA and AMD compute GPUs are necessarily more expensive than gamer GPUs because of hardware costs (e.g. HBM).

However, NVIDIA gamer GPUs can do CUDA, while AMD gamer GPUs can't do ROCm.

The reason is that NVIDIA has 1 architecture for gaming and compute (Ampere), while AMD has two different architectures (RDNA and CDNA).


It's common, but only possible in a very dominant position or with competitors that are borderline colluding.


You must be the only gamer in the world that wants an HBM2e GPU for gaming that's 10x more expensive while only delivering a negligible improvement in FPS.


I'm only talking about driver/license locks, not different ram types.


Can the CDNA GPUs from AMD even connect to a monitor?

I don't think they even have display ports.

Not sure what good a "gaming driver" would do you on those cards.

Same for the opposite. Do the RDNA graphics cards even have hardware for compute? They don't even have tensor cores, so why would AMD invest money in creating a compute driver for hardware that's bad at compute?

> I'm only talking about driver/license locks,

Not "locked" is a big understatement. A driver release for some hardware needs at least some QA, so the assumption that doing this is just "free" because its software is incorrect.


> A driver release for some hardware needs at least some QA, so the assumption that doing this is free just because it's software is incorrect.

Nvidia detects mining workloads in software based on heuristics and disables them. That probably causes more support burden, not less, and took extra engineering time to implement.


> That probably causes more support burden, not less, and took extra engineering time to implement.

Given how well that turned out, I have a hard time believing they put any effort into this.


I know we're going to hear soon from the Apple haters, or those who don't like what Apple is doing (modular, upgradeable systems going away), but it seems like Apple is moving in a similar direction to Nvidia.

Apple is also, I think, going to soldered-on / close-in RAM. Nvidia looks to be doing this too: CPU / GPU / RAM all close together, and it doesn't look like there are any upgrade options. Some of the thinking was that Apple was also continuing to increase durability / reliability etc. with their RAM move.

Does anyone know the requirements for the LPDDR5X type of RAM mentioned here? Does it require soldering (you obviously get lots more control if you spec the chips yourself and solder them on)?


So is ARM the future at this point? After seeing how well Apple's M1 performed against a traditional AMD/Intel CPU, it has me wondering. I used to think that ARM was really only suited for smaller devices.


Depends; performance-wise it should be able to compete with or even outperform x86 in many areas. A big problem until now has been cross-compatibility regarding peripherals, which complicates running a common OS on ARM chips from different vendors. There is currently a standardization effort (Arm SystemReady SR) that might help with that issue, though.


Based on initial testing, AWS EC2 instances with ARM chips performed as well if not better than the Intel instances, but they cost 20% less. The only drawback that I've really encountered thus far was that it complicates the build process.


Does ARM have a uniquely complex build process, or is it the mix of architectures that makes it more difficult?


ARM is all over the place with its platforms. x86 has the benefit that most companies made it 'IBM compatible'. There are one-off x86 variants, but they are mostly forgotten at this point. The ARM CPU family itself is fairly consistent (mostly), but the included hardware is a very mixed bag. x86, on the other hand, has the history of 'build it to work like the IBM PC': everything from how things boot up, to memory address spaces, to must-have I/O, etc. An ARM system may or may not have any of that, depending on which platform you target or are creating. Things like the Raspberry Pi have changed some of that, as many boards mimic the Broadcom platform, and specifically the original Raspberry Pi's. The x86 architecture has also picked up some interesting baggage along the way because of what it is. We can mostly ignore it, but it is there. For example, you would not build an ARM board these days with an IDE interface, but some of those bits still exist in the x86 world.

ARM is more of a toolkit for building different purpose-built computers (you even see them show up in USB sticks), while x86 is a particular platform with a long history behind it. So you may see something like 'Amazon builds its own ARM computers'. That means they spun their own boards, built their own toolchains (more likely recompiled existing ones), and probably have their own OS distro to match. Each one of those is a fairly large endeavor. When you see something like 'Amazon builds its own x86 boards', they have shaved off the other two parts and are focusing on the hardware. That they are building their own means they see the value in owning the whole stack. Also, having your own distro means you usually have to 'own' building the whole thing. I can go grab an x86 gcc stack from my repo provider; they will need to act as the repo owner, build it themselves, and keep up with the patches. Depending on what has been added, that can be quite a task all by itself.


Mix of architectures and the fact that our normal CI server is still x86-based and really didn't want to do ARM builds.
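
For anyone wondering what that complication looks like in practice, here's a minimal sketch of the per-architecture branching a build script ends up doing; the compiler and file names are illustrative assumptions, not anything specific to AWS or any particular CI setup:

  # Minimal sketch of per-architecture build branching (illustrative only).
  import platform
  import subprocess

  TARGETS = {
      "x86_64": "gcc",                     # native build on the usual x86 CI host
      "aarch64": "aarch64-linux-gnu-gcc",  # GNU cross-compiler for 64-bit ARM
  }

  def build(target_arch: str, source: str = "main.c") -> None:
      """Compile `source` for `target_arch`, cross-compiling when the host differs."""
      host = platform.machine()
      compiler = TARGETS[target_arch]
      print(f"host={host} target={target_arch} compiler={compiler}")
      subprocess.run([compiler, source, "-o", f"app-{target_arch}"], check=True)

  if __name__ == "__main__":
      for arch in TARGETS:
          build(arch)      # one artifact per architecture from a single x86 box

Multiply that by linked libraries, container images, and test runners (which generally need emulation or real ARM hardware to actually execute the aarch64 artifacts), and the 20% cost saving comes with a real engineering cost attached.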


Honestly the bottom down-voted comment has it right. What AI application is actually driving demand here? What can't be accomplished now (or with reasonable expenditures) that can be accomplished by this one CPU that will be released in 2 yrs? What AI applications will need this 2 yrs from now that don't need it now?

I understand the here-and-now AI applications. But this is smelling more like Big AI Hype than Big AI need.


Huang said "We expect to see multi-trillion-parameter models by next year, and 100 trillion+ parameter models by 2023". He probably knows more about what AI applications there are than you do, and spends a large chunk of the keynote discussing many applications.
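
For a sense of scale, a rough calculation of what those parameter counts mean in memory terms, assuming 2 bytes per parameter at FP16 and ignoring optimizer state and activations (which multiply the totals several times over):

  # Rough memory footprint of the model weights alone at FP16 (2 bytes/parameter).
  BYTES_PER_PARAM = 2

  for params in (1e12, 100e12):            # 1 trillion and 100 trillion parameters
      terabytes = params * BYTES_PER_PARAM / 1e12
      gpus_80gb = params * BYTES_PER_PARAM / 80e9   # 80 GB GPUs just to hold weights
      print(f"{params:.0e} params -> {terabytes:,.0f} TB, ~{gpus_80gb:,.0f} x 80 GB GPUs")
  # 1e12 params -> 2 TB (~25 GPUs); 1e14 params -> 200 TB (~2,500 GPUs), which is
  # why CPU-to-GPU bandwidth and aggregate memory are the headline numbers here.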


"640K ought to be enough for anybody."


GPT-4 and GPT-5.


Finally news from Nvidia that really moved markets.

  Nvidia +4.68%, 
  Intel  -4.65% 
  AMD    -4.47%


I wonder how permanent this is. As an Nvidian who sells his shares as soon as they vest and who owns some Intel for diversification, I wonder if I should load up on Intel. You really can't compete with Intel's fab availability: having a great design means nothing unless you can get TSMC to grant you production capacity.


TSMC takes orders years ahead and builds capacity to match, working together with big customers. Those who pay more (price per unit and large volume) get first shot. That's why Apple is always first, followed by Nvidia and AMD, then Qualcomm.

There is pent-up demand because Intel's failure to deliver was not fully anticipated by anyone.


+1 ECC RAM


I love the name "Grace", after Grace Hopper.


There's a tendency to use first names to refer to women in professional settings or positions of political power that is somewhat infantilizing and demeaning.

I doubt anyone really deliberately sets out to be like "haha yessss today I shall elide this woman's credentials", but this is one of those unconscious gender-bias things that is commonplace in our society and is probably best to try and make a point of avoiding.

https://news.cornell.edu/stories/2018/07/when-last-comes-fir...

https://metro.co.uk/2018/03/04/referring-to-women-by-their-f...

(etc etc)

I'd prefer they used "Hopper" instead, in the same way they have chosen to refer to previous architectures by the last names of their namesakes (Maxwell, Pascal, Ampere, Volta, Kepler, Fermi, etc.). I'd see that as being more professionally respectful of her contributions.

But yes I very much like the idea of naming it after Hopper.


Perhaps you're being downvoted because it's a tangent. It's a real phenomenon, though, and an interesting one. Of course there are many things that influence which parts of someone's full name get used, and if the tendency is a problem it's a trivial one compared to all the other problems that women face, but, yes, in general it would probably be a good idea to be more consistent in this respect.

Vaguely related: J. K. Rowling's "real" full name is Joanne Rowling. The publisher "thought a book by an obviously female author might not appeal to the target audience of young boys".

There's another famous (in the UK at least) computer scientist called Hopper: Andy Hopper. So "G.B.M. Hopper", perhaps? That would have more gravitas than "Andy"!


Hopper was already reserved for an Nvidia GPU: https://en.wikipedia.org/wiki/Hopper_(microarchitecture)


Yeah, I dunno what is going on with that, I assumed that had changed if they were going to use the name "grace" for another product.

I guess I'm not sure if "Hopper" refers to the product as a whole (like Tegra) and early leakers misunderstood that, or whether Hopper is the name of the microarchitecture and "Grace" is the product, or if it's changed from Hopper to Grace because they didn't like the name, or what.

Otherwise it's a little awkward to have products named both "grace" and "hopper"...


I do not believe that referring to women by their first names is somewhat infantilizing and demeaning.

Unfortunately, at least in most Western societies, using first names is the only way to refer unambiguously to women.

According to tradition, in most Western countries women do not have their own family names, but use either the family name of their father until marriage, or the family name of their husband after that.

So while Grace is the computer scientist, Hopper is her husband and Murray is her father. Using the name Grace makes clear who is honored.

Nowadays, in many places there are laws that allow women to choose their family names or to combine the family names.

Nevertheless, the old tradition is still entrenched, so searching for a certain woman, when the last information about her is many years old, can be difficult due to unpredictable family name changes.

Ideally, a human should keep forever the family name used at birth and the parents should choose one of their family names for the children.


So to be clear, "Hopper" would unambiguously refer to Vincent Foster Hopper in this context, and not famed computer scientist Grace Hopper? Not Vincent Foster's father? What if he was adopted and began life with a different family name? Why make this distinction specifically for women, so that a last name cannot possibly refer to them?


   Ideally, a human should keep forever the family name used at birth and the parents should choose one of their family names for the children.
I prefer the Spanish way: have two family names. We have been doing it for centuries; it baffles me that other countries find it so difficult to adopt a similar system.


I feel like there's a non-zero chance they named it Grace instead of Hopper so their new architecture doesn't sound like a bug or a frog or something. You could be right, though.


Is anyone but Apple making big investments in ARM for the desktop? This is another ARM for the datacenter design.

If other companies don't make genuine investments in ARM for the desktop, there's a real chance that Apple will get a huge and difficult-to-assail application performance advantage as application developers begin to focus on making Mac apps first and port to x86 as an afterthought.

Something similar happened back in the day when Intel was the de facto king, and everything on other platforms was a handicapped afterthought.

I wouldn't want to have my desktops be 15 to 30% slower than Macs running the same software, simply because of emulation or lack of local optimizations.

So I'm really looking forward to ARM competition on the desktop.


Super-parallel ARM chips: could that not be a future thing for Nvidia or another chip manufacturer? A normal CPU die with thousands of independent cores.


That's Xeon Phi (formerly known as Larrabee), but in general this isn't that useful. Or rather, when it is useful, it's called a GPU.


Don't know if it's just me, but this product looks like a beta product for early adopters.


It's initially for two huge HPC systems. It'll be interesting to see what kind of availability it ever has to the rest of the world.


I wonder what percentage of its supported toolchain components will be proprietary.


If the next one is Jean or Ada, we'll know they took it from a Google search.


Big Data, Big AI, what's next? Big Bullshit?


Nah, that's already been here for quite a while.


I need a new video card and there are no Nvidia cards to buy; they're all bought up by miners. Will it go the same way with this card?


Currently, there are no plans for consumer-grade CPUs. Even this new CPU class is shipping in 2023.



