Given that there are essentially no architectural details here other than bandwidth estimates, and the release timeline is in 2023, how exactly does this count as "unveiling"? Headline should read: "NVidia working on new arm chip due in two years", or something else much more bland.
Not quite. The CSCS supercomputing center in Switzerland has already started receiving the hardware (https://www.cscs.ch/science/computer-science-hpc/2021/cscs-d...). Perhaps we'll see some benchmarks. For wider HPC users, it will only be available in 2023, as the article mentions.
The Alps system at CSCS will have racks with different processors, to be installed in phases. CSCS has taken delivery of the first racks with AMD EPYC processors, for non-GPU workloads. CSCS will be one of the first customers to get their hands on Grace Hopper, but they will have to wait until 2023.
Based on a future ARM Neoverse core, so basically nothing much to see here from a CPU perspective. What really stands out are those ridiculous numbers from its memory system and interconnect.
CPU: LPDDR5X with ECC at 500+ GB/s memory bandwidth. (Something Apple may dip into. R.I.P. Macs with upgradable memory.)
GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a typo.
NVLink: 500 GB/s
This will surely further solidify CUDA's dominance. I'm not entirely sure how Intel's Xe with oneAPI and AMD's ROCm are going to compete.
I think what you're missing here is the NVLink part. The fact that you can get a small cluster of these linked up like that for 400k, all wrapped in a box, makes HPC quite a bit more accessible. Even 5 years ago, if you wanted to run a regional-sized weather model at reasonable resolution, you needed to have some serious funding (say, nation states or oil / insurance companies). Nowadays you could do it with some angel investment and get one of these Nvidia boxes and just program it like it's one GPU.
Yep, we generally care about growing a few bandwidth #'s over current:
- GPU<>CPU/RAM
- GPU<>storage
- GPU<>network
(- GPU<>GPU bandwidth is already insane, as is GPU compute speed)
In the above, we're talking about cases like logs where there's ~infinite off-GPU data (S3, storage, ...), yet the current PCIe/CPU path is like a tiny straw throttling it all.
It's now ~easy to do stuff like regex search on GPU, so systems being redesigned to quickly shove 1TB through a python 1-liner is awesome.
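For anyone wondering what that looks like in practice, here's a minimal sketch using RAPIDS cuDF (assuming a CUDA-capable GPU, a RAPIDS install, and a hypothetical logs.csv; purely illustrative, not from the talk below):

    import cudf  # RAPIDS GPU dataframe library

    # Load a (hypothetical) log file straight into GPU memory.
    df = cudf.read_csv("logs.csv")

    # GPU-accelerated regex filter over the whole column: the "python 1-liner".
    errors = df[df["message"].str.contains(r"ERROR \d{3}", regex=True)]

    print(len(errors), "matching rows")

The filter itself is the easy part once the bytes are in GPU memory; the interesting work is in feeding it fast enough, which is what the bandwidth numbers above are about.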
To get a feel for where all this is in practice, I did a fun talk w/ the pavilion team for this year's GTC on building graphistry UIs & interactive dashboards on top of this: https://pavilion.io/nvidia/
Edit: A good search term here is 'GPUDirect Storage', which is explicitly about skipping the CPU bandwidth indirection & performance handcuffs. Tapping directly into the network or storage is super exciting for matching what the compute tier can do!
Critically it's CPU to GPU NVLink here, not the "boring" GPU to GPU NVLink that's common on Quadros. 500GB/s bandwidth between CPU & GPU massively changes when & how you can GPU accelerate things, that's a 10X difference over the status quo.
Also "cpu->cpu" NVLink is interesting. Though it was my understanding that NVLink is point-to-point, and would require some massive switching system to be able to access any node in the cluster anywhere near that rate without some locality bias (IE nodes on the "first" downstream switch are faster to access and less contention)
The fact that they are using a Neoverse core licensed from ARM seems to imply that there won't be another generation for NVidia's Denver/Carmel microarchitectures. Somewhat of a shame, because those microarchitectures were unorthodox in some ways, and it would have been interesting to see where that line of evolution would have led.
I believe this leaves Apple, ARM, Fujitsu, and Marvell as the only companies currently designing and selling cores that implement the ARM instruction set. That may drop to 3 in the next generation, since it's not obvious that Marvell's ThunderX3 cores are really seeing enough traction to be worth the non-recurring engineering costs of a custom core. Are there any others?
I think Apple did Arm an unbelievable favor by absolutely trouncing all CPU competitors with the M1. By being so fast, Apple's chip attracts many new languages and compiler backends to Arm that want a piece of that sweet performance pie. Which means that other vendors will want to have Arm offerings, and not, e.g., RISC-V.
I have no idea what Apple's plans for the M1 chip are, but if they had manufacturing capacity, they could put oodles of these chips into datacenters and workstations the world over and basically eat the x86 high-performance market. The fact that the chip uses so little power (15W) means they can absolutely cram them into servers where CPUs can easily consume 180W. That means 10x the number of chips for the same power, and not all concentrated in one spot. A lot of very interesting server designs are now possible.
I think you are half right, in the sense that people now know Intel architectures are not what they want/need. RISC-V chips will take a bit longer to mature but can in principle do the same kinds of things that Apple is doing with the M1 to keep energy usage low and throughput high. However, the key selling feature with RISC-V is reduced IP licensing needs (cost).
With Nvidia buying Arm and producing its own chips, that's no small advantage for companies that are not Nvidia (or Apple, which already has a perpetual license). If I were Intel, that's what I'd be looking at right now. Same perhaps for AMD. The clock is ticking on their x86-only strategy, and it takes time to develop new architectures, even if you do license somebody else's instruction set.
A counter argument to this would be software compatibility. Most of the porting effort to make Linux, Windows, and macOS run on Arm already happened years ago. It's a mature software ecosystem. Software is actually the hardest part of shipping new hardware architectures. Without that, hardware has no value.
And a counter argument to that is that Apple is showing instruction set emulation actually works reasonably well: it can run x86 software at reasonable performance on the M1. So running natively matters less these days. If you look at QEMU, there's some interesting work going on around e.g. emulated GPUs, where the goal is not to emulate some existing GPU but to create a virtual-only GPU device called Virgil 3D that can run efficiently on just about anything that supports OpenGL. Don't expect to set FPS records, of course. The argument here is that the software ecosystem is increasingly easy to adapt to new chip architectures, as a lot of stuff does not require access to bare metal. Google uses this strategy with Android: native compilation happens (mostly) just in time after you ship your app to the app store.
It's hard to imagine that until a few months ago it was very difficult to get a decent Arm desktop / laptop. I imagine lots of developers working now to fix outstanding Arm bugs / issues.
While I'm sure lots of projects have actual ARM-related bugs, there was a whole class of "we didn't expect this platform/arch combination" compilation bugs that have seen fixes lately. It's not that the code has bugs on ARM; a lot of OSS has been compiling on ARM for a decade (or more) thanks to Raspberry Pis, Chromebooks, and Android, but build scripts didn't understand "darwin/arm64". Back in December installing stuff on an M1 Mac via Homebrew was a pain, but it's gotten significantly easier over the past few months.
But a million (est) new general purpose ARM computers hitting the population certainly affects the prioritizing of ARM issues in a bug tracker.
When Itanium was newborn, HP enlisted my employer, Progeny, to help port applications to ia64 Linux.
Despite the fact that 64-bit Linux had been running successfully on DEC Alpha systems for years, we ran into no end of difficulty because pointers were truncated all over the place, which apparently hadn't mattered on Alpha systems.
It seems like it must have been an endianness issue, but after 20 years my memories are basically toast. I just know nearly every bug we found was pointer truncation.
A lot of hobbyist ones, for example. But even for mainstream compilers, ARM has been a second-class citizen where developers would not necessarily test on ARM. E.g. I used to work on V8, and we had partners at ARM who would help support the 32- and 64-bit ports. While I often did go ahead and port my changes to ARM, it wasn't always required, as they could sometimes do the heavy lifting and debugging for us. We didn't have ARM hardware on our desks to test with; V8 literally has its own CPU simulators built into it, just for running the generated code from its own JITs. We had good regression testing infrastructure, but there is nothing quite like having first-class, on-desk hardware to test with, preferably to develop directly on.
They are licensing ARM cores, which as of now cannot compete with Apple silicon.
While they are using some future ARM core, and I've read rumors that future designs might try to emulate what has made Apple's cores successful, we cannot say whether Apple's designs will stagnate or continue to improve at the current rate.
There is potential for competition from Qualcomm after their Nuvia acquisition though.
Maybe not in single-threaded performance, but Apple has no server-grade parts. Ampere, for example, is shipping an 80-core ARM N1 processor that puts out some truly impressive multithreaded performance. An M1 Mac is an entirely different market: making a fast 4+4 core laptop processor doesn't necessarily translate into making a fast 64+ core server processor.
To be honest it does though. You could take 10 M1 chips (40+40 cores, with around 30TFLOPS of GPU) put them into a server and even at full load you would be at 150W, which is about half of the high core count Xeons. Obviously not as simple as that, but the thermal fundamentals are right.
The 40 core Xeon also costs around 10k.
There's rumors that the new iMac will have a 20 core M1 (16+4). I imagine that will be faster than even the top line $10k Xeon.
I have absolutely no doubt apple could put together a server based on the M1 which would wipe the floor with Intel if they wanted to. But I very much doubt they will since it is so far out of their core competencies these days.
I have absolutely no doubt apple could produce a ridiculously good server CPU from the M1. I doubt they will actually do it though.
> You could take 10 M1 chips (40+40 cores, with around 30TFLOPS of GPU) put them into a server
Not really; part of why the ARM chips are so good is that the memory bandwidth is so high. With 40+40 cores you're going to have at least NUMA to contend with, which always hampers multithreaded performance.
If they could easily do that (a Xeon competitor), they would do it. It's a huge and very stable market; there is no reason to ignore it if you have such a big advantage.
It seems weird to me to say that arm cores can't compete with apple silicon given that apple doesn't own fabs. They are using arm cores on TSMC silicon (exactly the same as this).
> They are using arm cores on TSMC silicon (exactly the same as this)
No, the Apple Silicon chips use the Arm _instruction set_, but they do not use Arm's core designs. Apple designs its cores in-house, much like Qualcomm does with Snapdragon. Both of these companies have an architectural license which allows them to do this.
Yes they make workstations, but they don't make ARM workstations. Yet. They already have ARM chips they could use for it, but they went with x86 instead despite the fact that they have to purchase the x86 chips from their direct competitor. Also, yes, less than $100k starting price would be nice.
It'd be interesting to know if NVidia are going for an ARMv9 core, in particular if they'll have a core with an SVE2 implementation.
It may be they don't want to detract from focus on the GPUs for vector computation so prefer a CPU without much vector muscle.
Also interesting that they're picking up an arm core rather than continuing with their own design. Something to do with the potential takeover (the merged company would only want to support so many micro-architectural lines)?
This has got me wondering whether an Nvidia owned Arm could limit SVE2 implementations so as not to compete with Nvidia's GPU. That would certainly be the case for Arm designed cores - not a desirable outcome.
I doubt it, it's not like the market for acceleration is stagnant and saturated and they need to steal some marketshare points from one side to help the other.
It's all greenfield and growing so far, they'll win more by having the very best products they can make on both sides.
They have said clearly that the core is licensed from ARM and is one of the future Neoverse models.
There was no information on whether it will have a good SVE2 implementation. On the contrary, they emphasized only the integer performance and the high-speed memory interface.
Here's Anandtech's article on the previous Neoverse V1/N2 announcement: https://www.anandtech.com/show/16073/arm-announces-neoverse-... Arm wasn't saying anything official, but Anandtech did a little digging and reckons the V1 is Armv8 with SVE and the N2 could be Armv9 with SVE2.
I'd suspect NVidia would be using the V1 here as it's the higher-performing core, but there's no way to be certain.
"E" is efficiency, N is standard, V is high-speed. IIRC, N is the overall winner in performance/watt. Efficiency cores have the lowest clock speed (overall use the least amount of watts/power). V purposefully goes beyond the performance/watt curve for higher per-core compute capabilities
I think they will use SVE2 because I assume they'll need to perform vector reads/writes to NVLink connected peripherals to reach that 900GB/s GPU-to-CPU bandwidth metric they described.
So is ARM the future at this point? After seeing how well Apple's M1 performed against a traditional AMD/Intel CPU, it has me wondering. I used to think that ARM was really only suited for smaller devices.
Being ARM has something to do with it. The x86 instruction decoder may be only about ~5% of the die, but it's 5% of the die that has to run all the time. Think about how warm your CPU gets when you run e.g. heavy FPU loads and then imagine that's happening all the time. You can see the power difference right there.
It's also very hard to achieve more than 4X parallelism (though I think Ice Lake got 6X at some additional cost) in decode, making instruction level parallelism harder. X86's hack to get around this is SMT/hyperthreading to keep the core fed with 2X instruction streams, but that adds a lot more complexity and is a security minefield.
Last but not least: ARM's looser default memory model allows for more read/write reordering and a simpler cache.
ARM has a distinct simplicity and low-overhead advantage over X86/X64.
The x86 decoder is not running all the time; the uops cache and the LSD exist precisely to avoid this. With instructions fed from the decoders you can only sustain 4 instructions per cycle, while to get to 5 or 6 your instructions need to be coming from either the uops cache or the LSD. In the case of Zen 3, the cache can deliver 8 uops per cycle to the pipeline (but the overall throughput is limited elsewhere at 6)!
Furthermore, the high-performance ARM designs, starting with the Cortex-A77, started using the same trick---the 6-wide execution happens only when instructions are being fed from the decoded macro-op cache.
The decoder might not be running strictly all the time, but I would wager that for some applications at least it doesn't make much of a difference. For HPC or DSP or whatever, where you spend a lot of time in relatively dense loops, the uop cache is probably big enough to ease the strain on the decoder, but for sparser code (compilers come to mind: lots of function calls and memory-bound work) I wouldn't be surprised if it didn't make as much difference.
I have vTune installed so I guess I could investigate this if I dig out the right PMCs
The LSD is disabled in this chip (Skylake) due to errata, but we can see only 1/5th of the uops come from the uops cache. However, the more relevant experiment in terms of power is how many cycles the cache is active instead of the decoders.
How can you run 8 instructions at the same time if you only have 16 general-purpose registers? You'd have to either be doing float ops or constantly spilling. So in integer code, how many of those instructions are just moving data between memory and registers (push/pop)?
I’d say ARM has a big advantage for instruction level parallelism with 32 registers.
Okay fair. But the bigger subject is inherent performance advantage of the architecture. You don’t just want to decode many instructions per cycle, you also want to issue them. So decoding width and issuing width are related.
And it seems to me that ARM has an advantage here. If you want to execute 8 instructions in parallel, you gotta actually have 8 independent things that need to get executed. I guess you could have a giant out-of-order buffer, and include stack locations in your register renaming scheme, but it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent. Which is much easier if you have more registers: the compiler can then help the CPU keep all those execution units fed.
You seem to have several fairly fundamental misunderstandings about CPUs at a low level.
> include stack locations in your register renaming scheme
Registers aren't related to the stack. "The" stack is just RAM being accessed in a specific cache friendly pattern, with additional optimizations (if you use specific registers) from the hardware in the form of the stack engine. The compiler explicitly loads and stores to and from the registers named by the ISA. Register renaming has absolutely nothing to do with the stack.
When the CPU can tell that a later instruction doesn't depend on the previous value of a register, it's free to rename it. The result is that two independent registers get used even though only one was ever directly referenced. In reality, there are a _huge_ number of registers available on modern processors. Estimates place Skylake, Zen, and Cortex-X1 at 200+, with the M1 at 600+. The ISA just doesn't provide a way to access them directly. (If you want to read about this, the term to look up is reorder buffer.)
Also, there is a giant out of order buffer for stores waiting to be written back to L1. That buffer does indeed have to keep track of cache locations, which directly map to memory addresses, which sometimes happen to refer to stack locations. So in a sense, what you suggested already exists. (If you want to read about this, the term to look up is store buffer.)
> it seems much easier to find parallelism if a bunch of adjacent instructions are explicitly independent
That would indeed make things simpler in some cases. However, many operations such as loading a value into a register (e.g. mov reg, [addr]) or zeroing it (e.g. xor eax, eax) explicitly break the dependency chain by definition. Cases where the CPU fails to properly account for this are documented as false dependencies.
> the compiler can then help the CPU keep all those execution units fed
The "compiler handles ordering" thing was tried with Itanium. It seems it didn't go so well.
The CPU is free to simultaneously load two different pieces of data into the "same" register and execute two independent instruction streams on that "single" register thanks to renaming. Speculative execution helps when the CPU can't be completely certain that there isn't a dependency.
For particularly complicated sequences, the compiler spilling due to running out of named registers could indeed pose an issue. However, the CPU is free to elide a store followed by a load if it determines that the address is the same. (If you want to read about this, terms to look up include store-to-load forwarding and load-hit-store.)
If you elide a store followed by a load, you can effectively treat memory as registers and include them in your renaming scheme.
I know Itanium didn't work, but that's because there the compiler is supposed to do all the reordering work. That's different from allowing the compiler to explicitly define that instructions are independent by having more registers.
The operations are somewhat different though. Store-to-load forwarding is more complicated and doesn't completely eliminate the operation, it just significantly reduces the cycle count when successful.
Since all registers are used, and all but two instructions are dependent, in the assembly the blocks have to follow one another. There's also spilling of the b, c, d variables; they have to be loaded back into registers (which could be elided). Assuming no reorder buffer, these instructions run in three cycles (the first two are independent), even though the top-level statements are independent.
If you want to run all statements at 4 instructions at a time, you need a reorder buffer that covers the whole sequence (12 instructions). (Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.)
Now let's assume you have 6 registers. Now all variables fit in registers and the compiler can easily interleave the code, giving a sequence of 3 or 4 independent instructions at a time. If you want to run 4 instructions at the same time, you need no reorder buffer.
This is a kind of specific example, but it shows that if you have more registers (i.e. ARM vs x86), the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer. Or with the same size reorder buffer, it's easier to find more independent instructions and keep all the execution units fed. Or, when jumping to some code that's not in the pipeline or icache, it allows more instructions to run in parallel sooner, when only a small number of instructions have been decoded and are in the reorder buffer.
I really don't see what you're getting at here. Even limited to only three named registers I don't think the example you provided would pose an issue on x86. (I'm not very familiar with ARM but I don't think it would pose any issue there either.)
In practice, x86_64 works just fine for HPC number crunching code. Outside of some serious number crunching, when are you going to have more live values than named registers, have instruction streams whose output depends on _all_ of those values (which is why they would be live), and also those streams complete so quickly that you stall on the next set of loads? And you have absolutely no other useful work to do? Honestly I think you're being silly.
Historically, I understand that the 32 bit version of x86 did have scheduling challenges surrounding function calls. The 64 bit version of the ISA expanded the number of named registers and (as far as I understand things) it largely resolved the issue.
Also note that typical hardware can sustain a surprisingly large number of loads per clock. You just need to find something useful to do while you wait for the load to complete. In case you really can't there's also SMT. Really though, the PRF and ROB are only so large.
> If you want to run 4 instructions at the same time, you need no reorder buffer.
You always need a reorder buffer if you want to achieve good performance. Among other issues, the compiler can't predict the latency for each load in advance due to caching behavior depending on the runtime state of the full computer system. I previously mentioned Itanium. It's directly relevant here.
> Imagine if b,c,d get modified inside the inner loop and spilled into memory, you have to track memory locations in order to do register renaming.
No. You can't just rename registers any longer. A store to memory means the memory model for the ISA gets involved. Things become significantly more complicated. The store buffer exists specifically to deal with such issues efficiently on an OoO core. Seriously, go read about it. It's astoundingly complicated for any OoO core regardless of the ISA.
> the compiler can more easily interleave instructions, which can help reduce the number of instructions that need to be in the reorder buffer
Unless I have a serious misunderstanding (I don't design hardware, so I might) everything passes through the reorder buffer. Every instruction is speculative until all previous instructions have retired. (https://news.ycombinator.com/item?id=20165289)
Much less. x86 instruction decoding is complicated by the fact that instructions are variable-width and are byte-aligned (i.e. any instruction can begin at any address). This makes decoding more than one instruction per clock cycle complicated -- I believe the silicon has to try decoding instructions at every possible offset within the decode buffer, then mask out the instructions which are actually inside another instruction.
ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.
I totally agree with the core of your argument (aarch64 decoding is inherently simpler and more power efficient than x86), but I'll throw out there that it's not quite as bad as you say on x86, as there are some nonobvious efficiencies (I've been writing a parallel x86 decoder).
What nearly everyone uses is a 16-byte buffer aligned to the program counter being fed into the first-stage decode. This first stage, yes, has to look at each byte offset as if it could be a new instruction, but doesn't have to do a full decode. It only finds instruction length information. From there you feed this length information in and do full decode only on the byte offsets that represent actual instruction boundaries. That's how you end up with x86 cores with '4 wide decode' despite needing to initially look at each byte (a toy sketch of this two-stage scheme follows after the list below).
Now for the efficiencies. The length decoders at each byte offset aren't symmetric. Only the length decoder at offset 0 in the buffer has to handle everything; the other length decoders can simply flag "I can't handle this", the buffer won't be shifted down past where they were on the next cycle, and the byte-0 decoder can fix up any goofiness. Because of this, they can:
* be stripped of instructions that aren't really used much anymore, if that helps them
* can be stripped of weird cases like handling crazy usages of prefix bytes
* don't have to handle instructions bigger than their portion of the decode buffer. For instance, a length decoder starting at byte 12 can't handle more than a 4-byte instruction anyway, so that can simplify its logic considerably. That means the simpler length decoders end up feeding into the full-decoder selection higher up, so some of the overhead cancels out in a nice way.
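To make that two-stage structure concrete, here's a toy sketch in Python (a made-up fixed-prefix encoding, nothing like real x86, and written sequentially where the hardware would run stage 1 in parallel):

    WINDOW = 16  # decode window, in bytes, aligned to the fetch pointer

    def toy_length(buf, off):
        # Hypothetical cheap length decoder for a made-up ISA: an 0x0F prefix
        # byte means a 4-byte instruction, anything else is 2 bytes. Returns
        # None when this offset's decoder declines (e.g. the instruction would
        # run past the window); in real hardware the offset-0 decoder handles
        # everything and fixes these cases up on the next cycle.
        if off >= len(buf):
            return None
        length = 4 if buf[off] == 0x0F else 2
        return length if off + length <= WINDOW else None

    def decode_window(buf):
        # Stage 1: speculative length decode at every byte offset (parallel in HW).
        lengths = [toy_length(buf, off) for off in range(WINDOW)]
        # Stage 2: walk the real instruction boundaries starting at offset 0 and
        # hand only those (offset, length) pairs to the full decoders.
        starts, off = [], 0
        while off < WINDOW and lengths[off] is not None:
            starts.append((off, lengths[off]))
            off += lengths[off]
        return starts

    print(decode_window(bytes([0x0F, 1, 2, 3, 0xAA, 0x0F, 6, 7, 8] + [0] * 7)))

Note that stage 1 happily computes a length at offset 5 even though it isn't a real instruction boundary; stage 2 throws those results away, which is the same "decode everywhere, keep only the boundaries" trade-off described above.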
On top of that, I think that 5% includes pieces like the microcode ROMs. Modern ARM cores almost certainly have (albeit much smaller) microcode ROMs as well to handle the more complex state transitions.
Once again, totally agreed with your main point, but it's closer than what the general public consensus says.
I wonder whether a modern byte-sized instruction encoding would sort of look like Unicode, where every byte is self synchronizing... I guess it can be even weaker than that, probably only every second or fourth byte needs to synchronize.
Honestly, I think modern (meaning wide, multiple instruction decoders, and designed today without back compat concerns) and byte-sized are sort of mutually exclusive. Most of those ISAs were designed around 8-bit data buses, and having simple ops only consume a single memory read cycle was pretty paramount to competitive performance. Without that constraint, there's probably better options.
IMO, you would either go towards bitaligned instructions like the iAPX 432 or the Mill, or 16-bit aligned variable width instructions like the s360 and m68k on the CISC side, and ARM Thumb and RV-C on the RISC side.
That being said, you're definitely thinking about it the right way. Modern I-stream-bandwidth-conscious ISAs absolutely (and perhaps unsurprisingly) look at the problem from a constrained, poor man's Huffman encoding perspective, similar to how UTF-8 was conceived.
Interestingly, Thumb-2 was dropped when going from Arm32 to Arm64. Perhaps the encoding was getting really complicated, would have been even harder with 32 registers, and wouldn't have saved a lot of memory (if many instructions end up using 4 bytes anyway).
Maybe one could come up with an instruction encoding that encodes some number of instructions per cache line. Every time the CPU jumps to a new instruction (at cache line address + index), the whole cache line needs to be loaded into the icache anyway, and could get decoded then; internally instructions get represented as micro-ops anyway.
> x86 instruction decoding is complicated by the fact that instructions are variable-width and are byte-aligned (i.e. any instruction can begin at any address).
This is also not a good security property since it means you can hide secret instructions in a program by jumping into the middle of innocuous ones.
> ARM A32/A64 instruction decoding is dramatically simpler -- all instructions are 32 bits wide and word-aligned, so decoding them in parallel is trivial. T32 ("Thumb") is a bit more complex, but still easier than x86.
A64 doesn't have a Thumb equivalent, also, and supporting A32/T32 is optional.
I'm sure ARM has already overtaken x86 if you use a wider definition of personal computers. And a lot of people have already given up access to 3 decades of Windows software by using their phone or tablet as their main device.
Plus, most of the last decade's software runs on some sort of VM or another (be it the JVM, the CLR, a JavaScript engine, or even LLVM).
Soon (in years), x86 will only be needed by professionals that are tied to really old software. And those particular needs will probably be satisfied by decent emulation.
LLVM isn't actually a VM, it's a compiler IR with good marketing. LLVM programs are architecture specific, although of course ARM64 and x86-64 are pretty similar.
I've seen things like this a lot, and it's a bit confusing. If parts of the M1's performance are due to throwing compute at the problem, why hasn't Intel been doing that for years? What about ARM, or the M1, allowed this to happen?
Intel has. Many M1 design choices are fairly typical for desktop x86 chips, but unheard of with ARM.
For example, the M1 has 128-bit-wide memory. This has been standard for decades on the desktop (dual channel), but unheard of in cellphones. The M1 also has similar amounts of cache to the new AMD and Intel chips, but that's several times more than the latest Snapdragon. Qualcomm also doesn't just design for the latest node. Most of their volume is on cheaper, less dense nodes.
So from this (and some other places), it kind of seems like ARM has been competitive for a long time, but for power and temperature savings it's been fighting with one hand tied behind its back. That's intriguing in its own right, but I'm still confused as to what the actual differences are. Like the M1 runs as fast as current gen x86 processors, while running cooler. How?
> Like the M1 runs as fast as current gen x86 processors, while running cooler. How?
The M1 is one "node" ahead. Apple forked out the cash to get all their chips on TSMC's 5nm process. This is about 2 years of advancement over the 7nm process AMD pays TSMC for. Intel's latest 10nm node is similarly behind TSMC 5nm.
Semiconductors are tricky. Small performance gains take large increases in power. If you play with overclocking, you'll learn power increases quadratically or even cubically with clocks (dynamic power scales roughly with voltage squared times frequency, and voltage has to rise with clock speed). The mere "2nm" shrink may seem inconsequential, but for these iso-performance comparisons (performance at constant thermals), it is key.
All this to say, you get what you pay for. Chips can get the same performance on TSMC's 5nm node while using 70% of the power of chips on the 7nm node.[1] Compared to TSMC's 10nm (similar to Intel's popular 14nm still in production), 5nm chips can be expected to use ~45% of the power.
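Back-of-envelope, connecting those two quoted ratios (my arithmetic, not TSMC's numbers):

    # If 5nm needs ~70% of 7nm's power at iso-performance, and ~45% of 10nm's,
    # the implied 7nm-vs-10nm step is:
    print(0.45 / 0.70)   # ~0.64, i.e. each full node hop buys back roughly a third of the power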
Hopefully that shed some light on the M1's biggest advantage for you.
Buying the majority of TSMC's 5nm process output helped. It's a combination of good engineering, the most advanced process, and intel shitting themselves I would say.
Intel Tiger Lake and AMD Renoir both support 128-bit LPDDR4X at 4266 MT/s. Maybe you're confusing the desktop chips that use conventional DDR4? The M1 isn't competitive with them.
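For scale, rough theoretical peak bandwidth for that kind of interface (my arithmetic, ignoring real-world efficiency; as far as I know the M1's 128-bit LPDDR4X is in the same ballpark):

    # 128-bit LPDDR4X-4266, the configuration mentioned for Tiger Lake / Renoir.
    transfers_per_s = 4266e6         # 4266 MT/s
    bytes_per_transfer = 128 / 8     # 128-bit bus = 16 bytes per transfer
    print(transfers_per_s * bytes_per_transfer / 1e9)   # ~68 GB/s peak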
It will come down entirely to who can sustain a good CPU core.
Currently Apple is the only company making performance-competitive ARM cores that can make a reasonable justification for an architecture switch.
Otherwise AMD's CPUs are still ahead of everyone else, including all other ARM CPU cores not made by Apple. And even Intel is still faster in places where performance matters more than power efficiency (eg, desktop & PC gaming)
Arm's Neoverse cores are doing pretty well in the datacenter space — on AWS, the Graviton2 instances are currently the best ones for lots of use cases. It's clear that core designs by Arm are really good. The problem currently is the lag between the design being done and various vendors' chips incorporating it.
upd: oh also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now
Graviton2 is sometimes competitive with Epyc, but also falls far behind in some tests (e.g., Java performance is a bloodbath). Overall, across the majority of tests, Neoverse consistently comes up short of Milan even when Neoverse is given a core-count advantage. And critically, the per-core performance of Graviton2 / Neoverse is worse, and per-core performance is what matters in the consumer space.
But it can't just be competitive; it needs to be significantly better in order for the consumer space to care. Nobody is going to run Windows on ARM just to get equivalent performance to Windows on x86, especially not when that means most apps will be worse. That's what's really impressive about the M1, and so far is unique to Apple's ARM CPUs.
> oh also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now
A64FX doesn't appear to be a particularly good CPU core; rather, it's a SIMD powerhouse. It's the AVX-512 problem: when you can use it, it can be great, but you mostly can't, so it's mostly dead weight. Obviously in the HPC space this is a different scenario entirely, but that's not going to translate to the consumer space at all (and it's not an ARM advantage either; 512-bit SIMD hit the consumer space via x86 first, with Intel's Rocket Lake).
Not sure why you're placing so much weight on Epyc outperforming Graviton but discounting designs / use cases where Arm is clearly now better. Plus it's clear that we are just at the beginning of a period where some firms with very deep pockets are starting to invest seriously in Arm on the server and the desktop.
If x64 ISA had major advantages over Arm then that would be significant, but I've not heard anyone make that case: instead it's a debate about how big the Arm advantage is.
Can x64 remain competitive in some segments? Probably, and inertia will work in its favour. I do think it's inevitable that we will see a major shift to Arm though.
So then we think about what makes Apple's M1 so good. One hard-to-replicate factor is that they designed their hardware and software together; the ops which macOS uses often are heavily optimized on chip.
But one factor that you can replicate is colocating memory, CPU, and GPU: the system-on-chip architecture. That's what Nvidia looks to be going after with Grace, and I'm sure they've learned lessons from their integrated designs, e.g. Jetson. Very excited to see how this plays out!
> One hard-to-replicate factor is that they designed their hardware and software together; the ops which macOS uses often are heavily optimized on chip.
Not really, they are still just using the same ARM ISA as everyone else. The only hardware/software integration magic of the M1 so far seems to be the x86 memory model emulation mode, which others could definitely replicate.
> But one factor that you can replicate is colocating memory, CPU, and GPU: the system-on-chip architecture.
Amazon's ARM chips are performance-competitive as well; for many workloads you can expect at least similar performance per core at the same clock speed.
AWS has Mac minis, and is expected to add the M1 mini into the mix [1]. I expect Apple to take lots of silicon design into data centers and edge computing. Over time I can see a lot of mobile apps running their backends on Apple silicon, with a full Apple cloud software stack to provide data management around security and privacy.
The instruction set doesn't make a significant difference technically; the main things about them are monopolies (patents) tied to ISAs, and software compatibility.
I'm interested in your thoughts on why this doesn't make a significant difference. From what I've read, the M1 has a lot of tricks up its sleeve that are next to impossible on X86. For example ARM instructions can be decoded in parallel.
Instruction decoding is more power efficient on ARM, but x86 has solved it as a perf bottleneck, with the trace/uop caches and by doing some speculative work in the decoders. (Parallel decoding is also old hat and not an M1 or ARM-land invention; it's trivial with a RISC-style instruction format.) What other tricks do you have in mind?
More broadly, as to why the ISA doesn't make a big difference: the major differences are at the microarchitecture level, since OoO processors have such flexible dataflow machinery in them that you can kind of view the frontend as compiler technology. x86 and ARM are decades-old ISAs that have seen many, many rounds of iteration in the form of added instructions and even backwards-incompatible reboots at the 64-bit transition points, so most hindrances have been fixed.
In the olden days ISAs were important because processors were orders of magnitude simpler, and instructions were processed as-is very statically (to the point that microarchitectural artifacts like branch delay slots were enshrined in some ISAs). This meant that e.g. the complexity of individual instructions could be a bottleneck to how fast a chip could be clocked. Or in CISC land your ISA might have been so complex that the CPU was a microcoded implementation of the ISA and didn't have any hardwired fast instructions...
ARM is the present, RISC-V is the future and Intel is the past.
The magic of Apple's M1 comes from the engineers who worked on the CPU implementation and the TSMC process.
The architecture has some impact on performance, but I think it is simplicity and ease of implementation that factor most into how well it can perform (as per the RISC idea). In that sense Intel lags for small, fast and efficient processors because their legacy architecture pays a penalty for decoding and translation (into simpler ops) overhead. Eventually designs will abandon ARM for RISC-V for similar reasons, as well as financial ones.
Really, today it's a question of who has the best implementation of any given architecture.
Tangent: Apple should bring back the Xserve with their M1 line, or alternately license the M1 core IP to another company to produce a differently-branded server-oriented chip. The performance of that thing is mind blowing and I don't see how this would compete with or harm their desktop and mobile business.
The cheapest available Epyc (7313P) has 16 cores and dual socket systems have up to 128 cores and 256 threads. Server workloads are massively parallel, so a 4+4 core M1 would be embarrassed and Apple wouldn't want to subject themselves to that comparison.
But another reason they won't do it is that TSMC has a finite amount of 5nm fab capacity. They can't make more of the chips than they already do.
A 4+4 core M1 is 16 billion transistors. Some of that is the little cores, GPU, etc., but it's not clear to me it's practical to get, say, 8x larger. That would be 128 billion transistors. As a point of comparison, NVIDIA's RTX 3090 is 28B transistors, and that's a huge, expensive chip.
I am also hoping for a return of the Xserve once Apple makes high-core-count variants of Apple Silicon for the Mac Pro. This would have several benefits. First of all, it would greatly increase the production volume of that variant; it could be too expensive to make such a chip just for the Mac Pro. In any case, it should be cheaper than an equivalent Intel CPU, as Apple would not have to pay for Intel's margins. And finally, just the power savings for the vast compute centers Apple operates should mean a lot of money saved too.
How much of that performance is down to on-chip memory, and how usable/scalable is that? An Xserve that is limited to one CPU and can't have more RAM would be pretty mediocre.
IBM have basically hollowed out their team, so I'd say it's IBM ditching the market more than anything... our centre would not now consider POWER even though we currently have POWER nodes.
Well, PCIe 6 x16 will do 128 GB/s. Of course, the real question is how many transactions per second you get. PCIe 6 runs at 64 GT/s per lane.
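Rough arithmetic behind those numbers (my sketch, ignoring FLIT/FEC and protocol overhead):

    # PCIe 6.0: 64 GT/s per lane, roughly one bit per transfer per lane.
    per_lane_GBps = 64 / 8        # ~8 GB/s per lane, per direction
    print(per_lane_GBps * 16)     # ~128 GB/s for an x16 link, per direction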
Speaking in general terms, data rate and transaction rate don't necessarily match because a transaction might require the transmitter to wait for the receiver to check packet integrity and then issue acknowledgement to the transmitter before a new packet can be sent.
Yet another case, again, speaking in general terms, would be the case of having to insert wait states to deal with memory access or other processor architecture issues.
Simple example, on the STM32 processor you cannot toggle I/O in software at anywhere close to the CPU clock rate due to architectural constraints (to include the instruction set). On a processor running at 48 MHz you can only do a max toggle rate of about 3 MHz (toggle rate = number of state transitions per second).
> Speaking in general terms, data rate and transaction rate don't necessarily match because a transaction might require the transmitter to wait for the receiver to check packet integrity and then issue acknowledgement to the transmitter before a new packet can be sent.
PCIe has the optional "relaxed ordering" feature, allowing new packets to be sent before the ACK has been received for preceding ones. Not sure precisely how this works, if there is some TCP-like window scaling algorithm in play or not...
Well, according to [1], NVIDIA lists NVLink 3.0 as being 50 Gb/s per lane per direction, and lists the total maximum bandwidth of NVSwitch for Ampere (using NVLink 3.0) as 900 GB/s each direction, so it doesn't seem completely out of reach.
Fascinatingly, NVIDIA's own docs [1] claim GPU<->GPU bandwidth on that device of 600 GB/s (though they claim total aggregate bandwidth of 9.6 TB/s). Which would be what, 96 and 1536 lanes, respectively? That's quite the pinout.
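The lane math, back-of-envelope, using the 50 Gb/s per-lane figure quoted above:

    lane_GBps = 50 / 8            # 6.25 GB/s per lane, per direction
    print(600 / lane_GBps)        # 96   -> lanes implied by 600 GB/s GPU<->GPU
    print(9600 / lane_GBps)       # 1536 -> lanes implied by 9.6 TB/s aggregate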
I like the sound of a non-Apple arm chip for workstations. Given my positive experience with the M1 I'd be perfectly happy never using x86 again after this market niche is filled.
Me too. But my decades-old Steam collection isn't looking forward to it. That's one advantage of cloud gaming: it won't matter what your desktop runs on.
Apple throws a lot of transistors at their 4 performance cores in the M1 to get the performance they do; it's not clear that approach would realistically scale to a workstation CPU with 16, 32, or more cores (at least not with current fab capabilities).
There are a lot of interconnects (CCIX, CXL, OpenCAPI, NVLink, GenZ) brewing. Nvidia going big is, hopefully, a move that will prompt some uptake from the other chip makers. A 900 GB/s link, more than main memory: big numbers there.
Side note, I miss AMD being actively involved with interconnects. Infinity Fabric seems core to everything they are doing, but back in the HyperTransport days it was something known, that folks could build products for and interoperate with. Not many did, but it's still frustrating seeing AMD keep its cards so much closer to its chest.
Lots of downvotes. Anyone want to say why they think this deserves a downvote? Very unclear to me. Do you all just not have the historical context? What's wrong here? Give me some hints as to why you don't get what I'm saying here.
That's something weird I have noticed about HN: sometimes perfectly reasonable comments are downvoted to hell without any reply. At least in the good ol' Slashdot days you would get the reason why you were downvoted; now... nothing.
Real business-class features we want to know about:
Will they auto-detect workloads and cripple performance (like the mining stuff recently)? Only work through special drivers with extra licensing fees depending on the name of the building it's in (data center vs office)?
You must be the only gamer in the world that wants an HBM2e GPU for gaming that's 10x more expensive while only delivering a negligible improvement in FPS.
Can the CDNA GPUs from AMD even connect to a monitor?
I don't think they even have display outputs.
Not sure what good a "gaming driver" would do you on those cards.
Same for the opposite: do the RDNA graphics cards even have the hardware for compute? They don't even have tensor cores, so why would AMD invest money into creating a compute driver for hardware that's bad at compute?
> I'm only talking about driver/license locks,
Not "locked" is a big understatement. A driver release for some hardware needs at least some QA, so the assumption that doing this is just "free" because its software is incorrect.
> A driver release for some hardware needs at least some QA, so the assumption that doing this is "free" just because it's software is incorrect.
Nvidia detects mining workloads in software based on heuristics and disables them. That probably causes more support burden, not less, and took extra engineering time to implement, not less.
I know we are going to hear from the Apple haters soon, or those that don't like what Apple is doing (modular upgradeable systems going away), BUT it seems like Apple is moving in a similar direction to Nvidia.
Apple is also, I think, going to soldered-on / close-in RAM. Nvidia looks to be doing this too: CPU / GPU / RAM all close together, and it doesn't look like there are any upgrade options. Some thinking was that Apple was continuing to increase durability / reliability etc. with their RAM move.
Does anyone know the requirements for the LPDDR5X type of RAM mentioned here? Does it require soldering (you obviously get lots more control if you spec the chips yourself and solder them on)?
So is ARM the future at this point? After seeing how well Apple's M1 performed against a traditional AMD/Intel CPU, it has me wondering. I used to think that ARM was really only suited for smaller devices.
Depends; performance-wise it should be able to compete with or even outperform x86 in many areas. A big problem until now was cross-compatibility regarding peripherals, which complicates running a common OS on ARM chips from different vendors. There is currently a standardization effort (Arm SystemReady SR) that might help with that issue though.
Based on initial testing, AWS EC2 instances with ARM chips performed as well if not better than the Intel instances, but they cost 20% less. The only drawback that I've really encountered thus far was that it complicates the build process.
ARM is all over the place with its platforms. x86 has the benefit that most companies made it 'IBM compatible'. There are one-off x86 platforms, but they are mostly forgotten at this point. The ARM CPU family itself is fairly consistent (mostly), but the included hardware is a very mixed bag. x86, on the other hand, has the history of building it to work like IBM: all the way from how things boot up, memory space addresses, must-have I/O, etc. ARM may or may not have that, depending on which platform you target or are creating. Things like the Raspberry Pi have changed some of that, as many are mimicking the Broadcom platform, and specifically the Raspberry Pi one. The x86 arch has also picked up some interesting baggage along the way because of what it is. We can mostly ignore it, but it is there. For example, you would not build an ARM board these days with an IDE interface, but some of those bits still exist in the x86 world.
ARM is more of a toolkit for building different purpose-built computers (you even see them show up in USB sticks), while x86 is a particular platform with a long history behind it. So you may see something like 'Amazon builds its own ARM computers'. That means they spun their own boards, built their own toolchains (more likely recompiled existing ones), and probably have their own OS distro to match. Each one of those is a fairly large endeavor. When you see something like 'Amazon builds its own x86 boards', they have shaved off the other two parts of that and are focusing on hardware. That they are building their own means they see the value in owning the whole stack. Also, having your own distro means you usually have to 'own' building the whole thing. So I can go grab an x86 gcc stack from my repo provider; they will need to act as the repo owner, build it themselves, and keep up with the patches. Depending on what has been added, that can be quite the task all by itself.
Honestly the bottom down-voted comment has it right. What AI application is actually driving demand here? What can't be accomplished now (or with reasonable expenditures) that can be accomplished by this one CPU that will be released in 2 yrs? What AI applications will need this 2 yrs from now that don't need it now?
I understand the here-and-now AI applications. But this is smelling more like Big AI Hype than Big AI need.
Huang said "We expect to see multi-trillion-parameter models by next year, and 100 trillion+ parameter models by 2023". He probably knows more about what AI applications there are than you do, and spends a large chunk of the keynote discussing many applications.
I wonder how permanent this is. As an Nvidian who sells his shares as soon as they vest and who owns some Intel for diversification, I wonder if I should load up on Intel? You really can't compete with Intel's fab availability. Having a great design means nothing unless you can get TSMC to grant you production capacity.
TSMC takes orders years ahead and builds capacity to match working together with big customers. Those who pay more (price per unit and large volume) get first shot. That's why Apple is always first, followed by Nvidia and AMD, then Qualcomm.
There is bottled demand because Intel's failure to deliver was not fully anticipated by anyone.
There's a tendency to use first names to refer to women in professional settings or political power that is somewhat infantilizing and demeaning.
I doubt anyone really deliberately sets out to be like "haha yessss today I shall elide this woman's credentials", but this is one of those unconscious gender-bias things that is commonplace in our society and is probably best to try and make a point of avoiding.
I'd prefer they used "Hopper" instead, in the same way they have chosen to refer to previous architectures by the last names of their namesakes (Maxwell, Pascal, Ampere, Volta, Kepler, Fermi, etc). I'd see that as being more professionally respectful for her contributions.
But yes I very much like the idea of naming it after Hopper.
Perhaps you're being downvoted because it's a tangent. It's a real phenomenon, though, and an interesting one. Of course there are many things that influence which parts of someone's full name get used, and if the tendency is a problem it's a trivial one compared to all the other problems that women face, but, yes, in general it would probably be a good idea to be more consistent in this respect.
Vaguely related: J. K. Rowling's "real" full name is Joanne Rowling. The publisher "thought a book by an obviously female author might not appeal to the target audience of young boys".
There's another famous (in the UK at least) computer scientist called Hopper: Andy Hopper. So "G.B.M. Hopper", perhaps? That would have more gravitas than "Andy"!
Yeah, I dunno what is going on with that; I assumed that had changed if they were going to use the name "Grace" for another product.
I guess I'm not sure if "Hopper" refers to the product as a whole (like Tegra) and early leakers misunderstood that, or whether Hopper is the name of the microarchitecture and "Grace" is the product, or if it's changed from Hopper to Grace because they didn't like the name, or what.
Otherwise it's a little awkward to have products named both "grace" and "hopper"...
I do not believe that referring to women by their first names is somewhat infantilizing and demeaning.
Unfortunately, at least in most Western societies, using first names is the only way to refer unambiguously to women.
According to tradition, in most Western countries women do not have their own family names, but use either the family name of their father until marriage, or the family name of their husband after that.
So while Grace is the computer scientist, Hopper is her husband and Murray is her father. Using the name Grace makes clear who is honored.
Nowadays, in many places there are laws that allow women to choose their family names or to combine the family names.
Nevertheless, the old tradition is still entrenched, so searching for a certain woman, when the last information about her is many years old, can be difficult due to unpredictable family name changes.
Ideally, a human should keep forever the family name used at birth and the parents should choose one of their family names for the children.
So to be clear, "Hopper" would unambiguously refer to Vincent Foster Hopper in this context, and not famed computer scientist Grace Hopper? Not Vincent Foster's father? What if he was adopted and began life with a different family name? Why make this distinction specifically for women, so that a last name cannot possibly refer to them?
> Ideally, a human should keep forever the family name used at birth and the parents should choose one of their family names for the children.
I prefer the Spanish way, have two family names. We have been doing it for centuries, it baffles me that other countries find it so difficult to adopt a similar system.
I feel like there's a non-zero chance they named it Grace instead of Hopper so their new architecture doesn't sound like a bug or a frog or something. You could be right, though
Is anyone but Apple making big investments in ARM for the desktop? This is another ARM for the datacenter design.
If other companies don't make genuine investments in ARM for the desktop, there's a real chance that Apple will get a huge and difficult-to-assail application performance advantage as application developers begin to focus on making Mac apps first and port to x86 as an afterthought.
Something similar happened back in the day when Intel was the de facto king, and everything on other platforms was a handicapped afterthought.
I wouldn't want to have my desktops be 15 to 30% slower than Macs running the same software, simply because of emulation or lack of local optimizations.
So I'm really looking forward to ARM competition on the desktop.
Super-parallel ARM chips: could that not be a future thing for Nvidia or another chip manufacturer? A normal CPU die with thousands of independent cores.