The answer, wide decode plus a deep reorder buffer, gets much closer to the truth than the “tricks” mentioned in tweets. But that still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.
The limit that keeps you from arbitrarily scaling up these numbers isn’t transistor count. It’s delay—how long it takes for complex circuits to settle, which drives the top clock speed. And it’s also power usage. The timing delay of many circuits inside a CPU scales super-linearly with things like decode width. For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15). The delay of the issue queues is quadratic both in the issue width and the depth of the queues. The delay of a full bypass network is quadratic in execution width. Decoding N instructions at a time also requires a register renaming unit that can perform register renaming for that many instructions per cycle, and the register file must have enough ports to be able to feed 2-3 operands to N different instructions per cycle. Additionally, big, multi-ported register files, deep and wide issue queues, and big reorder buffers also tend to be extremely power hungry.
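A toy model can illustrate how a quadratic term comes to dominate as width grows. The coefficients here are invented for illustration only; they are not taken from the cited paper:

```python
# Hypothetical delay model for a full bypass network: a linear wiring
# component plus a quadratic component (every result must be forwarded
# to every execution port). Coefficients are made up for illustration.
def bypass_delay(width, c_linear=1.0, c_quad=0.5):
    return c_linear * width + c_quad * width ** 2

# Doubling width from 4 to 8 more than triples the delay in this model.
ratio = bypass_delay(8) / bypass_delay(4)
print(round(ratio, 2))  # 3.33
```

The point is only that a structure with any quadratic component eventually punishes width much harder than a linear one would, which is why widths don’t simply scale with transistor budgets.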
On the flip side, the conventional wisdom is that most code doesn’t have enough inherent parallelism to take advantage of an 8-wide machine: https://www.realworldtech.com/shrinking-cpu/2/ (“The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate.”). At the very least, such designs tend to be very application-dependent. Branch-y integer code, like compilers, tends to perform poorly on such wide and slow designs. The M1 by contrast manages to come close to Zen 3, which is already a high-ILP CPU to begin with, despite a large clock speed deficit (3.2 GHz versus 5 GHz). And the performance seems to be robust—doing well on everything from compilation to scientific kernels. That’s really phenomenal and blows a lot of the conventional wisdom out of the water.
An insane amount of good engineering went into this CPU.
> An insane amount of good engineering went into this CPU.
I agree, but let's not overblow the difficulties either.
> For example, the delay in the decode stage itself scales quadratically with the width of the decoder.
That could be irrelevant for small enough numbers, and ARM is easier to decode than x86. So this can very well be dominated by other things. What you cite seems to be only about decoding the logical registers going into the renaming structures, and even for just that tiny part it says: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."
> The delay of a full bypass network is quadratic in execution width.
Maybe if that's a problem don't do a full bypass network.
> dropping rapidly as both a function of increasing width and increasing clock rate
Good thing that the clock rate is not too high then :p
More seriously, the M1 can probably keep the beast fed because everything is dimensioned correctly (and yes, also because the clocks are not too high; but if you manage to make a wide and slow CPU that actually works well, I don't see why you would want to scale the frequency much higher, given that you would quickly consume power like crazy, and there is only limited headroom above 3.2GHz anyway). It obviously helps to have a gigantic OOO. So I don't really see where there is so much surprise, esp. since we saw the progression in the A series.
To finish, TSMC 5nm probably does not hurt. The competitors are on bigger nodes and have smaller structures. Coincidence? Or just how it has worked for decades already.
It's not completely groundbreaking, but painting it as an outgrowth of existing trends doesn't give Apple enough credit. The challenges of scaling wider CPUs within available power budgets are widely accepted: https://www.cse.wustl.edu/~roger/560M.f18/CSE560-Superscalar... (for "high-performance per watt cores," optimal "issue width is ~2"). Intel designed an entire architecture, Itanium, around the theory that OOO scaling would hit a point of diminishing returns. https://www.realworldtech.com/poulson ("Many of them were convinced that dynamic instruction scheduling and out-of-order execution would ultimately prove to be too complex and power hungry."). It is also well accepted that we are hitting limits on our ability to extract instruction-level parallelism: https://docencia.ac.upc.edu/master/MIRI/PD/docs/03-HPCProces... ("There just aren’t enough instructions that can actually be executed in parallel!"); https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp16/cs... ("Hardly more than 1-2 IPC on real workloads").
Apple being able to wring robust performance out of an 8-wide 3.2 GHz design, on a variety of benchmarks, is impressive and unexpected. For example, the M1 outperforms a Ryzen 5950X by 15%. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste.... Zen 3 is either 4- or 8-wide decode (depending on whether you hit the micro-op cache) and boosts to 5 GHz. It beats the 10900k, a 4+1-way design that boosts to 5.1 GHz, by 25%. The GCC subtest, meanwhile, is famous for being branch-heavy code with low parallelism. Apple extracting 80% more IPC from that test than AMD's latest core (which is already a very impressive, very wide core to begin with!) is very unexpected.
A lot of the conventional wisdom is based on assumptions about branch prediction and memory disambiguation, which have major impacts on how much ILP you can extract: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... To do so well, Apple must be doing something very impressive on both fronts.
The i7-1165G7 is within 20% of the M1 on single-core performance. The Ryzen 4800U is within 20% on multi-core performance. Both are sustained ~25W parts similar to the M1. If you turned x64 SMT/AVX2 off and normalized for cores (Intel/AMD 4/8 vs Apple 4+4), on-die cache (Intel/AMD 12MB L2+L3 vs Apple 32MB L2+L3), and frequency (Apple 3.2 vs AMD/Intel 4.2/4.7), you'd likely get very close results on the same 5nm or 7nm-equivalent process. Zen 3 with 2666 vs 3200 RAM alone is about a 10% difference. The M1 uses 4266 RAM IIRC.
TBH, laptop/desktop-level performance is indeed very nice to see out of ARM, after a few years of false starts by a few startups and Qualcomm. Apple designed a wider core and deserves credit for it, but wider cores have been a definite trend starting with the Pentium M vs the Pentium 4. There is a trade-off here for die area IMO: AMD/Intel have AVX2 (and even AVX-512) and SMT on each core, with narrower cores (smaller structures, higher frequency); Apple has wider cores (larger structures, lower frequency, higher IPC). It's not that simple, but it kind of is if you squint a bit.
The i7-1165G7 boosts to 4.7 GHz, 50% higher than the M1. A 75% uplift in IPC (20% more performance at 2/3 the clock speed) compared to Intel’s latest Sunny Cove architecture is enormous. Especially since Sunny Cove is itself the biggest update to Intel’s architecture since Sandy Bridge a decade ago.
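Back-of-the-envelope arithmetic behind that 75% figure, assuming (simplistically) that performance scales linearly with clock and taking the ~20% single-core gap from upthread at face value:

```python
m1_clock, intel_clock = 3.2, 4.7    # GHz (M1 vs i7-1165G7 boost)
intel_score = 1.00                  # normalized single-core score
m1_score = 1.20                     # M1 ~20% faster single-core

# Performance per GHz is a crude stand-in for IPC.
m1_per_ghz = m1_score / m1_clock
intel_per_ghz = intel_score / intel_clock
ipc_uplift = m1_per_ghz / intel_per_ghz - 1
print(f"{ipc_uplift:.0%}")  # 76%, i.e. the ~75% figure above
```

Real IPC comparisons are messier than this (boost residency, memory latency in wall-clock terms, etc.), but the rough magnitude holds.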
Like I said, this is absolutely a die-size tradeoff IMO. That 75% IPC gain is only around a ~20% difference in Geekbench at similar sustained power levels. If you want AVX2/512 + SMT, a slightly narrower core, realistically 6+ wide (up to 8-wide with a uop cache), is an acceptable tradeoff. We have seen Zen 3 go wider than Zen 1/2[1], so wider x64 designs with AVX/SMT should be coming, but this is the squinting part with TSMC 5nm vs 7nm.
Intel’s 10nm is equivalent to TSMC’s 7nm, so we’re just talking one generation on the process side. I don’t think you can chalk a 75% IPC gain to a die shrink. That’s a much bigger IPC uplift than Intel has achieved from Sandy Bridge to Sunny Cove, which happened over 4-5 die shrinks.
The total performance gain, comparing a 4.7 GHz core to a 3.2 GHz core, is 20%. But there is more to it than the bottom line. The conventional wisdom would tell you that increasing clock speed is better than making the core wider because of diminishing returns to chasing IPC. Intel has followed the CW for generations: it has made its cores modestly wider and deeper, but has significantly increased clock speed. Intel doubled the size of the reorder buffer from Sandy Bridge to Sunny Cove, and increased issue width from 5 to 6 over 10 years.
If your goal was to achieve a 20% speed-up compared to Sunny Cove, in one die shrink, the CW would be to make it a little wider and a little deeper but try to hit a boost clock well north of 5 GHz. It wouldn’t tell you to make it a third wider and twice as deep at the cost of dropping the boost clock by a third. Apple isn’t just enjoying a one-generation process advantage, but is hitting a significantly different point in the design space.
Superscalar vs super-pipelining isn't new. If there's no magic, then with perfect code going a third wider would be exactly offset by a boost clock a third lower. With SMT off, I get 25-50% more performance on single-threaded benchmarks; that's because a thread does get full access to 50% more decode/execution units in the same cycle. It's not that simple again, but that's likely the simplest example.
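The "wash with perfect code" point can be sketched with a crude throughput model in which per-cycle issue is capped by both the machine's width and the ILP the code exposes. All numbers here are illustrative, not measurements:

```python
def perf(width, clock_ghz, code_ilp):
    # instructions/sec ~ min(ILP the code exposes, machine width) * clock
    return min(code_ilp, width) * clock_ghz

# Perfect code (ILP >= 8): a third wider at a third less clock is a wash.
wide_slow   = perf(8, 3.2,  code_ilp=8.0)   # 25.6
narrow_fast = perf(6, 4.27, code_ilp=8.0)   # ~25.6

# Branchy code (ILP ~2): extra width is wasted and the higher clock wins.
wide_slow_branchy   = perf(8, 3.2,  code_ilp=2.0)   # 6.4
narrow_fast_branchy = perf(6, 4.27, code_ilp=2.0)   # ~8.5
```

This is exactly the conventional-wisdom case for clocks over width on low-ILP code; the surprise about the M1 (discussed downthread) is how well the wide-and-slow point performs on branchy code anyway.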
The M1 is definitely a significantly different point in the design space. Intel is also doing big/little designs with Lakefield, but it's still a bit early to see where that goes for x64. I don't think Intel/AMD have specifically avoided going wider as fast as Apple; AVX/AVX2/AVX512 probably take up more die-area than going 1/3 wider, and that's what they've focused on with extensions over the years. If there is an x64 ISA limitation to going wider, we'll find out, but that's highly unlikely IMO.
> Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code.
It's not new, but it's surprising. You're correct that going a third wider at the cost of a third of clockspeed is a wash with "perfect code" but the experience of the last 10-20 years is that most code is far from perfect: https://www.realworldtech.com/shrinking-cpu/2/
> The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate. Execution efficiency (actual instruction execution rate divided by peak execution rate) dropped with increasing superscalar issue width because the amount of instruction level parallelism (ILP) in most programs is limited.... The ILP barrier is a major reason that high end x86 MPUs went from fully pipelined scalar designs to 2-way superscalar in three years and then to 3-way superscalar in another 3 years, but have been stuck at 3-way issue superscalar for the last nine years.
I agree there's probably no real x86-related limitation to going wider, if you've got a micro-op cache. As noted in the study referenced above, I suspect it's the result of very good branch prediction, memory disambiguation, and an extremely deep reorder window. Each of those is an engineering feat. Designing a CPU that extracts 80% more ILP than Zen 3 in branch-heavy integer benchmarks like SPEC GCC is a major engineering feat.
Nope. Anandtech measured 27W peak on M1 CPU workloads with the average closer to 20W+[1].
The Ryzen 4800U and i7-1165G7 also have comparable GPUs (and TPU+ISP for the i7) within the same ~15-25W TDP. The Intel i7-1165G7's average TDP might be closer to ~30W because of its 4.7GHz boost clock, but it's still comparable to the M1.
The i7-1165G7 and 4800U have a few laptop designs with soldered RAM. You can get 17hrs+ of video out of a 4800U laptop with a 60Wh battery[2]. Also comparable with i7-1065G7/i7-1165G7 at 15hrs+/50Wh.
Wasn’t 27W for the whole Mac Mini machine, measured with a meter at the wall plug? That includes losses in the power supply, the SSD, and everything else outside the chip that uses a bit of juice, whereas the AMD TDP is just the chip. I thought Anandtech said there was currently no reliable way to do an ‘apples to apples’ TDP comparison?
Edit: quote from anandtech:
“As we had access to the Mac mini rather than a Macbook, it meant that power measurement was rather simple on the device as we can just hook up a meter to the AC input of the device. It’s to be noted with a huge disclaimer that because we are measuring AC wall power here, the power figures aren’t directly comparable to that of battery-powered devices, as the Mac mini’s power supply will incur a efficiency loss greater than that of other mobile SoCs, as well as TDP figures contemporary vendors such as Intel or AMD publish.”
https://www.youtube.com/watch?v=_MUTS7xvKhQ&list=PLo11Rczpzu... Check this out, 12.5W power consumption for the M1 CPU vs. 68W CPU power consumption for the Intel i9 CPU of the 16” Macbook Pro, and yet the M1 is 8% faster in Cinebench R23 in multi-core score.
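Taking the video's figures at face value, the implied perf-per-watt gap is roughly 6x. This is just arithmetic on those two reported numbers, not an independent measurement:

```python
m1_watts, i9_watts = 12.5, 68.0   # reported CPU power, Cinebench R23 multi-core
i9_score = 1.00                   # i9 normalized to 1.0
m1_score = 1.08                   # M1 8% faster

efficiency_ratio = (m1_score / m1_watts) / (i9_score / i9_watts)
print(round(efficiency_ratio, 1))  # 5.9
```

Caveats from upthread apply: package power vs wall power, sustained vs boost, and node differences all muddy direct perf/watt comparisons.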
My naive assumption would be that 4c big + 4c little would perform better than 4c/8t all other things being equal (and assuming software was written to optimize for each design respectively). Also no reason you can't have 4c/8t big + 4c/8t little too.
> For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15).
That's a decoder for a single field, where the width of the field is the parameter it scales with. That would be the instruction size or smaller, and instructions don't change size depending on how many you decode at once.
And logically once you separate the instructions you can decode in parallel in fixed time, and if all your instructions are 4 bytes then it takes no circuitry to separate them.
Also: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."
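The fixed-size point can be sketched: with 4-byte instructions, each decode lane's start offset is known up front, so the lanes can work independently. A toy illustration (not how any real decoder is written):

```python
import struct

def split_fetch_group(fetch_block: bytes, width: int = 8):
    """Each lane slices its own 32-bit word at offset lane*4. No lane
    depends on another lane's result, so all can proceed in parallel."""
    return [struct.unpack_from("<I", fetch_block, lane * 4)[0]
            for lane in range(width)]

# With a variable-length ISA (like x86), lane N's start offset depends on
# the decoded lengths of lanes 0..N-1, which serializes or greatly
# complicates this step.
words = split_fetch_group(bytes(range(32)))
```

That dependency chain in the variable-length case is what micro-op caches and length-predecode bits in the instruction cache are commonly used to work around.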
Your first source does not support your statement.
While there is theoretically a quadratic component, in their words:
> We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width.
> That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.
Well, because it doesn't; it's ~25 watts. And also because it runs at just 3.2 GHz. You'll see similar power numbers from x86 CPUs at 3 GHz, too. The M1's multicore performance vs. the 4800U and 4900HS demonstrates this nicely.
I haven’t read the linked AnandTech article yet, but is there a clear answer why Apple was able to defy common comp arch wisdom (M1 has wider decode which works fine for various applications/code)?
Check the parent article that explains it well. Apple didn’t defy common comp arch wisdom... they applied it.
The reason it is hard for Intel/AMD to do the same is not a lack of engineering geniuses (I’m sure they have plenty), but the support for a legacy ISA, and a particular business model.
What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that seem impossible to beat? The answer seems obvious now... but it probably wasn’t obvious when Apple acquired PA Semi in 2008.
> What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that seem impossible to beat?
Having your own silicon means an upstream supplier can't turn the lights off on you (Samsung being a company that keeps a quarter of its host country's GDP hostage). I believe that was the immediate goal of the PA Semi purchase.
> The answer seems obvious now... but it probably wasn’t obvious when Apple acquired PA Semi in 2008.
PA Semi was clearly a diamond in the rough. It took great insight to single out PA Semi, because on the surface it was a very barebones SoC sweatshop, but in reality PA were the last of the Mohicans of US chip design.
PA was the place where non-Intel IC engineers went after the severe carnage of the microchip businesses of US tech giants like Sun, IBM, HP, DEC, SGI, etc.
It was a star team which back then was toiling at router box SoCs.