The answer, wide decode plus a deep reorder buffer, gets much closer to the truth than the “tricks” mentioned in tweets. But that still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.
The limit that keeps you from arbitrarily scaling up these numbers isn’t transistor count. It’s delay—how long it takes for complex circuits to settle, which drives the top clock speed. And it’s also power usage. The timing delay of many circuits inside a CPU scales super-linearly with things like decode width. For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15). The delay of the issue queues is quadratic both in the issue width and the depth of the queues. The delay of a full bypass network is quadratic in execution width. Decoding N instructions at a time also requires a register renaming unit that can perform register renaming for that many instructions per cycle, and the register file must have enough ports to be able to feed 2-3 operands to N different instructions per cycle. Additionally, big, multi-ported register files, deep and wide issue queues, and big reorder buffers also tend to be extremely power hungry.
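A toy model can illustrate how a quadratic term comes to dominate as width grows. The coefficients here are invented for illustration only; they are not taken from the cited paper:

```python
# Hypothetical delay model for a full bypass network: a linear wiring
# component plus a quadratic component (every result must be forwarded
# to every execution port). Coefficients are made up for illustration.
def bypass_delay(width, c_linear=1.0, c_quad=0.5):
    return c_linear * width + c_quad * width ** 2

# Doubling width from 4 to 8 more than triples the delay in this model.
ratio = bypass_delay(8) / bypass_delay(4)
print(round(ratio, 2))  # 3.33
```

The point is only that a structure with any quadratic component eventually punishes width much harder than a linear one would, which is why widths don’t simply scale with transistor budgets.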
On the flip side, the conventional wisdom is that most code doesn’t have enough inherent parallelism to take advantage of an 8-wide machine: https://www.realworldtech.com/shrinking-cpu/2/ (“The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate.”). At the very least, such designs tend to be very application-dependent. Branch-y integer code, like compilers, tends to perform poorly on such wide and slow designs. The M1 by contrast manages to come close to Zen 3, which is already a high-ILP CPU to begin with, despite a large clock speed deficit (3.2 GHz versus 5 GHz). And the performance seems to be robust—doing well on everything from compilation to scientific kernels. That’s really phenomenal and blows a lot of the conventional wisdom out of the water.
An insane amount of good engineering went into this CPU.
> An insane amount of good engineering went into this CPU.
I agree, but let's not overblow the difficulties either.
> For example, the delay in the decode stage itself scales quadratically with the width of the decoder.
That could be irrelevant for small enough numbers, and ARM is easier to decode than x86. So this can very well be dominated by other things. What you cite seems to be only about decoding the logical registers going into the renaming structures, and even for just that tiny part it says: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."
> The delay of a full bypass network is quadratic in execution width.
Maybe if that's a problem don't do a full bypass network.
> dropping rapidly as both a function of increasing width and increasing clock rate
Good thing that the clock rate is not too high then :p
More seriously, the M1 can probably keep the beast fed because everything is dimensioned correctly (and yes, also because the clocks are not too high; but if you manage to make a wide and slow CPU that actually works well, I don't see why you would want to scale the frequency much higher, given that you would quickly consume power like crazy, and there is only limited headroom above 3.2GHz anyway). It obviously helps to have a gigantic OOO. So I don't really see where there is so much surprise, esp. since we saw the progression in the A series.
To finish, TSMC 5nm probably does not hurt. The competitors are on bigger nodes and have smaller structures. Coincidence? Or just how it has worked for decades already.
It's not completely groundbreaking, but painting it as an outgrowth of existing trends doesn't give Apple enough credit. The challenges of scaling wider CPUs within available power budgets are widely accepted: https://www.cse.wustl.edu/~roger/560M.f18/CSE560-Superscalar... (for "high-performance per watt cores," optimal "issue width is ~2"). Intel designed an entire architecture, Itanium, around the theory that OOO scaling would hit a point of diminishing returns. https://www.realworldtech.com/poulson ("Many of them were convinced that dynamic instruction scheduling and out-of-order execution would ultimately prove to be too complex and power hungry."). It is also well accepted that we are hitting limits on our ability to extract instruction-level parallelism: https://docencia.ac.upc.edu/master/MIRI/PD/docs/03-HPCProces... ("There just aren’t enough instructions that can actually be executed in parallel!"); https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp16/cs... ("Hardly more than 1-2 IPC on real workloads").
Apple being able to wring robust performance out of an 8-wide 3.2 GHz design, on a variety of benchmarks, is impressive and unexpected. For example, the M1 outperforms a Ryzen 5950X by 15%. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste.... Zen 3 is either 4- or 8-wide decode (depending on whether you hit the micro-op cache) and boosts to 5 GHz. It beats the 10900k, a 4+1-way design that boosts to 5.1 GHz, by 25%. The GCC subtest, meanwhile, is famous for being branch-heavy code with low parallelism. Apple extracting 80% more IPC from that test than AMD's latest core (which is already a very impressive, very wide core to begin with!) is very unexpected.
A lot of the conventional wisdom is based on assumptions about branch prediction and memory disambiguation, which have major impacts on how much ILP you can extract: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... To do so well, Apple must be doing something very impressive on both fronts.
The i7-1165G7 is within 20% of the M1 on single-core performance. The Ryzen 4800U is within 20% on multi-core performance. Both are sustained ~25W parts similar to the M1. If you turned x64 SMT/AVX2 off and normalized for cores (Intel/AMD 4/8 vs Apple 4+4), on-die cache (Intel/AMD 12MB L2+L3 vs Apple 32MB L2+L3), and frequency (Apple 3.2 vs AMD/Intel 4.2/4.7), you'd likely get very close results on the same 5nm or 7nm-equivalent process. Zen 3 with 2666 vs 3200 RAM alone is about a 10% difference. The M1 uses 4266 RAM IIRC.
TBH, laptop/desktop-level performance is indeed very nice to see out of ARM, after a few years of false starts by a few startups and Qualcomm. Apple designed a wider core and deserves credit for it, but wider cores have been a definite trend starting with the Pentium M vs the Pentium 4. There is a trade-off here for die area IMO: AMD/Intel have AVX2 (and even AVX-512) and SMT on each core, with narrower cores (smaller structures, higher frequency); Apple has wider cores (larger structures, lower frequency, higher IPC). It's not that simple, but it kind of is if you squint a bit.
The i7-1165G7 boosts to 4.7 GHz, 50% higher than the M1. A 75% uplift in IPC (20% more performance at 2/3 the clock speed) compared to Intel’s latest Sunny Cove architecture is enormous. Especially since Sunny Cove is itself the biggest update to Intel’s architecture since Sandy Bridge a decade ago.
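Back-of-the-envelope arithmetic behind that 75% figure, assuming (simplistically) that performance scales linearly with clock and taking the ~20% single-core gap from upthread at face value:

```python
m1_clock, intel_clock = 3.2, 4.7    # GHz (M1 vs i7-1165G7 boost)
intel_score = 1.00                  # normalized single-core score
m1_score = 1.20                     # M1 ~20% faster single-core

# Performance per GHz is a crude stand-in for IPC.
m1_per_ghz = m1_score / m1_clock
intel_per_ghz = intel_score / intel_clock
ipc_uplift = m1_per_ghz / intel_per_ghz - 1
print(f"{ipc_uplift:.0%}")  # 76%, i.e. the ~75% figure above
```

Real IPC comparisons are messier than this (boost residency, memory latency in wall-clock terms, etc.), but the rough magnitude holds.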
Like I said, this is absolutely a die-size tradeoff IMO. That 75% IPC gain is only around a ~20% difference in Geekbench at similar sustained power levels. If you want AVX2/512 + SMT, a slightly narrower core, realistically 6+ wide (up to 8-wide with a uop cache), is an acceptable tradeoff. We have seen Zen 3 go wider than Zen 1/2[1], so wider x64 designs with AVX/SMT should be coming, but this is the squinting part with TSMC 5nm vs 7nm.
Intel’s 10nm is equivalent to TSMC’s 7nm, so we’re just talking one generation on the process side. I don’t think you can chalk a 75% IPC gain to a die shrink. That’s a much bigger IPC uplift than Intel has achieved from Sandy Bridge to Sunny Cove, which happened over 4-5 die shrinks.
The total performance gain, comparing a 4.7 GHz core to a 3.2 GHz core, is 20%. But there is more to it than the bottom line. The conventional wisdom would tell you that increasing clock speed is better than making the core wider because of diminishing returns to chasing IPC. Intel has followed the CW for generations: it has made its cores modestly wider and deeper, but has significantly increased clock speed. Intel doubled the size of the reorder buffer from Sandy Bridge to Sunny Cove, and increased issue width from 5 to 6 over 10 years.
If your goal was to achieve a 20% speed-up compared to Sunny Cove, in one die shrink, the CW would be to make it a little wider and a little deeper but try to hit a boost clock well north of 5 GHz. It wouldn’t tell you to make it a third wider and twice as deep at the cost of dropping the boost clock by a third. Apple isn’t just enjoying a one-generation process advantage, but is hitting a significantly different point in the design space.
Superscalar vs super-pipelining isn't new. If there's no magic, then with perfect code going a third wider would be exactly offset by a boost clock a third lower. With SMT off, I get 25-50% more performance on single-threaded benchmarks; that's because a thread does get full access to 50% more decode/execution units in the same cycle. It's not that simple again, but that's likely the simplest example.
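The "wash with perfect code" point can be sketched with a crude throughput model in which per-cycle issue is capped by both the machine's width and the ILP the code exposes. All numbers here are illustrative, not measurements:

```python
def perf(width, clock_ghz, code_ilp):
    # instructions/sec ~ min(ILP the code exposes, machine width) * clock
    return min(code_ilp, width) * clock_ghz

# Perfect code (ILP >= 8): a third wider at a third less clock is a wash.
wide_slow   = perf(8, 3.2,  code_ilp=8.0)   # 25.6
narrow_fast = perf(6, 4.27, code_ilp=8.0)   # ~25.6

# Branchy code (ILP ~2): extra width is wasted and the higher clock wins.
wide_slow_branchy   = perf(8, 3.2,  code_ilp=2.0)   # 6.4
narrow_fast_branchy = perf(6, 4.27, code_ilp=2.0)   # ~8.5
```

This is exactly the conventional-wisdom case for clocks over width on low-ILP code; the surprise about the M1 (discussed downthread) is how well the wide-and-slow point performs on branchy code anyway.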
The M1 is definitely a significantly different point in the design space. Intel is also doing big/little designs with Lakefield, but it's still a bit early to see where that goes for x64. I don't think Intel/AMD have specifically avoided going wider as fast as Apple; AVX/AVX2/AVX512 probably take up more die-area than going 1/3 wider, and that's what they've focused on with extensions over the years. If there is an x64 ISA limitation to going wider, we'll find out, but that's highly unlikely IMO.
> Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code.
It's not new, but it's surprising. You're correct that going a third wider at the cost of a third of clockspeed is a wash with "perfect code" but the experience of the last 10-20 years is that most code is far from perfect: https://www.realworldtech.com/shrinking-cpu/2/
> The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate. Execution efficiency (actual instruction execution rate divided by peak execution rate) dropped with increasing superscalar issue width because the amount of instruction level parallelism (ILP) in most programs is limited.... The ILP barrier is a major reason that high end x86 MPUs went from fully pipelined scalar designs to 2-way superscalar in three years and then to 3-way superscalar in another 3 years, but have been stuck at 3-way issue superscalar for the last nine years.
I agree there's probably no real x86-related limitation to going wider, if you've got a micro-op cache. As noted in the study referenced above, I suspect it's the result of very good branch prediction, memory disambiguation, and an extremely deep reorder window. Each of those is an engineering feat. Designing a CPU that extracts 80% more ILP than Zen 3 in branch-heavy integer benchmarks like SPEC GCC is a major engineering feat.
Nope. Anandtech measured 27W peak on M1 CPU workloads with the average closer to 20W+[1].
The Ryzen 4800U and i7-1165G7 also have comparable GPUs (and TPU+ISP for the i7) within the same ~15-25W TDP. The Intel i7-1165G7's average TDP might be closer to ~30W because of its 4.7GHz boost clock, but it's still comparable to the M1.
The i7-1165G7 and 4800U have a few laptop designs with soldered RAM. You can get 17hrs+ of video out of a 4800U laptop with a 60Wh battery[2]. Also comparable with i7-1065G7/i7-1165G7 at 15hrs+/50Wh.
Wasn’t 27W for the whole Mac Mini machine, measured with a meter at the wall plug? That includes losses in the power supply, the SSD, and everything else outside the chip that uses a bit of juice, whereas the AMD TDP is just the chip. I thought Anandtech said there was currently no reliable way to do an ‘apples to apples’ TDP comparison?
Edit: quote from anandtech:
“As we had access to the Mac mini rather than a Macbook, it meant that power measurement was rather simple on the device as we can just hook up a meter to the AC input of the device. It’s to be noted with a huge disclaimer that because we are measuring AC wall power here, the power figures aren’t directly comparable to that of battery-powered devices, as the Mac mini’s power supply will incur a efficiency loss greater than that of other mobile SoCs, as well as TDP figures contemporary vendors such as Intel or AMD publish.”
https://www.youtube.com/watch?v=_MUTS7xvKhQ&list=PLo11Rczpzu... Check this out, 12.5W power consumption for the M1 CPU vs. 68W CPU power consumption for the Intel i9 CPU of the 16” Macbook Pro, and yet the M1 is 8% faster in Cinebench R23 in multi-core score.
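Taking the video's figures at face value, the implied perf-per-watt gap is roughly 6x. This is just arithmetic on those two reported numbers, not an independent measurement:

```python
m1_watts, i9_watts = 12.5, 68.0   # reported CPU power, Cinebench R23 multi-core
i9_score = 1.00                   # i9 normalized to 1.0
m1_score = 1.08                   # M1 8% faster

efficiency_ratio = (m1_score / m1_watts) / (i9_score / i9_watts)
print(round(efficiency_ratio, 1))  # 5.9
```

Caveats from upthread apply: package power vs wall power, sustained vs boost, and node differences all muddy direct perf/watt comparisons.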
My naive assumption would be that 4c big + 4c little would perform better than 4c/8t all other things being equal (and assuming software was written to optimize for each design respectively). Also no reason you can't have 4c/8t big + 4c/8t little too.
> For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15).
That's a decoder for a single field, where the width of the field is the parameter it scales with. That would be the instruction size or smaller, and instructions don't change size depending on how many you decode at once.
And logically once you separate the instructions you can decode in parallel in fixed time, and if all your instructions are 4 bytes then it takes no circuitry to separate them.
Also: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."
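The fixed-size point can be sketched: with 4-byte instructions, each decode lane's start offset is known up front, so the lanes can work independently. A toy illustration (not how any real decoder is written):

```python
import struct

def split_fetch_group(fetch_block: bytes, width: int = 8):
    """Each lane slices its own 32-bit word at offset lane*4. No lane
    depends on another lane's result, so all can proceed in parallel."""
    return [struct.unpack_from("<I", fetch_block, lane * 4)[0]
            for lane in range(width)]

# With a variable-length ISA (like x86), lane N's start offset depends on
# the decoded lengths of lanes 0..N-1, which serializes or greatly
# complicates this step.
words = split_fetch_group(bytes(range(32)))
```

That dependency chain in the variable-length case is what micro-op caches and length-predecode bits in the instruction cache are commonly used to work around.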
Your first source does not support your statement.
While there is theoretically a quadratic component, in their words:
> We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width.
> That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.
Well, because it doesn't; it's ~25 watts. And also because it runs at just 3.2 GHz. You'll see similar power numbers from x86 CPUs at 3 GHz, too. The M1's multicore performance vs. the 4800U and 4900HS demonstrates this nicely.
I haven’t read the linked AnandTech article yet, but is there a clear answer why Apple was able to defy common comp arch wisdom (M1 has wider decode which works fine for various applications/code)?
Check the parent article that explains it well. Apple didn’t defy common comp arch wisdom... they applied it.
The reason it is hard for Intel/AMD to do the same is not a lack of engineering geniuses (I’m sure they have plenty), but the support for a legacy ISA, and a particular business model.
What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that seem impossible to beat? The answer seems obvious now... but it probably wasn’t obvious when Apple acquired PA Semi in 2008.
> What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that seem impossible to beat?
Having your own silicon means an upstream supplier can't turn the lights off on you (Samsung being a company that keeps a quarter of its host country's GDP hostage). I believe that was the immediate goal of the PA Semi purchase.
> The answer seems obvious now... but it probably wasn’t obvious when Apple acquired PA Semi in 2008.
PA Semi was clearly a diamond in the rough. It took great insight to single out PA Semi, because on the surface it was a very barebones SoC sweatshop, but in reality PA were the last of the Mohicans of US chip design.
PA was the place where non-Intel IC engineers went after the severe carnage of the microchip businesses of US tech giants like Sun, IBM, HP, DEC, SGI, etc.
It was a star team which back then was toiling at router box SoCs.