
IMO, the most interesting thing about this line is the battery life---within an hour of the MBP3 and within 2 hours of Asus's Qualcomm, making it comparable to ARM architectures.

Which is a little surprising because ARM is commonly believed to be much more power efficient than x86.

[1] https://youtu.be/Z8WKR0VHfJw?si=A7zbFY2lsDa8iVQN&t=277


ARM got a lot of hype since the release of the M1, but most users only compared it to the terrible Intel MBPs. Ryzen mobile has been consistently close to Apple silicon perf/watt for 5 years. But got little press coverage.

Hype can be really decorrelated from real world performance.


Any efficiency comparison involving Apple's chips also has to factor in that Tim Cook keeps showing up at TSMC's door with a freight container full of cash to buy out exclusive access to their bleeding-edge silicon processes. ARM may be a factor, but don't underestimate the power of having more money than God.

Case in point: Strix Point is built on TSMC 4nm while Apple is already using TSMC's second-generation 3nm process.


Let's do the math on M1 Pro (10-core, N5, 2021) vs HX370 (12-core, N4P, 2024).

Firestorm without L3 is 2.281mm2. Icestorm is 0.59mm2. M1 Pro has 8P+2E for a total of 19.428mm2 of cores included.

Zen4 without L3 is 3.84mm2. Zen4c reduces that down to 2.48mm2. Zen5 CCD is pretty much the same size as Zen4 (though with 27% more transistors), so core size should be similar. AMD has also stated that Zen5c has a similar shrink percent to Zen4c. We'll use their numbers. HX370 has 4P+8C for a total area of 35.2mm2. If being twice the size despite being on N4P instead of N5 like M1 seems like foreshadowing, it is.

We'll use notebookcheck's Cinebench 2024 multithread power and performance numbers to calculate perf / power / area then multiply that by 100 to eliminate some decimals.

M1 Pro scores 824 (10-core) and while they don't have a power value listed, they do list 33.6w package power running the prime95 power virus, so cinebench's power should be lower than that.

HX370 scored 1213 (12-core) and averaged 119w (maxing at a massive 121.7w and that's without running a power virus).

This gives the following perf/power/area*100 scores:

M1 Pro — 126 PPA

HX 370 — 29 PPA

M1 is more than 4.3x better while being an entire node behind and being released years before.
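
Spelled out, the arithmetic is just the following (a quick Python sketch using the numbers above; remember the 33.6w figure is prime95 package power, so it's a ceiling for the M1 Pro's Cinebench power):

    def ppa(score, watts, area_mm2):
        # perf / power / area, multiplied by 100 as above
        return score / watts / area_mm2 * 100

    m1_pro_area = 8 * 2.281 + 2 * 0.59    # 19.428 mm2 of P+E cores
    hx370_area  = 4 * 3.84  + 8 * 2.48    # 35.2 mm2 of P+C cores

    print(ppa(824, 33.6, m1_pro_area))    # ~126 (M1 Pro; real PPA likely higher)
    print(ppa(1213, 119, hx370_area))     # ~29  (HX 370)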


119W for the HX370 looks extremely sus; it seems to me more like system-level power consumption and not CPU-only.

According to phoronix [1,2], in their blender CPU test, they measured a peak of 33W.

Here are max power numbers from some other tests that I know are multi-threaded:

--

Linux 6.8 Compilation: 33.13 W

LLVM Compilation: 33.25 W

--

If I plug 33W into your equation, that would give us a score of 104 PPA for the HX 370.
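
Same formula as above, for a quick check:

    print(1213 / 33 / 35.2 * 100)   # ~104 PPA for the HX 370 at 33 W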

This supports the HX 370 being pretty power efficient, although still not as power efficient as M3.

[1] https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/3

[2] https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/4


https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...

They got those kinds of numbers across multiple systems. You can take it up with them I guess.

I didn't even mention one of these systems was peaking at 59w on single-core workloads.


I see what's going on, they have two HX370 laptops:

  Laptop  MC score  Avg Power
     P16      1213      113 W
     S16       921       29 W
  M3 Pro      1059    (30 W?)
They don't have M3 Pro power numbers, but I assume it is somewhere around 30 W. At that power level the HX 370 (the S16) seems to have power efficiency similar to the M3 Pro's.

Any more power and the CPU becomes much less power efficient: a ~300% increase in power for a ~30% increase in performance.


This is true for every CPU. Past a certain point power consumption scales quadratically with performance.
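
As a rough illustration, here's a back-of-envelope fit using the two HX 370 rows from the table above (these are averaged wall measurements from two different laptops, so treat the exponent as illustrative only):

    from math import log

    # P16: 1213 points at ~113 W, S16: 921 points at ~29 W (table above)
    exponent = log(113 / 29) / log(1213 / 921)
    print(round(exponent, 1))   # ~4.9: power climbing far faster than
                                # performance at the top of the curve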


On Cinebench vs. Geekbench vs. SPEC: https://old.reddit.com/r/hardware/comments/pitid6/eli5_why_d... (that one is about Cinebench R20). An overview of Cinebench 2024 CPU & GPU(!) scores: https://www.cgdirector.com/cinebench-2024-scores/


Even with the M3 the difference is marginal in multi-threaded benchmarks, from the Cinebench link [1] someone posted earlier on the thread.

    Apple M3 Pro 11-Core - 394 Points per Watt
    AMD Ryzen AI 9 HX 370 - 354 Points per Watt
    Apple M3 Max 16-Core - 306 Points per Watt
And the Ryzen is on TSMC 4nm while the M3 is on 3nm. As the parent is saying, a lot of the Apple Silicon hype was due to the massive upgrade it was over the Intel CPUs Apple was using previously.

[1]: https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...


Their efficiency tests use Cinebench R23 (as called out explicitly).

R23 is not optimized for Apple silicon but is for x86. The R24 numbers are actually what you need for a fair comparison, otherwise you put the Arm numbers at a significant handicap.


That the M3 Max should come out worse than the M3 Pro is a little bit shady.


Cinebench might not be the most relevant benchmark; it uses lots of scalar instructions with fairly high branch mispredictions and low IPC: https://chipsandcheese.com/2021/02/22/analyzing-zen-2s-cineb....


Power efficiency is a curve, and Apple may have its own reasons not to make the M1 Pro run at 110W as well.


I think the OC might have misread the power numbers; 110 W is well into desktop CPU power range. Here is an excerpt from AnandTech:

> In our peak power test, the Ryzen AI 9 HX 370 ramped up and peaked at 33 W.

https://www.anandtech.com/show/21485/the-amd-ryzen-ai-hx-370...


You can read the notebookcheck review for yourself.

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...


Those 100W+ numbers are total system power. And that system has the CPU TDP set to 80W (far above AMD's official max of 54W). It also has a discrete 4070 GPU that can use over 100W on its own.


If x86 laptops have 90w of platform power, that's concerning in itself, not a reasonable defense.

Remember, Apple laptops have screens too, etc., and that shows up in the average system power measurements the same way. What's the difference in an x86 laptop?

I really doubt it's actually platform power, the problem is that x86 is boosting up to 35W average/60W peak per thread. 120W package power isn't unexpected, if you're boosting 3-4 cores to maximum!

And that's the problem. x86 is far far worse at race-to-sleep. It's not just "macos has better scheduling"... you can see from the 1T power measurements that x86 is simply drawing 2-3x the power while it's racing-to-sleep, for performance that's roughly equivalent to ARM.

Whatever the cause, whether it's just bad design from AMD and Intel, or legacy x86 cruft (I don't get how this applies to actual computational load though, as opposed to situations like idle power), or what... there is no getting around the fact that the M2 tops out at 10W per core while an 8840HS or HX370 or Intel Meteor Lake boosts to 30-35W at 1T loads.


I stacked the deck in AMD's favor using a 3-year-old chip on an older node.

Why is AMD using 3.6x more power than M1 to get just 32% higher performance while having 17% more cores? Why are AMD's cores nearly 2x the size despite being on a better node and having 3 more years to work on them?

Why are Apple's scores the same on battery while AMD's scores drop dramatically?

Apple does have a reason not to run at 120w -- it doesn't need to.

Meanwhile, if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad.


You should try not to talk so confidently about things you don't know about -- this statement

> if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad

Is completely incorrect, as another commenter (and I think the notebookcheck article?) point out -- 30w is about the sweet spot for these processors, and the reason that 110w laptop seems so inefficient is because it's giving the APU 80w of TDP, which is a bit silly since it only performs marginally better than if you gave it e.g. 30 watts. It's not a good idea to take that example as a benchmark for the APU's efficiency, it varies depending on how much TDP you give the processor, and 80w is not a good TDP for these


Halo products with high scores sell chips. This isn’t a new idea.

So you lower the wattage down. Now you’re at M1 Pro levels of performance with 17% more cores and nearly double the die area and barely competing with a chip 3 years older while on a newer, more expensive node too.

That’s not selling me on your product (and that’s without mentioning the worst core latency I’ve seen in years when going between P and C cores).


> if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad

I’m writing this comment on an HP ProBook 445 G8 laptop. I believe I bought it in early 2022, so it's a relatively old model. The laptop has a Ryzen 5 5600U processor which uses ≤ 25W. I’m quite happy with both the performance and battery life.


It's well known that performance doesn't scale linearly with power.

Benchmarking incentives on PC have long pushed X86 vendors to drive their CPUs at points of the power/performance curve that make their chips look less efficient than they really are. Laptop benchmarking has inherited that culture from desktop PC benchmarking to some extent. This is slowly changing, but Apple has never been subject to the same benchmarking pressures in the first place.

You'll see in reviews that Zen5 can be very efficient when operated in the right power range.


Zen5 can be more efficient at lower clockspeeds, but then it loses badly to Apple's chips in raw performance.


> I stacked the deck in AMD's favor using a 3-year-old chip on an older node.

You could just compare the ones that are actually on the same process node:

https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073_14...

But then you would see an AMD CPU with a lower TDP getting higher benchmark results.

> Why is AMD using 3.6x more power than M1 to get just 32% higher performance while having 17% more cores?

Getting 32% higher performance from 17% more cores implies higher performance per core.

The power measurements that site uses are from the plug, which is highly variable to the point of uselessness because it takes into account every other component the OEM puts into the machine and random other factors like screen brightness, thermal solution and temperature targets (which affects fan speed which affects fan power consumption) etc. If you measure the wall power of a system with a discrete GPU that by itself has a TDP >100W and the system is drawing >100W, this tells you nothing about the efficiency of the CPU.

AMD's CPUs have internal power monitors and configurable power targets. At full load there is very little light between the configured TDP and what they actually use. This is basically required because the CPU has to be able to operate in a system that can't dissipate more heat than that, or one that can't supply more power.

> Meanwhile, if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad.

33W is approximately what their mobile CPUs actually use. Also, even lower-configured TDP models exist and they're not that much slower, e.g. the 7840U has a base TDP of 15W vs. 35W for the 7840HS and the difference is a base clock of 3.3GHz instead of 3.8GHz.


> Getting 32% higher performance from 17% more cores implies higher performance per core.

I don't disagree that it is higher perf/core. It is simply MUCH worse perf/watt because they are forced to clock so high to achieve those results.

> The power measurements that site uses are from the plug, which is highly variable to the point of uselessness

They measure the HX370 using 119w with the screen off (using an external monitor). What on that motherboard would be using the remaining 85+W of power?

TDP is a suggestion, not a hard limit. Before thermal throttling, they will often exceed the TDP by a factor of 2x or more.

As to these specific benchmarks, the R9 7945HX3D you linked to used 187w while the M2 Max used 78w for CB R15. As to perf/watt, Cinebench before 2024 wasn't using NEON properly on ARM, but was using Intel's hyper-optimized libraries for x86. You should be looking at benchmarks without such a massive bias.


> I don't disagree that it is higher perf/core. It is simply MUCH worse perf/watt because they are forced to clock so high to achieve those results.

The base clock for that CPU is nominally 2 GHz.

> They measure the HX370 using 119w with the screen off (using an external monitor). What on that motherboard would be using the remaining 85+W of power?

For the Asus ProArt P16 H7606WI? Probably the 115W RTX 4070.

> TDP is a suggestion, not a hard limit. Before thermal throttling, they will often exceed the TDP by a factor of 2x or more.

TDP is not really a suggestion. There are systems that can't dissipate more than a specific amount of heat and producing more than that could fry other components in the system even if the CPU itself isn't over-temperature yet, e.g. because the other components have a lower heat tolerance. There are also systems that can't supply more than a specific amount of power and if the CPU tried to non-trivially exceed that limit the system would crash.

The TDP is, however, configurable, including different values for boost. So if the OEM sets the value to the higher end of the range even though their cooling solution can't handle it, the CPU will start out there and gradually lower its power use as it becomes thermally limited. This is not the same as "TDP is a suggestion", it's just not quite as simple as a single number.

> As to these specific benchmarks, the R9 7945HX3D you linked to used 187w while the M2 Max used 78w for CB R15.

Which is the same site measuring power consumption at the plug on an arbitrary system with arbitrary other components drawing power. Are they even measuring it through the power brick and adding its conversion losses?

These CPUs have internal power meters. Doing it the way they're doing it is meaningless and unnecessary.

> You should be looking at benchmarks without such a massive bias.

Do you have one that compares the same CPUs on some representative set of tests and actually measures the power consumption of the CPU itself? Diligently-conducted benchmarks are unfortunately rare.

Note however that the same link shows the 7945HX3D also ahead in Blender, Geekbench ST and MT, Kraken, Octane, etc. It's consistently faster on the same process, and has a lower TDP.


lmao he’s citing cinebench R15? Which isn’t just ancient but actually emulated on arm, of course.

Really digging through the vaults for that one.

Geekbench 6 is perfectly fine for that stuff. But that still shows Apple tying in MT and beating the pants off x86 in 1T efficiency.

x86 1T boosts being silly is where the real problem comes from. But if they don’t throw 30-35w at a single thread they lose horribly.


> lmao he’s citing cinebench R15?

It's the only one where they measured the power use. I don't get to decide which tests they run. But if their method of measuring power use is going to be meaningless then the associated benchmark result might as well be too, right?

> Geekbench 6 is perfectly fine for that stuff. But that still shows Apple tying in MT and beating the pants off x86 in 1T efficiency.

It shows Apple behind by 8% in ST and 12% in MT with no power measurement for that test at all, but an Apple CPU with a higher TDP. Meanwhile the claim was that AMD hadn't even caught up on the same process, which isn't true.

> x86 1T boosts being silly is where the real problem comes from. But if they don’t throw 30-35w at a single thread they lose horribly.

They don't use 30-35W for a single thread on mobile CPUs. The average for the HX 370 from a set of mostly-threaded benchmarks was 20W when you actually measure the power consumption of the CPU:

https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/13

On single-threaded tests like PyBench the average was 10W:

https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/9

34W was the max across all tests, presumably the configured TDP for that system, derived from the tests like compiling LLVM that max out arbitrarily many cores.


Process helps, but have you seen benchmarks showing equivalent performance on the same process node? I think it's less that ARM is amazing than that the Apple Silicon team is very good and paired with aggressive optimization throughout the stack, but everything I've seen suggests they are simply building better chips at their target levels (not server, high power, etc.).


> Our benchmark database shows the Dimensity 9300 scores 2,207 and 7,408 in Geekbench 6.2's single and multi-core tests. A 30% performance improvement implies the Dimensity 9400 would score around 2,869 and 9,630. Its single-core performance is close to that of the Snapdragon 8 Gen 4 (2,884/8,840) and it understandably takes the lead in multi-core. Both are within spitting distance from the Apple A17 Pro, which scores 2,915 and 7,222 points in the benchmark. Then again, all three chips are said to be manufactured on TSMC's N3 class node, effectively leveling the playing field.

https://www.notebookcheck.net/MediaTek-Dimensity-9400-rumour...


That appears to be an unconfirmed rumor and it’s exciting if true (and there aren’t major caveats on power), but did you notice how they mentioned extra work by ARM? The argument isn’t that Apple is unique, it’s that the performance gaps they’ve shown are more than simply buying premium fab capacity.

That doesn’t mean other designers can’t also do that work, but simply that it’s more than just the process - for example, the M2 shipped on TSMC’s N5P first as an exclusive but when Zen 5 shipped later on the same process it didn’t close the single core performance or perf/watt gap. Some of that is x86 vs. ARM but there isn’t a single, simple factor which can explain this - e.g. Apple carefully tuning the hardware, firmware, OS, compilers, and libraries too undoubtably helps a lot and it’s been a perennial problem for non-Intel vendors on the PC side since so many developers have tuned for Intel first/only for decades.


> for example, the M2 shipped on TSMC’s N5P first as an exclusive but when Zen 5 shipped later on the same process it didn’t close the single core performance or perf/watt gap.

That was Zen 4, but it did close the gap:

https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073_14...

Single thread performance is higher (so is MT), TDP is slightly lower, Cinebench MT "points per watt" is 5% higher.

We'll get to see it again when the 3nm version of Zen5 is released (the initial ones are 4nm, which is a node Apple didn't use).


Since it's unclear whether Apple has a significant architectural advantage over Qualcomm and MediaTek, I would rather attribute this to relatively poor AMD architectures. Provisionally. At least their GPUs have been behind Nvidia for years. (AMD holding its own against Intel is not surprising given Intel's chip fab problems.)


Yes, to be clear I’d be very happy if MediaTek jumps in with a strong contender since consumers win. It doesn’t look like the Qualcomm chips are performing as well as hoped but I’d wait a bit to see how much tuning helps since Windows ARM was not a major target until now.


I guess getting close to the same single-thread score is nice. Unfortunately, since only Apple is shipping, it is hard to compare whether the others burn the battery to get there.

I suspect the other two, like Apple with the A18 shipping next month, will be using the second-gen N3. Apple is expected to be around 3500 on that node.

Needless to say, what will be very interesting is to see the perf/watt of all three on the same node and shipping in actual products where the benchmarks can be put to more useful tests.


Yeah, and GPU tests, since the benchmarks above were only for the CPU.


But how is this the case? I never saw a single article mentioning that a non-Mac laptop was better.

(Random article saying M3 pro is better than a Dell laptop https://www.tomsguide.com/news/macbook-pro-m3-and-m3-max-bat... )


You're right, but... The idea comes from the desktop world. AMD's Zen 4 desktop CPUs, especially the gaming variants like the Ryzen 7 7800X3D, almost match the performance per watt of Apple's M3.

Their laptop CPUs (some companies did release the same model with different CPUs) were less efficient than Intel's.

But the Asus ProArt P16 (used in the article) did manage an extreme endurance score in the video test (Big Buck Bunny H.264 1080p, run at 150 cd/m²) with 21 hours. With its higher resolution, OLED panel and 10% less battery capacity, that's 40 minutes better than the MacBook Pro 16 M3 Max. In the wifi test, also run at 150 cd/m², the M3 ran for 16 hours, the Asus 8. ( https://www.notebookcheck.net/Asus-ProArt-P16-laptop-review-... )

For me noise matters; that Asus has a whisper mode which produces 42 dB, as much as an M3 Max under full load. Please be aware that if you're susceptible to PWM, that Asus laptop has issues.


I have heard that part of the reason for little coverage of ryzen mobile CPUs is their limited availability as AMD was focussing on using the fab capacity for server chips.


I think that's because all the press talks about actual battery life per laptop, and the Apple Silicon laptops ship with literally double the battery capacity of any AMD-based laptop without a discrete GPU. So while the efficiency may be close, the actually perceived battery life of the Mac will be more than double when you also consider the priority Apple puts into their power control combined with a larger overall battery.


Ryzen mobile is consistently close, yeah. But with the sole exception of the Steam deck, I've yet to see a Ryzen mobile-bearing laptop, Windows included, which is close to the overall performance of the Macbook.


"overall performance" does a lot of work here. On sheer benchmarks it's really comparable, with AMD being slightly better depending on what you look at. e.g. the M1 vs the 5700U (a similar class widely available mobile CPU):

https://www.cpubenchmark.net/cpu.php?cpu=AMD%20Ryzen%207%205...

https://www.cpubenchmark.net/cpu.php?cpu=Apple+M1+8+Core+320...

They're not profiled the same, and don't belong in the same ecosystem though, which makes a lot more difference than the CPUs themselves. In particular, the AMD doesn't get a dedicated compiler optimizing every application of the system to its strengths and weaknesses (the other side of it being compatibility with the two vastest ecosystems we have now).


Depends on what you mean by "overall performance", but my Asus ROG Zephyrus G14 2023 is full AMD, and outperforms my work issued top of the line M1 MacBook Pro from a few months earlier in every task I've done across the two (gaming, compiling, heavy browsing). Battery life is lower under heavy load and high performance on the Zephyrus, but in power saving mode it's roughly comparable, albeit still worse.


Same here, my G14 and the M1 MBP are pretty much interchangeable for most workloads. The only time the G14 starts its fans is when the 4070 turns on... and that's not an option on the M1 at all.


> But with the sole exception of the Steam deck

Uuh wut? The Steam Deck is like 3-generation-old hardware in mobile Ryzen terms. In a lot of ways it's similar to a pared-back 4800u with fewer (and older) cores, and a slightly bumped up GPU.

To me it's kinda the opposite. Excluding the Steam Deck, I think most of AMD's Ultrabook APUs have been very close to the products Apple's made on the equivalent nodes. Even on 7nm the 4800u put up a competitive fight against M1, and the gap has gotten thinner with each passing year. According to the OpenCL benchmarks, the Radeon 680m on 6nm scores higher than the M1 on 5nm: https://browser.geekbench.com/opencl-benchmarks

Even back when Ryzen Mobile only shipped with Vega, it was pretty clear that Apple and AMD were a pretty close match in onboard GPU power.


The Steam Deck might be behind in terms of hardware, but in terms of software it's way beyond your typical x86 Linux system in power efficiency, and dare I say it's doing better than Windows machines with their typically shoddy BIOSes and drivers, especially when you consider all the extraneous services constantly sapping varying amounts of CPU time. All that contributes to making the SD punch well above its weight.


My Alienware M15 Ryzen edition gets 7-8W power consumption by just running "sudo powertop --autotune". Basically all of the power efficiency stuff in the Steam Deck apply to other Ryzen systems and are in the mainline kernel.


Battery tests are important, but so is how it fares on battery (what is the performance drop-off to maintain that), what its performance is at its peak, and how long before it throttles when pushed.

The M series processors have succeeded in all four: battery life, performance parity between battery and plugged in, high performance and performance sustainability.

So far, very few benchmarks have been comparing the latter three as part of the full package assessment.


> because ARM is commonly believed to be much more power efficient than x86.

Because most ARM processors were designed for mobile phones and optimised to death for power efficiency.

The total power usage of the front end decoders is a single digit percentage of the total power draw. Even if ARM magically needed 0 watts for this, it couldn’t save more power than that. The rest of the processor design elements are essentially identical.


>5hr battery life in laptops is mostly a function of how well idle is managed, I think. The less work you can do while running the user's core program, the better. I'm not sure how much impact CPU efficiency really has in that case.

If you are running a remotely demanding program (say, a game), your battery life will be bad no matter what (i.e. <4hrs) unless you choose a very low TDP that performs badly always.

A laptop at idle should be able to manage ~5W power consumption regardless of AMD/Intel/Apple processor, but it's largely on the OS to achieve that.


I have a 365 AMD laptop.

The battery is great if you're doing very light stuff; Call of Duty takes its battery down to 3 hours.

Macs don't really support higher end games, so I can't directly compare to my M1 Air.


How does “great” translate to hours?


This is really tricky.

The OEMs will use every trick possible and do something like open Gmail to claim 10 hours, but given my typical use I average 5 to 6. I make music using a software called Maschine.

It's a massive step up over my old (still working, just very heavy) Lenovo Legion 2020, which would last about 2 hours given the same usage.

This is all subjective at the end of the day. If none of your applications actually work since you're on ARM Windows, of course you'll have higher battery life.


The CPU core's instruction set has no influence on how well the chip as a whole manages power when not executing instructions.


That is fair. I was taught that decoders for x86 are less efficient and more power-hungry than those for RISC ISAs because of x86's variable-length instructions.

I remember being told (and it might be wrong) that ARM can decode multiple instructions in parallel because the CPU knows where the next instruction starts, but for x86, you'd have to decode the instructions in order.


That seems to not matter much nowadays. There's another great (according to my untrained eye) writeup on Chips and Cheese about the lack of importance:

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


The various mentioned power consumption amounts are 4-10% per-core, or 0.5-6% of package (with the caveat of running with micro-op cache off) for Zen 2, and 3-10% for Haswell. That's not massive, but is still far from what I'd consider insignificant; it could give leeway for an extra core or some improved ALUs; or, even, depending on the benchmark, is the difference between Zen 4 and Zen 5 (making the false assumption of a linear relation between power and performance, at least), which'd essentially be a "free" generational improvement. Of course the reality is gonna be more modest than that, but it's not nothing.


You missed the part where they mention ARM ends up implementing the same thing to go fast.

The point is processors are either slow and efficient, or fast and inefficient. It's just a tradeoff along the curve.


ARM doesn't need the variable-length instruction decoding though, which on x86 essentially means that the decoder has to attempt to decode at every single byte offset for the start of the pipeline, wasting computation.

Indeed pretty much any architecture can benefit from some form of op cache, but less of a need for it means its size can be reduced (and savings spent in more useful ways), and you'll still need actual decoding at some point anyway (and, depending on the code footprint, may need it a lot).

More generally, throwing silicon at a problem is, quite obviously, a more expensive solution than not having the problem in the first place.


x86 processors simply run an instruction length predictor the same way they do branch prediction. That turns the problem into something that can be tuned. Instead of having to decode the instruction at every byte offset, you can simply decide to optimize for the 99% case with a slow path for rare combinations.
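
Roughly this idea, as a toy Python sketch (purely illustrative; predicted_len and actual_len are hypothetical helpers standing in for the predictor table and the full slow decode, and real predictors key off fetch addresses and history):

    def decode_window(start, size, predicted_len, actual_len):
        # 1. guess the instruction boundaries for the whole fetch window
        starts, pc = [], start
        while pc < start + size:
            starts.append(pc)
            pc += predicted_len(pc)
        # 2. decode at all guessed offsets "in parallel", then check the guesses
        if all(actual_len(pc) == predicted_len(pc) for pc in starts):
            return starts          # common case: predictions held, fast path
        return None                # rare case: replay through the slow path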


That's still silicon spent on a problem that can be architecturally avoided.


But bigger fixed-length instructions mean more I$ pressure, right?


RISC doesn't imply wasted instruction space; RISC-V has a particularly interesting thing for this - with the compressed ('c') extension you get 16-bit instructions (which you can determine by just checking two bits), but without it you can still save 6% of icache silicon via only storing 30 bits per instruction, the remaining two being always-1 for non-compressed instructions.

Also, x86 isn't even that efficient in its variable-length instructions - some half of them contain the byte 0x0F, representing an "oh no, we're low on single-byte instructions, prefix new things with 0F". On top of that, general-purpose instructions on 64-bit registers have a prefix byte with 4 fixed bits. The VEX prefix (all AVX1/2 instructions) has 7 fixed bits. EVEX (all AVX-512 instructions) is a full fixed byte.
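
For reference, the RISC-V length check really is just the low two bits; a sketch (ignoring the reserved >32-bit encodings):

    def rv_instr_len(halfword):
        # low two bits == 0b11 -> 32-bit instruction; anything else is a
        # 16-bit compressed ('c') instruction. (Longer >32-bit encodings
        # are still reserved and ignored here.)
        return 4 if (halfword & 0b11) == 0b11 else 2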


https://oscarlab.github.io/papers/instrpop-systor19.pdf

ARM64 instructions are 4 bytes. x86 instructions in real-world code average 4.25 bytes. ARM64 gets closer to x86 code size as it adds new instructions to replace common instruction sequences.

RISC-V has 2-byte and 4-byte instructions and averages very close to 3-bytes. Despite this, the original compressed code was only around 15% more dense than x86. The addition of the B (bitwise) extensions and Zcb have increased that advantage by quite a lot. As other extensions get added, I'd expect to see this lead increase over time.


x86-64 wastes enough of its address space that arm64 is typically smaller in practice. The RISC-V folks pointed this out a decade ago - geomean across their SPEC suite, x86 is 7.3% larger binary size than arm64.

https://people.eecs.berkeley.edu/%7Ekrste/papers/EECS-2016-1...

So there’s another small factor leaning against x86 - inferior code density means they get less out of their icache than ARM64 due to their ISA design (legacy cruft). And ARM64 often has larger icaches anyway - M1 is 6x the icache of zen4 iirc, and they get more out of it with better code density.

<uno-reverse-card.png>


That stuff is WAY out-of-date and was flatly wrong when it was published.

A715 cut decoder size a whopping 75% by dropping the more CISC 32-bit stuff and completely eliminated the uop cache too. Losing all that decode, cache, and cache controllers means a big reduction in power consumption (decoders are basically always on). All of ARM's latest CPU designs have eliminated uop cache for this same reason.

At the time of publication, we already knew that M1 (already out for nearly a year) was the highest IPC chip ever made and did not use a uop cache.


Clam makes some serious technical mistakes in that article and some info is outdated.

1. His claim that "ARM decoder is complex too" was wrong at the time (M1 being an obvious example) and has been proven more wrong since publication. ARM dropped the uop cache as soon as they dropped support for their very CISC-y 32-bit catastrophe. They bragged that this coincided with a whopping 75% reduction in decoder size for their A715 (while INCREASING from 4 decoders to 5), and this was almost single-handedly responsible for the reduced power consumption of that chip (as all the other changes were comparatively minor). NONE of the current-gen cores from ARM, Apple, or Qualcomm use a uop cache, eliminating these power-hungry caches and cache controllers.

2. The paper[0] he quotes has a stupid conclusion. They show integer workloads using a massive 22% of total core power on the decoder and even their fake float workload showed 8% of total core power. Realize that a study[1] of the entire Ubuntu package repo showed that just 12 int/ALU instructions made up 89% of all code with float/SIMD being in the very low single digits of use.

3. x86 decoder situation has gotten worse. Because adding extra decoders is exponentially complex, they decided to spend massive amounts of transistors on multiple decoder blocks working on various speculated branches. Setting aside that this penalizes unrolled code (where they may have just 3-4 decoders while modern ARM will have 10+ decoders), the setup for this is incredibly complex and man-year intensive.

4. "ARM decodes into uops too" is a false equivalency. The uops used by ARM are extremely close to the original instructions as shown by them being able to easily eliminate the uop cache. x86 has a much harder job here mapping a small set of instructions onto a large set.

5. "ARM is bloated too". ARM redid their entire ISA to eliminate bloat. If ISA didn't actually matter, why would they do this?

6. "RISC-V will become bloated too" is an appeal to ignorance. x86 has SEVENTEEN major SIMD extensions excluding the dozen or so AVX-512 extensions all with various incompatibilities and issues. This is because nobody knew what SIMD should look like. We know now and RISC-V won't be making that mistake. x86 has useless stuff like BCD instructions using up precious small instruction space because they didn't know. RISC-V won't do this either. With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

7. Omitting complexity. A bloated, ancient codebase takes forever to do anything with. A bloated, ancient ISA takes forever to do anything with. If ARM and Intel both put X dollars into a new CPU design, Intel is going to spend 20-30% or maybe even more of their budget on devs chasing edge cases and testers testing all those edge cases. Meanwhile, ARM is going to spend that 20-30% of their budget on increasing performance. All other things equal, the ARM chip will be better at any given design price point.

8. Compilers matter. Spitting out fast x86 code is incredibly hard because there are so many variations on how to do things each with their own tradeoffs (that conflate in weird ways with the tradeoffs of nearby instructions). We do peephole heuristic optimizations because provably fast would take centuries. RISC-V and ARM both make it far easier for compiler writers because there's usually just one option rather than many options and that one option is going to be fast.

[0] https://www.usenix.org/system/files/conference/cooldc16/cool...

[1] https://oscarlab.github.io/papers/instrpop-systor19.pdf


One more: there's more to an ISA than just the instructions; there's semantic differences as well. x86 dates to a time before out-of-order execution, caches, and multi-core systems, so it has an extremely strict memory model that does not reflect modern hardware -- the only memory-reordering optimization permitted by the ISA is store buffering.

Modern x86 processors will actually perform speculative weak memory accesses in order to try to work around this memory model, flushing the pipeline if it turns out a memory-ordering guarantee was violated in a way that became visible to another core -- but this has complexity and performance impacts, especially when applications make heavy use of atomic operations and/or communication between threads.

Simple atomic operations can be an order of magnitude faster on ARMv8 vs x86: https://web.archive.org/web/20220129144454/https://twitter.c...


"the only memory-reordering optimization permitted by the ISA is store buffering."

I think this is a mischaracterization of TSO. TSO only dictates the store ordering to other entities in the system, the individual cores are fully capable of using the results of stores that are not yet visible for their own OoO purposes as long as the dataflow dependencies are correctly solved. The complexities of the read/write bypassing is simply to clarify correct program order.

And this is why the TSO/non TSO mode on something like the apple cores doesn't seem to make a huge difference, particularly if one assumes that the core is aggressively optimized for the arm memory model, and the TSO buffering/ordering is not a critical optimization point.

Put another way, a core designed to track store ordering utilizing some kind of writeback merging is going to be fully capable of executing just as aggressively OoO and holding back or buffering the visibility of completed stores until earlier stores complete. In fact for multithreaded lock-free code the lack of explicit write fencing is likely a performance gain for very carefully optimized code in most cases. A core which can pipeline and execute multiple outstanding store fences is going to look very similar to one that implements TSO.


Yes, and Apple added this memory model to their ARM implementation so Rosetta2 would work well.


Some notes:

3: I don't think more decoders should be exponentially more complex, or even polynomial; I think O(n log n) should suffice. It just has a hilarious constant factor due to the lookup tables and logic needed, and that log factor also impacts the critical path length, i.e. pipeline length, i.e. mispredict penalty. Of note is that x86's variable-length instructions aren't even particularly good at code size.

Golden Cove (~1y after M1) has 6-wide decode, which is probably reasonably near M1's 8-wide given x86's complex instructions (mainly free single-use loads). [EDIT: actually, no, chipsandcheese's diagram shows it only moving 6 micro-ops per cycle to reorder buffer, even out of the micro-op cache. Despite having 8/cycle retire. Weird.]

6: The count of extensions is a very bad way to measure things; RISC-V will beat everything in that in no time, if not already. The main things that matter are ≤SSE4.2 (uses same instruction encoding as scalar code); AVX1/2 (VEX prefix); and AVX-512 (EVEX). The actual instruction opcodes are shared across those. But three encoding modes (plus the three different lengths of the legacy encoding) is still bad (and APX adds another two onto this) and the SSE-to-AVX transition thing is sad.

RISC-V already has two completely separate solutions for SIMD - v (aka RVV, i.e. the interesting scalable one) and p (a simpler thing that works in GPRs; largely not being worked on but there's still some activity). And if one wants to count extensions, there are already a dozen for RVV (never mind its embedded subsets) - Zvfh, Zvfhmin, Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned, Zvknhb, Zvknha, Zvksed, Zvksh; though, granted, those work better together than, say, SSE and AVX (but on x86 there's no reason to mix them anyway).

And RVV might get multiple instruction encoding forms too - the current 32-bit one is forced into allowing using only one register for masking due to lack of encoding space, and a potential 48-bit and/or 64-bit instruction encoding extension has been discussed quite a bit.

8: RISC-V RVV can be pretty problematic for some things if compiling without a specific target architecture, as the scalability means that different implementations can have good reason to have wildly different relative instruction performance (perhaps most significant being in-register gather (aka shuffle) vs arithmetic vs indexed load from memory).


3. You can look up the papers released in the late 90s on the topic. If it was O(n log n), going bigger than 4 full decoders would be pretty easy.

6. Not all of those SIMD sets are compatible with each other. Some (eg, SSE4a) wound up casualties of the Intel v AMD war. It's so bad that the Intel AVX10 proposal is mostly about trying to unify their latest stuff into something more cohesive. If you try to code this stuff by hand, it's an absolute mess.

The P proposal is basically DOA. It could happen, but nobody's interested at this point. Just like the B proposal subsumed a bunch of ridiculously small extensions, I expect a new V proposal to simply unify these. As you point out, there isn't really any conflict between these tiny instruction releases.

There is discussion around the 48-bit format (the bits have been reserved for years now), but there are a couple different proposals (personally, I think 64-bit only with the ability to put multiple instructions inside is better, but that's another topic). Most likely, a 48-bit format does NOT do multiple encoding, but instead does a superset of encodings (just like how every 16-bit instruction expands into a 32-bit instruction). They need/want 48-bits to allow 4-address instructions too, so I'd imagine it's coming sooner or later.

Either way, the length encoding is easy to work with compared to x86 where you must check half the bits in half the bytes before you can be sure about how long your instruction really is.

8. There could be some variance, but x86 has this issue too and SO many more besides.


The trend seems to be going towards multiple decoder complexes. Recent designs from AMD and Intel do this.

It makes sense to me: if the distance between branches is small, a 10-wide decode may be wasted anyway. Better to decode multiple basic blocks in parallel.


I know the E-cores (gracemont, crestmont, skymont) have the multi-decoder setup; the first couple search results don't show Golden Cove being the same. Do you have some reference for that?

6. Ah yeah the funky SSE4a thing. RISC-V has its own similar but worse thing with RVV0.7.1 / xtheadvector already though, and it can be basically guaranteed that there will be tons of one-off vendor extensions, including vector ones, given that anyone can make such.

8. RVV's vrgather is extremely bad at this, but is very important for a bunch of non-trivial things; existing RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes 256 cycles for LMUL=8[1]. But some hypothetical future hardware could do it at O(LMUL) for non-worst-case indices, thus massively changing tradeoffs. So far the compiler approaches are to just not do high LMUL when vrgather is needed (potentially leaving free perf on the table), or using indexed loads (potentially significantly worse).

Whereas x86 and ARM SIMD perf variance is very tiny; basically everything is pretty proportional everywhere, with maybe the exception of very old atom cores. There'll be some differences of 2x up or down of throughput of instruction classes, but it's generally not so bad as to make way for alternative approaches to be better.

[1]: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.h...


I think you may be correct about Gracemont vs Golden Cove. Rumors/insiders say that Intel has supposedly decided to kill off either the P or E-core team, so I'd guess that the P-core team is getting laid off because the E-core IPC is basically the same, but the E-core is massively more efficient. Even if the P-core wins, I'd expect them to adopt the 3x3 decoder just as AMD adopted a 2x4 decoder for Zen 5.

Using a non-frozen spec is at your own risk. There's nothing comparable to stuff like SSE4a or FMA4. The custom extension issue is vastly overstated. Anybody can make extensions, but nobody will use unratified extensions unless you are in a very niche industry. The P extension is a good example here. The current proposal is a copy/paste of a proprietary extension a company is using. There may be people in their niche using their extension, but I don't see people jumping to add support anywhere (outside their own engineers).

There's a LOT to unpack about RVV. Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.

Differing performance across different performance levels is to be expected when RVV must scale from tiny DSPs up to supercomputers. As you point out, old atom cores (about the same as the Spacemit CPU) would have a different performance profile from a larger core. Even larger AMD cores have different performance characteristics with their tendency to like double-pumping AVX2/512 instructions (but not all of them -- just some).

In any case, it's a matter of the wrong configuration, unlike x86 where it is a matter of the wrong instruction (and the wrong configuration at times). It seems obvious to me that the compiler will ultimately need to generate a handful of different code variants (shouldn't be a code bloat issue because only a tiny fraction of all code is SIMD), then dynamically choose the best variant for the processor at runtime.


> Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.

Packed SIMD not having LMUL means that hardware can't rely on it being used for high performance; whereas some of the theadvector hardware (which could equally apply to rvv1.0) already had VLEN=128 with 256-bit ALUs, thus having LMUL=2 have twice the throughput of LMUL=1. And even above LMUL=2 various benchmarks have shown improvements.

Having a compiler output multiple versions is an interesting idea. Pretty sure it won't happen though; it'd be a rather difficult political mess of more and more "please add special-casing of my hardware", and would have the problem of it ceasing to reasonably function on hardware released after being compiled (unless like glibc or something gets some standard set of hardware performance properties that can be updated independently of precompiled software, which'd be extra hard to get through). Also P-cores vs E-cores would add an extra layer of mess. There might be some simpler version of just going by VLEN, which is always constant, but I don't see much use in that really.


> it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction

+1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

I agree with your point that multiple code variants + runtime dispatch are helpful. We do this with Highway in particular for x86. Users only write code once with portable intrinsics, and the mess of instruction selection is taken care of.


> +1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

What others would you want? Something like vzip1/2 would make sense, but that isn't much of a permutation, since the input elements are exactly next to the output elements.


Going through Highway's set of shuffle ops:

64-bit OddEven/Reverse2/ConcatOdd/ConcatEven, OddEvenBlocks, SwapAdjacentBlocks, 8-bit Reverse, CombineShiftRightBytes, TableLookupBytesOr0 (=PSHUFB) and Broadcast especially for 8-bit, TwoTablesLookupLanes, InsertBlock, InterleaveLower/InterleaveUpper (=vzip1/2).

All of these are considerably more expensive on RVV. SVE has a nice set, despite also being VL-agnostic.


More RVV questionable optimization cases:

- broadcasting a loaded value: a stride-0 load can be used for this, and could be faster than going through a GPR load & vmv.v.x, but could also be much slower.

- reversing: could use vrgather (could do high LMUL everywhere and split into multiple LMUL=1 vrgathers), could use a stride -1 load or store.

- early-exit loops: It's feasible to vectorize such, even with loads via fault-only-first. But if vl=vlmax is used for it, it might end up doing a ton of unnecessary computation, esp. on high-VLEN hardware. Though there's the "fun" solution of hardware intentionally lowering vl on fault-only-first to what it considers reasonable, as there aren't strict requirements for it.


Expanding on 3: I think it ends up at O(n^2 * log n) transistors, O(log n) critical path (not sure on routing or what fan-out issues there might be).

Basically: determine end of instruction at each byte (trivial but expensive). Determine end of two instructions at each byte via end2[i]=end[end[i]]. Then end4[i]=end2[end2[i]], etc, log times.

That's essentially log(n) shuffles. With 32-byte/cycle decode that's roughly five 'vpermb ymm's, which is rather expensive (though various forms of shortcuts should exist - for the larger layers direct chasing is probably feasible, and for the smaller ones some special-casing of single-byte instructions could work).

And, actually, given the mention of O(log n)-transistor shuffles at http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo..., it might even just be O(n * log^2(n)) transistors.

Importantly, x86 itself plays no part in the non-trivial part. It applies equally to the RISC-V compressed extension, just with a smaller constant.
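
A toy Python version of the doubling pass, just to make the idea concrete (hardware does this with shuffles/muxes, not loops; length_at[i] is the length of the instruction that would start at byte i):

    def instruction_starts(length_at, window):
        # Which byte offsets in [0, window) actually begin an instruction,
        # assuming decode starts at offset 0?  ~log2(window) doubling rounds.
        jump = [min(i + length_at[i], window) for i in range(window)] + [window]
        starts, k = {0}, 1
        while k < window:
            # every known start implies another start one "jump" later
            starts |= {jump[s] for s in starts if jump[s] < window}
            # square the jump table: a hop now skips twice as many instructions
            jump = [jump[j] for j in jump]
            k *= 2
        return sorted(starts)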


Determining the end of a RISC-V instruction requires checking two bits and you have the knowledge that no instruction exceeds 4 bytes or uses less than 2 bytes.

x86 requires checking for a REX, REX2, VEX, EVEX, etc prefix. Then you must check for either 1 or 2 instruction bytes. Then you must check for the existence of a register byte, how many immediate byte(s), and if you use a scaled index byte. Then if a register byte exists, you must check it for any displacement bytes to get your final instruction length total.

RISC-V starts with a small complexity then multiplies it by a small amount. x86 starts with a high complexity then multiplies it by a big amount. The real world difference here is large.

As I pointed out elsewhere ARM's A715 dropped support for aarch32 (which is still far easier to decode than x86) and cut decoder size by 75% while increasing raw decoder count by 20%. The decoder penalties of bad ISA design extend beyond finding instruction boundaries.


I don't disagree that the real-world difference is massive; that much is pretty clear. I'm just pointing out that, as far as I can tell, it's all just a question of a constant factor, it's just massive. I've written half of a basic x86 decoder in regular imperative code, handling just the baseline general-purpose legacy encoding instructions (determines length correctly, and determines opcode & operand values to some extent), and that was already much.


> With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

RVV does have significant departures from prior work, and some of them are difficult to understand:

- the whole concept of avl, which adds complexity in many areas including reg renaming. From where I sit, we could just use masks instead.

- mask bits reside in the lower bits of a vector, so we either require tons of lane-crossing wires or some kind of caching.

- global state LMUL/SEW makes things hard for compilers and OoO.

- LMUL is cool but I imagine it's not fun to implement reductions, and vrgather.


How does avl affect register renaming? (there's the edge-case of vl=0 that is horrifically stupid (which is by itself a mistake for which I have seen no justification but whatever) but that's probably not what you're thinking of?) Agnostic mode makes it pretty simple for hardware to do whatever it wants.

Over masks it has the benefit of allowing simple hardware short-circuiting, though I'd imagine it'd be cheap enough to 'or' together mask bit groups to short-circuit on (and would also have the benefit of better masked throughput)

Cray-1 (1976) had VL, though, granted, that's a pretty long span of no-VL until RVV.


Was thinking of a shorter avl producing partial results merged into another reg. Something like a += b; a[0] += c[0]. Without avl we'd just have a write-after-write, but with it, we now have an additional input, and whether this happens depends on global state (VL).

Espasa discusses this around 6:45 of https://www.youtube.com/watch?v=WzID6kk8RNs.

Agree agnostic would help, but the machine also has to handle SW asking for mask/tail unchanged, right?


> Agree agnostic would help, but the machine also has to handle SW asking for mask/tail unchanged, right?

Yes, but it should rarely do so.

The problem is that because of the vl=0 case you always have a dependency on avl. I think the motivation for the vl=0 case was that any serious OoO implementation will need to predict vl/vtype anyway, so there might as well be this nice-to-have feature.

IMO they should've only supported ta,mu. I think the only use case for ma is when you need to avoid exceptions. And while tu is useful, e.g. when summing an array, it could be handled differently: e.g. once vl<vlmax you write the sum to a different vector and do two reductions (or rather two different vectors, given the avl-to-vl rules).


What's the "nice to have feature" of vl=0 not modifying registers? I can't see any benefit from it. If anything, it's worse, due to the problems on reduce and vmv.s.x.


"nice to hace" because it removes the need for a branch for the n=0 case, for regular loops you probably still want it, but there are siturations were not needing to worry about vl=0 corrupting your data is somewhat nice.


Huh, in what situation would vl=0 clobbering registers be undesirable while on vl≥1 it's fine?

If hardware will be predicting vl, I'd imagine that would break down anyway. Potentially catastrophically so if hardware always chooses to predict vl=0 doesn't happen.


> Agree agnostic would help, but the machine also has to handle SW asking for mask/tail unchanged, right?

The agnosticness flags can be forwarded at decode-time (at the cost of the non-immediate-vtype vsetvl being very slow), so for most purposes it could be as fast as if it were a bit inside the vector instruction itself. Doesn't help vl=0 though.


Some notes: 1. Consider that M1's 8-wide decoder hasn't hit the 5+ GHz clock speeds that Intel Golden Cove's decoder can. More complex logic with more delays is harder to clock up. Of course M1 may be held back by another critical path, but it's interesting that no one has managed to get an 8-wide Arm decoder running at the clock speeds that Zen 3/4 and Golden Cove can.

A715's slides say the L1 icache gains uop cache features including caching fusion cases. Likely it's a predecode scheme much like AMD K10, just more aggressive with what's in the predecode stage. Arm has been doing predecode (moving some stages to the L1i fill path rather than the hotter L1i hit path) to mitigate decode costs for a long time. Mitigating decode costs again with a uop cache never made much sense especially considering their low clock speeds. Picking one solution or the other is a good move, as Intel/AMD have done. Arm picked predecode for A715.

2. The paper does not say 22% of core power is in the decoders. It does say core power is ~22% of package power. Wrong figure? Also, can you determine if the decoder power situation is different on Arm cores? I haven't seen any studies on that.

3. Multiple decoder blocks don't penalize unrolled code once the load balancing is done right, which Gracemont did. And you have to massively unroll a loop to screw up Tremont anyway. Conversely, decode blocks may lose less throughput with branchy code. Consider that decode slots after a taken branch are wasted, and clustered decode gets around that. Intel stated they preferred 3x3 over 2x4 for that reason.

4. "uops used by ARM are extremely close to the original instructions" It's the same on x86, micro-op count is nearly equal to instruction count. It's helpful to gather data to substantiate your conclusions. For example, on Zen 4 and libx264 video encoding, there's ~4.7% more micro-ops than instructions. Neoverse V2 retires ~19.3% more micro-ops than instructions in the same workload. Ofc it varies by workload. It's even possible to get negative micro-op expansion on both architectures if you hit branch fusion cases enough.

8. You also have to tell your ARM compiler which of the dozen or so ISA extension levels you want to target (see https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#inde...). It's not one option by any means. Not sure what you mean by "peephole heuristic optimizations", but people certainly micro-optimize for both arm and x86. For arm, see https://github.com/dotnet/runtime/pull/106191/files as an example. Of course optimizations will vary for different ISAs and microarchitectures. x86 is more widely used in performance critical applications and so there's been more research on optimizing for x86 architectures, but that doesn't mean Arm's cores won't benefit from similar optimization attention should they be pressed into a performance critical role.


> Not sure what you mean by "peephole heuristic optimizations"

Post-emit or within-emit stage optimization where a sequence of instructions is replaced with a more efficient shorter variant.

Think replacing pairs of ldr and str with ldp and stp, turning an ldr followed by an address increment into an ldr with post-index addressing, or replacing an address calculation before an atomic load with an atomic load that takes an addressing mode (I think it was in ARMv8.3-a?).
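
A toy version of one such rewrite, just to make the mechanics concrete (the Insn struct, opcode names and offsets are all made up for illustration):

  #include <stdio.h>

  typedef enum { LDR, STR, LDP, ADD } Op;
  typedef struct { Op op; int rd, rd2, base, off; } Insn;

  /* Merge two adjacent 32-bit loads from consecutive offsets off the same
     base register into one "load pair"; returns the new instruction count. */
  static int peephole(Insn *code, int n) {
      int out = 0;
      for (int i = 0; i < n; i++) {
          if (i + 1 < n && code[i].op == LDR && code[i + 1].op == LDR &&
              code[i].base == code[i + 1].base &&
              code[i + 1].off == code[i].off + 4) {
              Insn pair = { LDP, code[i].rd, code[i + 1].rd,
                            code[i].base, code[i].off };
              code[out++] = pair;
              i++;                               /* skip the second ldr */
          } else {
              code[out++] = code[i];
          }
      }
      return out;
  }

  int main(void) {
      Insn prog[] = {
          { LDR, 0, -1, 2, 0 },                  /* ldr w0, [x2]     */
          { LDR, 1, -1, 2, 4 },                  /* ldr w1, [x2, #4] */
          { ADD, 0, -1, 0, 1 },                  /* add w0, w0, w1   */
      };
      printf("%d instructions after peephole\n", peephole(prog, 3));  /* prints 2 */
      return 0;
  }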

The "heuristic" here might be possibly related to additional analysis when doing such optimizations.

For example, the previously mentioned ldr, ldr -> ldp (or stp) optimization is not always a win. During work on .NET 9, there was a change[0] that improved load and store reordering to make it more likely that simple consecutive loads and stores are merged on ARM64. However, this change caused regressions in various hot paths because, for example, a previously matched ldr w0, [addr] -> ldr w1, [addr+4] -> modify w0 -> str w0, [addr] sequence got replaced with ldp w0, w1, [addr] -> modify w0 -> str w0, [addr].

Turns out this kind of merging defeated store forwarding on Firestorm (and newer) as well as other ARM cores. The regression was subsequently fixed[1], but I think the parent comment author may have had scenarios like these in mind.

[0]: https://github.com/dotnet/runtime/pull/92768

[1]: https://github.com/dotnet/runtime/pull/105695


1. Why would you WANT to hit 5+GHz when the downsides of exponential power take over? High clocks aren't a feature -- they are a cope.

AMD/Intel maintain both an I-cache and a uop cache kept in sync. Using a tiny bit of predecode is different from a massive uop cache working as far in advance as possible in the hopes that your loops will keep you busy enough that your tiny 4-wide decoder doesn't become overwhelmed.

2. The float workload was always BS because you can't run nothing but floats. The integer workload had 22.1w total core power and 4.8w power for the decoder. 4.8/22.1 is 21.7%. Even the 1.8w float case is 8% of total core power. The only other argument would be that the study is wrong and 4.8w isn't actually just decoder power.

3. We're talking about worst cases here. Nothing stops ARM cores from creating a "work pool" of upcoming branches in priority order for them to decode if they run out of stuff on the main branch. This is the best of both worlds where you can be faster on the main branch AND still do the same branchy code trick too.

4. This is the tail wagging the dog (and something else if your numbers are correct). Complex x86 instructions have garbage performance, so they are avoided by the compiler. The problem is that you can't GUARANTEE those instructions will NEVER be used, so the mere specter of them forces complex algorithms all over the place where ARM can do more simple things.

In any case, your numbers raise a VERY interesting question about x86 being RISC under the hood.

Consider this. Say that we have 1024 bytes of ARM code (256 instructions). x86 is around 15% smaller (871.25 bytes) and, with the longer 4.25 byte instruction average, x86 should have around 205 instructions. If ARM is generating 19.3% more uops than instructions, we have about 305 uops. x86 with just 4.7% more has 215 uops (the difference is way outside any margin of error here).
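
Spelling the arithmetic out:

  ARM insns : 1024 B / 4.00 B per insn         = 256
  x86 insns : 1024 B * 0.85 = ~871 B / 4.25 B  = ~205
  ARM uops  : 256 * 1.193                      = ~305
  x86 uops  : 205 * 1.047                      = ~215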

If both are doing the same work, x86 uops must be in the range of 30% more complex. Given the limits of what an ALU can accomplish, we can say with certainty that x86 uops are doing SOMETHING that isn't the RISC they claim to be doing. Perhaps one could claim that x86 is doing some more sophisticated instructions in hardware, but that's a claim that would need to be substantiated (I don't know what ISA instructions you have that give a 15% advantage being done in hardware, but aren't already in the ARM ISA and I don't see ARM refusing to add circuitry for current instructions to the ALU if it could reduce uops by 15% either).

8. https://en.wikipedia.org/wiki/Peephole_optimization

The final optimization stage is basically heuristic find & replace. There could in theory be a mathematically provable "best instruction selection", but finding it would require trying every possible combination, which isn't feasible unless P=NP.

My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though. Instead, you take your basic, short instruction and pad it with useless bytes. Add a couple useless bytes to a bunch of instructions and you now have the right length to push the function over to the cache boundary without adding any NOPs.
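
For what it's worth, the alignment itself is just a request to the toolchain; a minimal sketch of how it's expressed (GCC/Clang syntax, the function name is made up):

  /* Ask for the function to start on a cache line boundary; the gap before
     it gets filled with padding (NOPs, or on x86 lengthened encodings as
     described above).  Roughly the same effect per build: -falign-functions=64. */
  __attribute__((aligned(64)))
  void hot_function(void) {
      /* ... */
  }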

But the issues go deeper. When do you use a REX prefix? You may want it so you can use 16 registers, but it also increases code size. REX2 with APX is going to increase this issue further where you must juggle when to use 8, 16, or 32 registers and when you should prefer the long REX2 because it has 3-register instructions. All kinds of weird tradeoffs exist throughout the system. Because the compilers optimize for the CPU and the CPU optimizes for the compiler, you can wind up in very weird places.

In an ISA like ARM, there isn't any code density weirdness to consider. In fact, there's very little weirdness at all. Write it the intuitive way and you're pretty much guaranteed to get good performance. Total time to work on the compiler is a zero-sum game given the limited number of experts. If you have to deal with these kinds of heuristic headaches, there's something else you can't be working on.


> My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though.

I'd call that more neat than absurd.

> You may want it so you can use 16 registers, but it also increases code size.

RISC-V has the exact same issue, with some compressed instructions having only 3 bits for operand registers. And on x86 you always need the REX prefix for 64-bit-operand instructions anyway. And it's not that hard to solve reasonably well - just assign registers by their use count.
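
A minimal sketch of that use-count heuristic (toy code, all numbers made up; "cheap" stands for the 8 RVC-addressable registers, or the non-REX registers on x86-64):

  #include <stdio.h>
  #include <stdlib.h>

  #define NCHEAP 8                    /* registers with the short encoding */

  struct vreg { int id; int uses; };

  static int by_uses_desc(const void *a, const void *b) {
      return ((const struct vreg *)b)->uses - ((const struct vreg *)a)->uses;
  }

  int main(void) {
      struct vreg v[] = { {0, 12}, {1, 3}, {2, 40}, {3, 7}, {4, 25},
                          {5, 1}, {6, 18}, {7, 9}, {8, 2}, {9, 30} };
      int n = sizeof v / sizeof v[0];

      qsort(v, n, sizeof v[0], by_uses_desc);   /* most-used first */

      for (int i = 0; i < n; i++)
          printf("v%d (%d uses) -> %s register\n",
                 v[i].id, v[i].uses, i < NCHEAP ? "cheap" : "expensive");
      return 0;
  }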

Peephole optimizations specifically here are basically irrelevant. Much of the complexity for x86 comes from just register allocation around destructive operations (though, that said, that does have rather wide-ranging implications). Other than that, there's really not much difference; all have the same general problems of moving instructions together for fusing, reordering to reduce register pressure vs putting parallelizable instructions nearer, rotating loops to reduce branches, branches vs branchless.


RISC-V has a different version of this issue that is pretty straightforward. Preferring 2-register operations is already done to save register space. The only real extra is preferring the 8 registers the compressed (C) instructions can encode for arithmetic. After this, it's all just compression.

x86 has a multitude of other factors than just compression. This is especially true with standard vs REX instructions because most of the original 8 registers have specific purposes and instructions that depend on them (eg, accumulator instructions with the A register, mul/div using A+D, shifts using C, etc). It's a problem a lot harder than simple compression.

Just as cracking an alphanumeric password is exponentially harder than a same-length password with numbers only, solving for all the x86 complications and exceptions is also exponentially harder.


If anything, I'd say x86's fixed operands make register allocation easier! Don't have to register-allocate that which you can't. (ok, it might end up worse if you need some additional 'mov's. And in my experience more 'mov's is exactly what compilers often do.)

And, right, RISC-V even has the problem of being two-operand for some compressed instructions. So the same register allocation code that's gone towards x86 can still help RISC-V (and vice versa)! On RISC-V, failure means 2→4 bytes on a compressed instruction, and on x86 it means +3 bytes of a 'mov'. (granted, the additional REX prefix cost is separate on x86, while included in decompression on RISC-V)


With 16 registers, you can't just avoid a register because it has a special use. Instead, you must work to efficiently schedule around that special use.

Lack of special GPRs means you can rename with impunity (this will change slightly with the load/store pair extension). Having 31 truly general-purpose registers rather than 8 GPRs plus 8 special-purpose GPRs also gives a lot of freedom to compilers.


Function arguments and return values already are effectively special use, and should frequently be on par if not much more frequent than the couple x86 instructions with fixed registers.

Both clang and gcc support calls having differing used calling conventions within one function, which ends up effectively exactly identical to fixed-register instructions (i.e. an x86 'imul r64' can be done via a pseudo-function where the return values are in rdx & rax, an input is in rax, and everything else is non-volatile; and the dynamically-choosable input can be allocated separately). And '__asm__()' can do mixed fixed and non-fixed registers anyway.
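
E.g. a sketch of that kind of mixed constraint with GCC-style inline asm (fixed RAX/RDX for the widening multiply, while the second operand goes wherever the allocator likes; unsigned mulq here rather than imul, but the register-constraint pattern is the same):

  #include <stdint.h>

  static inline uint64_t mulhi_u64(uint64_t a, uint64_t b) {
      uint64_t lo, hi;
      __asm__("mulq %3"
              : "=a"(lo), "=d"(hi)   /* fixed outputs: RAX, RDX */
              : "a"(a), "r"(b)       /* RAX input; b in any register */
              : "cc");
      (void)lo;                      /* only the high half is returned */
      return hi;
  }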


Unlike x86, none of this is strictly necessary. As long as you put things back as expected, you may use all the registers however you like.


The option of not needing any fixed register usage would apply to, what, optimizing compilers without support for function calls (at least via passing arguments/results via registers)? That's a very tiny niche to use as an argument for having simplified compiler behavior.

And good register allocation is still pretty important on RISC-V - using more registers, besides leading to less compressed instruction usage, means more non-volatile register spilling/restoring in function prologue/epilogue, which on current compilers (esp. clang) happens at the start & end of functions, even in paths that don't need the registers.

That said, yes, RISC-V still indeed has much saner baseline behavior here and allows for simpler basic register allocation, but for non-trivial compilers the actual set of useful optimizations isn't that different.


Not just simpler basic allocation. There are fewer hazards to account for as well. The process on RISC-V should be shorter, faster, and with less risk that the chosen heuristics are bad in an edge case.


1. Performance. Also Arm implemented instruction cache coherency too.

Predecode/uop cache are both means to the same end, mitigating decode power. AMD and Intel have used both (though not on the same core). Arm has used both, including both on the same core for quite a few generations.

And a uop cache is just a cache. It's also big enough on current generations to cache more than just loops, to the point where it covers a majority of the instruction stream. Not sure where the misunderstanding of the uop cache "working as far in advance as possible" comes from. Unless you're talking about the BPU running ahead and prefetching into it? Which it does for L1i, and L2 as well?

2. "you can't run nothing but floats" they didn't do that in the paper, they did D += A[j] + B[j] ∗ C[j]. Something like matrix multiplication comes to mind, and that's not exactly a rare workload considering some ML stuff these days.

But also, has a study been done on Arm cores? For all we know they could spend similar power budgets on decode, or more. I could say an Arm core uses 99% of its power budget on decode, and be just as right as you are (they probably don't, my point is you don't have concrete data on both Arm and x86 decode power, which would be necessary for a productive discussion on the subject)

3. You're describing letting the BPU run ahead, which everyone has been doing for the past 15 years or so. Losing fetch bandwidth past a taken branch is a different thing.

4. Not sure where you're going. You started by suggesting Arm has less micro-op expansion than x86, and I provided a counterexample. Now you're talking about avoiding complex instructions, which a) compilers do on both architectures, they'll avoid stuff like division, and b) humans don't in cases where complex instructions are beneficial, see Linux kernel using rep movsb (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...), and Arm introducing similar complex instructions (https://community.arm.com/arm-community-blogs/b/architecture...)

Also "complex" x86 instructions aren't avoided in the video encoding workload. On x86 it takes ~16.5T instructions to finish the workload, and ~19.9T on Arm (and ~23.8T micro-ops on Neoverse V2). If "complex" means more work per instruction, then x86 used more complex instructions, right?

8. You can use a variable length NOP on x86, or multiple NOPs on Arm, to align function calls to cacheline boundaries. What's the difference? Isn't the latter worse if you need to move by more than 4 bytes, since you have multiple NOPs (and thus multiple uops, which you think is always the case but isn't, as some x86 and some Arm CPUs can fuse NOP pairs)?

But seriously, do try gathering some data to see if cacheline alignment matters. A lot of x86/Arm cores that do micro-op caching don't seem to care if a function (or branch target) is aligned to the start of a cacheline. Golden Cove's return predictor does appear to track targets at cacheline granularity, but that's a special case. Earlier Intel and pretty much all AMD cores don't seem to care, nor do the Arm ones I've tested.
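
If anyone wants to try, a rough sketch (assumes GCC/Clang on a POSIX system; build it twice, e.g. with -falign-functions=1 vs -falign-functions=64, and compare):

  #include <stdio.h>
  #include <time.h>

  /* Small noinline callee so the call/return path and its placement matter. */
  __attribute__((noinline)) static long bump(long x) { return x + 1; }

  int main(void) {
      struct timespec t0, t1;
      long acc = 0;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < 100000000L; i++)
          acc = bump(acc);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      printf("acc=%ld in %.3f s\n", acc, s);
      return 0;
  }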

Anyway, you're making a lot of unsubstantiated guesses on "weirdness" without anything to suggest it has any effect. I don't think this is the right approach. Instead of "tail wagging the dog" or whatever, I suggest a data-based approach where you conduct experiments on some x86/Arm CPUs, and analyze some x86/Arm programs. I guess the analogy is, tell the dog to do something and see how it behaves? Then draw conclusions off that?


1. The biggest chip market is laptops and getting 15% better performance for 80% more power (like we saw with X Elite recently) isn't worth doing outside the marketing win of a halo product (a big reason why almost everyone is using slower X Elite variants). The most profitable (per-chip) market is servers. They also prefer lower clocks and better perf/watt because even with the high chip costs, the energy will wind up costing them more over the chip's lifespan. There's also a real cost to adding extra pipeline stages. Tejas/Jayhawk cores are Intel's cancelled examples of this.

L1 cache is "free" in that you can fill it with simple data moves. uop cache requires actual work to decode and store elements for use in addition to moving the data. As to working ahead, you already covered this yourself. If you have a nearly 1-to-1 instruction-to-uop ratio, having just 4 decoders (eg, zen4) is a problem because you can execute a lot more than just 4 instructions on the backend. 6-wide Zen4 means you use 50% more instructions than you decode per clock. You make up for this in loops, but that means while you're executing your current loop, you must be maxing out the decoders to speculatively fill the rest of the uop cache before the loop finishes. If the loop finishes and you don't have the next bunch of instructions decoded, you have a multi-cycle delay coming down the pipeline.

2. I'd LOVE to see a similar study of current ARM chips, but I think the answer here is pretty simple to deduce. ARM's slide says "4x smaller decoders vs A710" despite adding a 5th decoder. They claim 20% reduction in power at the same performance and the biggest change is the decoder. As x86 decode is absolutely more complex than aarch32, we can only deduce that switching from x86 to aarch64 would be an even more massive reduction. If we assume an identical 75% reduction in decoder power, we'd move the Haswell decoder from 4.8W down to 1.2W, reducing total core power from 22.1W to 18.5W, a ~16% overall reduction in power. This isn't too far from the power numbers claimed by ARM.

4. This was a tangent. I was talking about uops rather than the ISA. Intel claims to be simple RISC internally just like ARM, but if Intel is using nearly 30% fewer uops to do the same work, their "RISC" backend is way more complex than they're admitting.

8. I believe aligning functions to cacheline boundaries is a default flag at higher optimization levels. I'm pretty sure that they did the analysis before enabling this by default. x86 NOP flexibility is superior to ARM (as is its ability to avoid them entirely), but the cause is the weirdness of the x86 ISA and I think it's an overall net negative.

Loads of x86 instructions are microcode only. Use one and it'll be thousands of cycles. They remain in microcode because nobody uses them, so why even try to optimize and they aren't used because they are dog slow. How would you collect data about this? Nothing will ever change unless someone pours in millions of dollars in man-hours into attempting to speed it up, but why would anyone want to do that?

Optimizing for a local maxima rather than a global maxima happens all over technology and it happens exactly because of the data-driven approach you are talking about. Look for the hot code and optimize it without regard that there may be a better architecture you could be using instead. Many successes relied on an intuitive hunch.

ISA history has a ton of examples. iAPX432 super-CISC, the RISC movement, branch delay slots, register windows, EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All of these were attempts to find new maxima with greater or lesser degrees of success. When you look into these, pretty much NONE of them had any real data to drive them because there wasn't any data until they'd actually started work.


1. Yeah I agree, both X Elite and many Intel/AMD chips clock well past their efficiency sweet spot at stock. There is a cost to extra pipeline stages, but no one is designing anything like Tejas/Jayhawk, or even earlier P4 variants these days. Also P4 had worse problems (like not being able to cancel bogus ops until retirement) than just a long pipeline.

Arm's predecoded L1i cache is not "free" and can't be filled with simple data moves. You need predecode logic to translate raw instruction bytes into an intermediate format. If Arm expanded predecode to handle fusion cases in A715, that predecode logic is likely more complex than in prior generations.

2. Size/area is different from power consumption. Also the decoder is far from the only change. The BTBs were changed from 2 to 3 level, and that can help efficiency (could make a smaller L2 BTB with similar latency, while a slower third level keeps capacity up). TLBs are bigger, probably reducing page walks. Remember page walks are memory accesses and the paper earlier showed data transfers count for a large percentage of dynamic power.

4. IMO no one is really RISC or CISC these days

8. Sure you can align the function or not. I don't think it matters except in rare corner cases on very old cores. Not sure why you think it's an overall net negative. "feeling weird" does not make for solid analysis.

Most x86 instructions are not microcode only. Again, check your data with performance counters. Microcoded instructions are in the extreme minority. Maybe microcoded instructions were more common in 1978 with the 8086, but a few things have changed between then and now. Also, microcoded instructions do not cost thousands of cycles; have you checked? E.g. a gather is ~22 micro-ops on Haswell per https://uops.info/table.html while Golden Cove does it in 5-7 uops.

ISA history has a lot of failed examples where people tried to lean on the ISA to simplify the core architecture. EPIC/VLIW, branch delay slots, and register windows have all died off. Mill is a dumb idea and never went anywhere. Everyone has converged on big OoO machines for a reason, even though doing OoO execution is really complex.

If you're interested in cases where ISA does matter, look at GPUs. VLIW had some success there (AMD Terascale, the HD 2xxx to 6xxx generations). Static instruction scheduling is used in Nvidia GPUs since Kepler. In CPUs ISA really doesn't matter unless you do something that actively makes an OoO implementation harder, like register windows or predication.


That was true when ARM was first released, but over the years the decoder for ARM has gotten more and more complicated. Who would have guessed adding more specialized instructions would result in more complicated decoders? ARM now uses multi-stage decoders, just the same as x86.


Sure, but it's not idle power consumption that's the difference between these.


When a laptop gets 12 hours or more of battery life that's because it's 90% idle.


And while it's important to design a chip that can enter a deep idle state, the thing that differentiates one Windows laptop from the next is how many mistakes the BIOS writers made and whether the platform drivers work correctly. This is also why you cannot really judge the expected battery life under Linux by reading reviews of laptops running Windows.


I didn’t watch this link, but my Zenbook S 16 only gets remotely close to my M2 MBA battery life if the zenbook is in whatever is Windows 11 ‘efficiency’ mode, and then it benchmarks at 50% of the M2.

I don’t think the two are remotely comparable in perf/watt.


Unlike AMD and Qualcomm, Apple uses an expensive TSMC 3nm process, so you would expect better battery life from the "MBP3". I assume they used the process improvements to increase performance instead.


Perf per watt is higher for M1 on N5 vs Zen5 on N4P, so the problems go deeper than just process.

X Elite also beats AMD/Intel in perf/watt while being on the same N4P node as HX370.

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...


Performance per watt also depends on clock speed. Other things equal, higher clock speed means worse performance per watt.


The display, RAM, and other peripherals are consuming power too. Short of running continuous high CPU loads, which most people don't do on laptops, changes in CPU efficiency have less apparent effect on battery life because it's only a fraction of overall power draw.


> within an hour of MBP3

Not a good way to measure. The Zenbook S16 has a larger 78Wh battery vs the MacBook Pro’s 69.6Wh.

So that’s 11% less battery life despite 12% more battery capacity.
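
Normalizing by capacity, using the review's runtimes:

  MacBook Pro (M3) : 12 h 35 m = 755 min / 69.6 Wh  ~ 10.8 min per Wh
  Zenbook S16 (AMD): 11 h 10 m = 670 min / 78 Wh    ~  8.6 min per Wh

That works out to roughly 26% more runtime per watt-hour for the MacBook.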


Yeah, if you make a worse core and then downclock it, you will increase power efficiency. AMD thankfully only downclocks the 5c, but Intel is shipping ivy lake equivalents in their flagship products just to get power efficiency up.


These are interesting results and make a strong case for AMD over Qualcomm, especially in battery life:

  ASUS S16 (Qualcomm):    13h 39m  
  Apple MacBook Pro (M3): 12h 35m  
  ASUS S16 (AMD):         11h 10m  
  ASUS S15 (Intel):        9h 15m  
Cinebench Multicore:

  ASUS S16 (Qualcomm):   1,138  
  ASUS S16 (AMD):          997  
  Apple MacBook Pro (M3):  716  
Cinebench Single core:

  Apple MacBook Pro (M3): 141  
  ASUS S16 (AMD):         113  
  ASUS S16 (Qualcomm):    108


They do make 10 TB+ SSDs:

https://www.allhdd.com/kioxia-kcmyxrug15t3-15.36tb-ssd

In fact, Kioxia's enterprise line goes up to 30 TB/drive.


You linked a 15.3TB drive which doesn't dispute his point. 10TB has never been a manufactured SSD size that I'm aware of. In the enterprise we've got approximately 960GB, 1.92TB, 3.84TB, 7.6TB, 15.3TB, 32TB.

10 is skipped entirely, unless you can provide an example of any reputable manufacturer producing one.

Consumer drives don't follow that exactly, but consumer SSDs also don't hit 10TB+.


Exactly. I didn't say 10TB was too big, I said it was an odd size. There's no easy way to get close to 10TB when using components that are sized by powers of two, plus or minus varying amounts of overprovisioning depending on market segment, and GB vs GiB differences. ~8TB SSDs are common in consumer and enterprise markets. 16TB drives that expose 15.36TB usable space are common in enterprise, and 12.8TB usable space from ~16TB raw flash isn't unheard of. 10TB usable space isn't theoretically impossible, but it simply wouldn't make sense.


> There's no easy way to get close to 10TB when using components that are sized by powers of two...

I expect you can get any usable capacity you like by reserving some subset of the flash for onboard spare/scratch space. I think I remember Anandtech, long ago, doing some benchmarking of the changing performance of some drives as they adjusted the size of this "housekeeping" section of the drive. No clue if it's adjustable on every drive, but it sure was on the ones they were testing.


Most drives don't have any special functionality for adjusting overprovisioning. You just don't touch a large chunk of the LBA space and you get more or less the same effect. Leaving part of the drive unpartitioned, or creating a partition but not putting a filesystem in it will accomplish that purpose.

Drive vendors can tweak this in firmware to make the drive appear to have lower accessible capacity (or higher, for fraudulent drives). But as I've said several times, doing so to make a 10TB product would not make sense. The drives that expose a 12.8TB usable capacity from 16TB of flash already have far more overprovisioning than almost anybody needs. Further reducing that to 10TB would be throwing away capacity for little or no performance gain and a useless improvement to write endurance. It's not a product any rational, non-fraudulent vendor would create, because there's no demand for such a strange configuration. The fact that it's theoretically possible to create such a product does not actually make a 10TB SSD less suspicious.

(Side note: you don't have to tell me about what Anandtech tested with SSDs. Been there, done that.)


> There's no easy way to get close to 10TB when using components that are sized by powers of two

5 x 2TB? 16Tbit (2TB) NAND flash is available these days.

This old Intel SSD is 10 x 16GB: https://www.storagereview.com/wp-content/uploads/2010/04/int...

SSDs aren't sized in powers of 2 anymore. Even the flash itself isn't due to things like spare area (and TLC flash is internally actually a multiple of 3 times a power of 2 size.)


Doesn't microsoft support eBPF on Windows?

https://github.com/microsoft/ebpf-for-windows


If you know the answer why are you asking the question?


No. Not in production yet. But that should solve this problem once it's available for any company that uses it (and I believe CrowdStrike is heavily involved with it).


I'm guessing H100 has 2x host energy overhead for connecting those GPUs? That might offset some of the perf/W benefits of nvidia's offering.


> There is no HSR happening.

https://hsr.ca.gov/

SF-LA high-speed rail


Why not link to the wikipedia page: https://en.wikipedia.org/wiki/California_High-Speed_Rail which states:

" Since its inception in 2008, the Group has issued 18 letters and members have testified before Legislative and Congressional committees 15 times. In reviewing past letters and testimony, a consistent theme emerges: 1) project costs, schedules, and ridership estimates are uncertain and subject to significant risk of deteriorating, a typical experience for mega-projects; 2) the project is underfunded, and its financing is unstable, raising costs and making effective management difficult if not impossible; 3) more legislative oversight is needed. This letter reinforces the message, but with a sense of urgency over the ever-higher stakes.[6] "

Also: "Per the 2023 Project Status Report, the authority indicated the Interim IOS will go into service before December 31, 2030, with a "risk factor" of three years (which runs until the end of 2033)."

In the period between 2008-2023, several other countries (not just China) have started and completed many HSR projects.


Why should I link the wikipedia page? The current link perfectly rebuts OP's claim.


Modern Android processors are pretty competitive with iPhone's silicon:

S23, 1.4k (Single core), 4.8k (Multi core), 79.3 (3d mark/GPU)

14 pro max, 1.8k (Single core), 5.3k (Multi core), 74 (3d mark/GPU)

Looks like gen2 is already on-par in multi-core and GPU perf. Single core still is 28% faster in iphone.

---

https://www.tomsguide.com/news/galaxy-s23-ultra-vs-iphone-14...


Yeah but the Vision Pro rocks an M2 chip, not the A-Series in the iPhone.


And it uses all that power to run... iPad apps.


If you are using a US SIM card, e.g., all your traffic is tunneled to the US. That could be the reason.


I wonder how hard it would be to replace FUSE with userspace block driver. That seems to be focused on performance.

https://lwn.net/Articles/903855/


Isn't that implementing a block device, so the level below the filesystem?


Yeah. That said, I've become disillusioned with FUSE over the course of writing this. FUSE is... not great, it's just what we're stuck with. Suggestions as to alternatives to FUSE would be welcome.


I'd expect bloatware to be unnecessary software. To a user who is not familiar with Google's offerings, the bloat doesn't exist; they don't see two different sets of apps. Samsung phones don't have two messaging apps, two galleries, or two phone apps.

Not including google's version doesn't equate to bloat. In fact, the opposite is true.


If I have to resort to dev mode to remove things I don't need, then to me it's bloat. Those things aren't there in vanilla Android.

I'm still using Samsung hardware and I like it, but it's got a little bloat goin on

