ARM or x86? ISA Doesn’t Matter (2021) (chipsandcheese.com)
222 points by NavinF on May 14, 2023 | hide | past | favorite | 199 comments


The x86 decoders consume a reasonable amount of power as-is; the trouble is making them wider without that consumption ballooning.

I have an AMD CPU. Zen CPUs come with a fairly wide backend, but the frontend is what it is (especially in early Zen), and without SMT it's essentially impossible to keep all those execution units fed. It's not that 8 x86 decoders wouldn't be a benefit; it's that more decoders aren't cheap in x86 cores. Each extra decoder is a serious cost.

If you compare with the big ARM cores, having a wide frontend is not a complex research problem or an impractical cost. 8 wide ARM decode is completely practical. You even have open source superscalar RISC-V cores just publicly available on Github running on FPGAs with 8 wide decode. Large frontends are (relatively) cheap and easy, if you're not x86.

So when we notice that the narrower x86 CPU's decode doesn't consume that much (a "drop in the ocean"), that's because it was designed narrower to keep the PPA reasonable! The reason I can't feed my Zen backend isn't because having a wide frontend is useless and I should just enable SMT anyways, it's because x86 makes wide decodes much less practical than competing architectures.


Acronym/initialism definer here. There are two in your comment that I wasn't familiar with, although I understood them once I saw them spelled out.

SMT is simultaneous multithreading, which may be more familiar under Intel's name of hyper-threading.

https://en.wikipedia.org/wiki/Simultaneous_multithreading

PPA is not the Professional Photographers of America, nor the Professional Pickleball Association, nor the Philadelphia Parking Authority, all of which came up at the top of my naïve search.

It's Power-Performance-Area.

https://en.wikichip.org/wiki/power-performance-area


Calling SMT Hyper-threading is like calling all x86 CPUs Pentiums.


That's the opposite of what I said. I used hyper-threading as an example of SMT.

A better analogy would be to say "Pentium is an example of an x86 CPU."


Hyper-threading is a misnomer imho. It should have been called "poor man's threading".


It's a marketing term, so clarity wasn't the goal, but it did make some sense in that it's not just threading in software, but enables the hardware to run multiple threads sometimes. Don't forget that it was primarily introduced in CPUs that were single-core.


Letting multiple threads share execution resources is a good strategy for enabling wider execution resources without underutilization if a thread hits a latency-bound section or a branch. It's wasteful to have resources sitting unutilized in one core while another thread could be running.

Like, you are totally welcome to turn off SMT on most AMD and Intel motherboards (including some workstations I've seen, which even have "one core only" modes if you wish!). But it's a performance benefit in most situations, and a good one relative to its cost (compared to twice as many cores/cache/etc.). It's simply higher PPA than a non-HT design in compute-limited scenarios.

I'd actually be curious to see it on Apple silicon too! They seem to have a very wide frontend/backend, and maybe it could do Bulldozer-style switching of the decoder between threads.

But in general there is indeed that sort of latency/QoS vs. throughput tradeoff. Server-oriented architectures go even further on resource sharing: the Sun Niagara series went to 8-way CMT across 8 cores (64 threads total). That's totally fine for some database workloads, and they specifically optimized their software to run well on it and utilize all the threads. And it's cheaper on licensing too, what a surprise!

POWER9 went to SMT8 on the performance cores as well. If you want to build a big fat core, it's hard to keep it utilized, and the inherent tradeoff is... just more threads.

Xeon Phi (Knights Landing) is an interesting precedent for this thesis too! It takes the Bulldozer idea even further: SMT4 plus AVX-512, and you get 54 of these P6Pro-tier SMT4 cores with AVX-512 bolted on. I know people view it as a descendant of Larrabee, but it's interesting as a wide-SMT parallel processor as well; it's broadly comparable to something like Niagara in some respects.


SMT is not just hyper-threading, it also includes having multiple CPUs, including sockets and cores.


No, that's SMP. SMT is specifically multiple instruction pointers (threads) on a single core, i.e. Hyper-threading.


(Warning: super pedantic)

Actually, what akvadrako described is just called multiprocessing - not SMP.

SMP refers specifically to systems with uniform latency to shared memory. So any distributed memory topology like ccNUMA w/ interconnect would not fit that definition.


Very good point, NUMA != SMP. I didn't catch that difference before.


"No, that's SMP" - is perhaps too resolute a definition. I'd say SMP is across processes, SMT is within a process, which would put the distinction in software, not hardware architecture.

Definitions are more grey, muddy and dependent on context; in that line of thought, I feel the parent you're responding to is accurate enough to not warrant criticism or even downvotes.


No, these things are really quite clearly defined, and certainly not so muddy to include something like "which would put the distinction in software, not hardware architecture.". These are hardware architecture terms.


It's a trade-off between problems. Variable length instructions are not as trivial to decode wide, so you need more cleverness here. However, fixed length instructions decrease code density, which asks more of the instruction cache. Note Zen4 has a 32 KB L1 instruction cache while the M1 has a 192 KB L1 instruction cache, requiring extra cleverness here instead to handle the higher latency and area. Meanwhile, micro-op caches hide both problems.

There are ripple effects to consider as well. The large L1 caches of M1 (320 KB total) put capacity pressure on L2, pushing towards larger sizes and/or away from an inclusive policy. See the 12MB shared L2. Meanwhile, the narrower decode of Zen4 puts pressure on things like branch prediction accuracy & mispredict recovery latency - if you predicted the wrong codepath, you can't catch up as quickly. See the large branch predictors on Zen4.


> Note Zen4 has a 32 KB L1 instruction cache while the M1 has a 192 KB L1 instruction cache, requiring extra cleverness here instead to handle the higher latency and area.

There's another factor here: to have a low latency, the L1 cache has to be indexed by the bits which don't change when translating from virtual addresses to physical addresses. That makes it harder to have a larger low-latency L1 cache when the native page size is 4KiB (AMD/Intel) instead of 16KiB (Apple M1/M2).

That is, most of the "cleverness" allowing for a larger L1 instruction cache is simply a larger page size.
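To make the page-size point concrete, here is the back-of-envelope arithmetic for the VIPT constraint the parent describes: if the index bits must come from the page offset, capacity is bounded by page size times associativity. A sketch, with the caveat that the 8-way and 12-way figures are illustrative assumptions rather than confirmed specs:

```python
# VIPT constraint (simplified): cache index bits must lie within the
# page offset, so capacity <= page_size * ways without aliasing tricks.
def max_vipt_capacity(page_size_bytes: int, ways: int) -> int:
    return page_size_bytes * ways

# With 4 KiB pages (x86), an 8-way L1 tops out at 32 KiB (Zen 4's L1I size).
print(max_vipt_capacity(4096, 8))    # 32768
# With 16 KiB pages, 12 ways already reach 192 KiB (M1's L1I size).
print(max_vipt_capacity(16384, 12))  # 196608
```

The same arithmetic shows why a 4 KiB-page design needs either more ways or extra cleverness (aliasing detection, prediction) to grow its L1 further.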


Pretty sure everyone uses a VIPT L1, what evidence is there that it's PIPT?

(also what do you believe happens with 4k pages on M1, since it does support those, or why did the A7 have larger caches than Zen4 a decade ago, which was well before 16k pages)


If the instructions are on average 3-4x less dense (i.e., take about that much more space), then the trade-off in associativity granularity and the corresponding increase in cache size are logical. The management logic would be around the same size, but the extra memory cells still carry their cost in silicon area, power/thermals, and signal propagation / layout.


Since its instructions have to be aligned, ARM would also have only 12 usable address bits per page.


x86 also uses 2-address instructions which means that you often need to use moves between registers (additional instructions), example: [1]. ARM uses 3-address instructions.

Also, x86 code is compact, but not as compact as in the era of the 8080 [2]: here addition and multiplication require 3 bytes each, 6 bytes total. To my surprise, ARM has an add-multiply instruction and it uses just 4 bytes (instead of 8) [3].

And RISC-V uses 6 bytes because of shortened instruction for addition [4]

Of course, this simple function cannot be a replacement for proper analysis, but it seems that x86 code is not significantly denser.

Also, to my great disappointment, none of those CPUs has checked overflow for arithmetic operations.

[1] https://godbolt.org/z/jsoccE5jv

[2] https://godbolt.org/z/jTMs1MEzh

[3] https://godbolt.org/z/nGb8qKcxe

[4] https://godbolt.org/z/x9c115crY
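For reference, the godbolt links presumably compile something like the following multiply-then-add shape (my reconstruction, not the exact source). A Python sketch of the computation, with comments on the instruction counts discussed above:

```python
# Hypothetical shape of the function behind the godbolt links:
# one multiplication feeding one addition.
def muladd(a: int, b: int, c: int) -> int:
    # AArch64 can encode this as a single 4-byte MADD instruction;
    # x86-64 needs separate imul + add (plus possibly a mov, since its
    # 2-address instructions overwrite one source operand).
    return a + b * c

print(muladd(2, 3, 4))  # 14
```

This is why the comment counts 4 bytes for ARM versus 6-8 for the other ISAs on this tiny function.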


> Of course, this simple function cannot be a replacement for proper analysis, but it seems that x86 code is not significantly denser.

Look at the demoscene for an example of dense x86 code, especially in the sub-1K categories. They routinely achieve code densities for x86 that no compilers I know of can get close to; AFAIK the same has not happened with ARM, MIPS, or any other well-known RISC.

3-address instructions improve code density only in situations where both source operands are needed later.
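A small sketch of that liveness point (variable names are mine, purely illustrative): the extra mov on a 2-address ISA only appears when both source operands must survive the operation.

```python
def two_address_cost(dest_reuses_source: bool) -> int:
    # Case 1: destination overwrites a source, e.g. b = b + c.
    # x86 "add b, c" suffices; the 2-address form costs nothing extra.
    # Case 2: both sources stay live, e.g. a = b + c with b used later.
    # x86 needs "mov a, b; add a, c" where a 3-address ISA encodes
    # a single "add a, b, c".
    return 1 if dest_reuses_source else 2

print(two_address_cost(True))   # 1 instruction
print(two_address_cost(False))  # 2 instructions
```

So 3-address encodings buy density exactly in the "both operands needed later" case, which is why the advantage is workload-dependent.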


They use a lot of the original 8087 FP instructions (very dense because it's basically a bytecode stack machine). Plus tricks like deriving constants from bytes in the code segment. And you can assume the contents of registers when you enter the code.

Pyrit, a ray tracing demo in 256 bytes, does all of that: https://www.pouet.net/prod.php?which=78045

You probably wouldn't want your general purpose compiler doing this sort of thing! The resulting code would be suboptimal and fragile.


What percentage of code that a CPU will run over its lifetime is demoscene code? Heck, even of just simple hand-optimized assembly a CPU is likely to encounter, what percentage is not vector code? Because x86 vector code typically averages more than 4 bytes per instruction, and I have a suspicion that at least five nines of scalar instructions a CPU executes were generated by a compiler.


I mention that to point out the code density limits of x86 are much higher than what measurements using compiler output will show, while on the other hand I haven't seen the same for ARM and suspect that one can't really get much better than compiler output for it or other RISCs.

Having had to patch binaries on multiple occasions by inserting instructions, it is definitely not hard to do so for x86 as one can easily find "slack" that the compiler left behind[1], but I once had to do it for a MIPS binary, and it was definitely not easy to squeeze in the few extra instructions I needed inline; I ended up having to detour to another area with jumps instead.

Here's an old paper where the authors tried to optimise for code density manually, and you can consistently see x86 beating ARM and MIPS:

https://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_den...

[1] See https://news.ycombinator.com/item?id=15720923 for an example.


Yeah, if code size is the only metric you care about. The second link is an excellent example of code you do not want a compiler to generate by default. Besides all the well-known performance pitfalls of microcoded instructions, jecxz is unfusable on (I think) all relevant CPUs, so it's both an additional uop and an additional cycle of latency over a tst/jz sequence.


Five nines is really high. I don't think this is true, probably because language runtimes have hot paths that are typically implemented by hand. If we drop "scalar" then of course you're dropping below even two nines because of the implementation of str* and mem*.


x86-64 is not even a particularly compact encoding even compared to contemporary fixed length encodings. The inherent advantage of variable length encoding is largely cancelled out by wasted encoding space for legacy cruft.

Aarch64 is roughly on par for encoding density.


> Also to my great disappoitment none of those CPUs has checked overflow for arithmetic operation.

Yes - we're stuck with the historically convenient collection of operations and types. Our CPUs are amazingly fast but very little smarter than decades ago.

Backwards-compatibility pretty much locks us into the same logical space. At least when people were designing new and amazing computers using actual wire (!) they felt free to implement what they thought would be useful, not just copy the existing model.


The only CPU I know of which had integer overflow detection was MIPS. RISC-V is somewhat based on MIPS, but they removed this part :-(


Indeed it is weird to see the trend in hardware to remove overflow checking when the trend in programming languages is to have more.

While MIPS can trap on overflow, most architectures can at least set flags for conditional branch. PowerPC OR's the overflow flag into another flag, so you can choose to check that flag for overflow first after a block of calculations.

I really want Mill to become successful. It has "NaR" (Not A Result): like NaN, but for integers. For every arithmetic instruction there is a variant that can produce a NaR. Like a NaN it is carried through the data flow, and it only traps when it reaches a store or conditional instruction.


6502/6510 had integer add overflow …


Modern CPUs are plenty fast at doing overflow checks in software.
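To illustrate what that software check amounts to (a sketch; on GCC/Clang the `__builtin_add_overflow` intrinsics compile to an add plus a branch on the flags, which is roughly free on a wide out-of-order core):

```python
INT_MAX = 2**31 - 1
INT_MIN = -2**31

def checked_add_i32(a: int, b: int):
    """Model of a 32-bit checked add: returns (wrapped_result, overflowed),
    mirroring what an add + jo/seto sequence gives you on x86."""
    s = a + b
    overflowed = not (INT_MIN <= s <= INT_MAX)
    wrapped = ((s - INT_MIN) % 2**32) + INT_MIN  # two's-complement wrap
    return wrapped, overflowed

print(checked_add_i32(INT_MAX, 1))  # (-2147483648, True)
print(checked_add_i32(1, 2))        # (3, False)
```

The point stands: the check is a couple of cheap, well-predicted instructions, which is why hardware trapping adds has been an easy feature to drop.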


> To my surprise, ARM has an add-multiply instruction

Fused multiply-add (https://en.m.wikipedia.org/wiki/Multiply%E2%80%93accumulate_...) is extremely useful and part of IEEE 754-2008.

Because of that, it would surprise me if a modern CPU with floating point wouldn’t have it.

> And RISC-V uses 6 bytes because of shortened instruction for addition

That Wikipedia page claims RISC-V has fused multiply-add, so that may be compiler inefficiency. An answer to https://stackoverflow.com/questions/57248403/why-fused-multi... says gcc can generate it at -O3)


I think the parent poster might have been referring to the integer multiply-add instruction.

In fact, 64-bit ARM does not have a regular MUL instruction. "MUL" is an alias to MADD with the zero register as addend.

BTW. There is also no REM instruction. Flipping one parameter bit in the instruction word of an MADD makes the unit negate the product, changing it into a MSUB. That instruction is intended for computing the remainder after a DIV. So, when you need only a remainder there are eight instruction bytes — and high latency on small cores. Bigger ARM cores (such as Apple M1) perform macro-op fusion of DIV and MSUB to deliver both quotient and remainder in one operation.
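The DIV + MSUB pair computes the textbook identity r = n - (n/d)*d; a quick sketch makes the data flow explicit (truncating division, as the hardware instructions do):

```python
def div_msub(n: int, d: int):
    """Model of AArch64 remainder computation:
    SDIV produces the truncated quotient, then MSUB negates the
    product q*d and adds n, yielding the remainder."""
    q = int(n / d)      # SDIV: quotient, truncated toward zero
    r = n - q * d       # MSUB: n minus the product q*d
    return q, r

print(div_msub(1000003, 97))  # (10309, 30)
```

Which is exactly why a remainder costs two instructions (and a long dependency chain) on small cores, and why big cores fuse the pair.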


It's incredible how large the code is when you compile some trivial piece of C with GCC for x86 or amd64, compared to one's memory of working with C on 386 boxes.

What? I just wrote a simple main and a couple of small helper functions. How can we be past 20 kB of .text section?


Can you share a sample with Godbolt? It might foster some good discussion here.


ARM used to find itself on the wrong side of this tradeoff in the era of 4-wide x86 decode units and 4-6 wide ARM decoders. They lost too much perf to cache size for the decoder width to make up for it.

It's unclear to me if they will pull ahead on the perf/area game with the era of 8-wide x86 decoders coming.


The M1 uses 16kiB pages to reduce TLB pressure and get more non-aliasing cache tag bits.


Can use 16k pages. M1 also can use 4k pages.


> Variable length instructions are not as trivial to decode wide, so you need more cleverness here. However, fixed length instructions decrease code density

This is true in theory but I question how much it applies to current CPUs. A problem I see is that the x86 instruction set was designed in the 1970s with corresponding assumptions about instruction frequency, speed of memory, number of CPU cycles, pipelining (or lack thereof), addressing modes, etc. Even if the variable-length encoding was close to optimal back then, there's no way it's remotely optimal given all the extensions to the instruction set, the change from hand-coded assembly to compiler-generated code, and many more seismic shifts in the ISA. So I think x86's variable-length instructions now have all the downsides of complicated decoding as well as the downsides of suboptimal code density. But you also can't reshuffle all the instructions and addressing modes without breaking backward compatibility.


> However, fixed length instructions decrease code density, which asks more of the instruction cache.

The problem is that x86 has a lot of legacy that wastes the potential gain of variable length encoding. If I remember correctly, the code density of 64-bit x86 isn't really significantly better than 64-bit ARM, despite ARM dropping support for Thumb.

RISC-V is going to be interesting in that regard because with the latest revisions of the compressed instruction format, it can potentially beat both ARM and x86 on real-world code density, while still being easier to decode.


Having a wide backend which cannot be fully utilized is common in non-x86 chips as well. The backend units are specialized while the front end is less so, so to sustain the front-end bandwidth across many instruction mixes you need a wider backend, so that you have enough units of the right type.

Wide x86 decode definitely has a cost, but I don't think it's the primary limiter as you make it out to be: for several x86 generations the narrowest bottleneck has been the renamer which has crept up from 3 to 6 in a glacial manner. Admittedly rename is also complicated on x86 due to memory source/destinations, 3 input instructions, 2 output instructions, etc.


The decoders are largely irrelevant. All modern x86 machines use a uop cache. The large majority of instructions hit in this cache and do not need to hit the decoders at all. As a result, the decoders can spend much of their time shut down. You already have four-ish decoders that are idle most of the time; why do you want eight-ish decoders that are idle nearly all the time?

No one would design an ISA like x86 these days. It definitely does use more power and more die area than strictly needed. It definitely does reduce performance in some applications. It definitely did take heroic engineering efforts to make x86 work well. But, all told, it just doesn't matter very much.


The hit rate on that UOP cache is only about 80%, according to Intel. And UOP cache misses will have strong locality, so they can’t really be hidden.


>> I have an AMD CPU. Zen CPUs come with a fairly wide backend. But the frontend is what it is...

Zen 5 is widening the front end. My guess is with scaling coming to an end, one nice tweak with Zen 6 should be darn near the end of the performance road for a bit. Not saying the actual end, but it should be one of those sweet spots where you build a PC and it's really good for years to come.

I'm still running Raven Ridge and have no need to upgrade, but I will when I can get double the cores or more at double the IPC or more, and maybe at lower power ;-)


I'm sure someone has thought of this before, but there's x86 as the full ISA, and there's x86 as gcc (or whatever) actually uses it. What if you had one or two cores that actually run x86 and the rest just run a reasonable subset of it until they hit a weird instruction and hand the thread over to the more formal one?


You don't need special cores. One thing a lot of people do not seem to realize is that compilers do not use the full instruction set. They use a subset, and this subset is what has to be fast. Here are the common x86 instructions I remember from debugging disassembled code (1):

- mov (load and store data)

- Basic arithmetic (add, sub, div, mul, etc.)

- Bitwise operations (or, and, xor, shifts, etc.)

- Compare (cmp)

- Branches (jz, jnz, jmp, ret, etc.)

- One weird instruction I would occasionally see, used to zero memory (I forget its name). My guess is the compiler used it because it was faster than a loop that zeroed memory.

The point is, people seem to assume ALL instructions have to be fast. The only instructions which have to be fast are the ones the compilers are using. The rest just have to work.

Making the rarely used instructions work probably takes up a negligible amount of space on a typical core. Moving these instructions to a special core would not gain you much but would make old software run slower (because of the context switch to the other core and/or the limited number of cores it could run on).

Note that I know AMD and Intel both have a lot of guidance for compiler writers.

(1) I debugged disassembled code because I was debugging optimized code and it was the only way to reliably determine why a function crashed or malfunctioned.


> One thing a lot of people do not seem to realize is compilers do not use the full instruction set. They use a subset and this subset is what has to be fast.

That’s a self-fulfilling prophecy. Compiler writers won’t use the instructions that are (relatively) slow. If, tomorrow, Intel makes ‘add’ slow, compilers will eventually start compiling addition as negation + subtraction (and AMD will become more popular)

That’s what makes hardware designers unhappy sometimes. If they ship a shiny faster CPU that requires a large overhaul of compilers, it can take time for compiler writers to catch up.

Hardware designers shouldn’t (only) aim for getting closer to the local optimum of where compiler writers are, but also for the higher peaks of where they could be.


Hardware designers typically work with compiler authors to get them to use instructions they'd like.


They already do this. The core has 4 or 5 decoders that work in parallel to decode up to 16 instruction bytes, and only the first one can handle more complex instructions. If there is a complex instruction after the first slot in the byte chunk, it realigns the instruction stream (how's that for technobabble?) so the first decoder gets the complex instruction in the next cycle.

This means to exploit your processor's full decode performance you have to reorder your instructions so the complex ones are the first in each 16-byte group, or something like that.


Lots of companies have tried the road of making existing code execute slowly in a "legacy" mode and asking code to be recompiled to get performance. Those companies generally are not around anymore.


>The x86 decoders consume a reasonable amount of power

The article states otherwise, so which is it?


I think the intent here is to say that decoders' current power consumption is reasonable (i.e., within reason), but widening them could make their power consumption unreasonable (i.e., too high).


Maybe, but maybe not: the E cores on Raptor Lake have two 3-wide decoders, so apparently they can decode 6 instructions this way (though how often or reliably that happens, I don't think Intel has said).


Note that Golden Cove has a 6-wide decoder, and does not have a clear advantage over Zen 4 with a 4-wide decoder. Other parts of the architecture affect performance more, so it doesn't make sense to put more area and power budget into the decoder.

(also I'm the author of the article)


Is that because of ISA complexity? Especially with x86's growth by accretion while keeping backward compatibility all the way back to 16-bit.

So you could conceivably afford a wide frontend if you restricted x86 to a subset (64 bit only, drop a bunch of weird CISC'y instructions).


It’s mainly (I think) the distribution of instruction lengths. x86 instructions can have any length from 1 to 15 bytes. A decoder wants to decode multiple instructions per cycle, and it generally does this in parallel, by simultaneously decoding at multiple starting points. With a fixed-length ISA, to decode n instructions, you just decode them. With x86, if you simultaneously decode at offsets 0, 1, …, 7, you have 8 decoders but are only likely to decode a couple of correct instructions. The rest start in the middle of an instruction and need to be discarded. So you either need many more parallel decoders for the same throughput, or a more complex system to try to avoid throwing away so much work.
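A toy model of that brute-force approach (entirely illustrative, not how any real decoder is organized): place a decoder at every byte offset in the fetch window, then keep only the decodes that turn out to land on real instruction boundaries.

```python
def speculative_decode(length_at, window=8):
    """length_at[i] = the instruction length a decoder would determine if
    it started decoding at byte offset i. Returns the offsets that are
    real instruction starts, plus the number of wasted decode slots."""
    boundaries = []
    pos = 0
    while pos < window:
        boundaries.append(pos)
        pos += length_at[pos]
    # One decoder was placed at every offset 0..window-1; only those on
    # boundaries produced usable instructions, the rest are discarded.
    return boundaries, window - len(boundaries)

# Hypothetical variable-length stream: instruction length starting at
# each byte offset of an 8-byte window.
length_at = [3, 2, 1, 2, 4, 1, 2, 1]
print(speculative_decode(length_at))  # ([0, 3, 5, 6], 4)
```

Here 8 decoders do the work but only 4 results survive, which is the throughput-versus-waste tradeoff described above.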


> ... or a more complex system to try to avoid throwing away so much work.

IIRC Jim Keller said in some interview that modern x86 uses prediction and speculation (similar to branch prediction), and it works surprisingly well.


I’m sure this is doable, but I would certainly count it as “complex”.

But fundamentally, a given chip, dedicating a given area to the task, can only begin to decode at so many positions per cycle. And the more intelligent it tries to be about where to start decoding, the longer into that cycle it needs to wait.

And one nastiness about x86 is that you have to decode pretty far into an instruction to even determine its likely length. You can’t do something like looking up the likely length in a table indexed by the first byte of an instruction.

I wonder whether modern chips have pipelined decoders.


All modern chips have pipelined decoders, including ARM ones. For example, the Cortex A72 has three decode stages, and it's running a 3-wide decoder at low clock speeds.



>You even have open source superscalar RISC-V cores just publicly available on Github running on FPGAs with 8 wide decode. Large frontends are (relatively) cheap and easy, if you're not x86.

Which one? I know BOOM can technically go eight wide, insofar as it's parametrizable, but I suspect any BOOM backend which could support that much throughput would be a nightmare to instantiate on nearly any FPGA.


I had VROOM! in mind (https://github.com/MoonbaseOtago/vroom) because I remembered it aims for 4 IPC average with a width of 8. Though looking again, it's 8 compressed 16-bit instructions or 4 uncompressed 32-bit instructions.

So you could argue a real mix of instructions is not going to be all 16-bit but some 16 and some 32, so 8 is rarely achieved in practice, and also the block diagram only shows 4 decode blocks. But it can in fact peak at 8 instructions decoded per clock, so I'll call that 8-wide decode.

(You could even argue it's especially impressive, since RISC-V technically qualifies as variable-length encoding like x86, it's just that only the 16/32 instructions encoding are really in use at the moment)
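A sketch of how 4 decode blocks can peak at 8 instructions (my toy model of the arrangement described above, assuming each block takes either one 32-bit instruction or a pair of adjacent 16-bit compressed ones):

```python
def decoded_per_cycle(lengths, blocks=4):
    """lengths: instruction sizes in bytes (2 = compressed, 4 = full),
    in program order. Each decode block consumes one 4-byte instruction
    or up to two consecutive 2-byte ones."""
    i = count = 0
    for _ in range(blocks):
        if i >= len(lengths):
            break
        if lengths[i] == 4:
            i += 1; count += 1
        elif i + 1 < len(lengths) and lengths[i + 1] == 2:
            i += 2; count += 2   # two compressed instructions, one block
        else:
            i += 1; count += 1
    return count

print(decoded_per_cycle([2] * 10))         # 8: all-compressed peak
print(decoded_per_cycle([4] * 10))         # 4: all 32-bit
print(decoded_per_cycle([2, 4, 2, 2, 4]))  # 5: a typical mix
```

So a realistic 16/32-bit mix lands somewhere between 4 and 8 per clock, matching the "rarely achieved in practice" observation.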


As you point out current VROOM! is 4-8 into the decoder depending on the mix of incoming instructions, each decode block can decode one 32-bit instruction, or 2 16-bit ones.

A VERY general rule of thumb is that every 5th instruction is a branch, decoding more than 8 may be pointless

However VROOM! can replay 8 instructions per clock out of the trace cache (they're already decoded and get pushed into the pipe after the decoders).


Hey! I looked at your blog and loved your bug analysis, it's extremely rare to see this stuff publicly but I'm excited for more as the FOSS CPU scene speeds up. Hopefully the wide decode will be more useful when trace cache lands.

Is VROOM primarily targeting ASICs or FPGAs? I know BOOM is primarily meant for ASICs, but I feel like there is still a lot of room to build a good OoO FPGA core, particularly since us mere mortals can't get a spot on a wafer big enough to support even the smallest configs of BOOM (or presumably VROOM); the efabless Sky130 shuttles, while "affordable", are hardly big enough for most OoO cores. I've been thinking about taking the ideas from this paper [1] and adding them to BOOM to try and make it more FPGA-friendly, since that seems like the most realistic way to get a very high performance custom core at this point.

[1] https://ieeexplore.ieee.org/document/8977924/


I've previously been an ASIC designer, so that's what I've been aiming it at. FPGAs are just a great way to find bugs (you'll notice I've spent no time on performance) though now the design is too big for even that

I've been assuming that anyone building one would be building hand built data paths/register files/caches/TLBs/etc largely because I've come from that world


Do you have references about AMD CPUs being more frequently bottlenecked by instruction decode despite uop caches (vs non x86)?


I don't have any good benchmark to link to off the top of my head, but I'll handwave in the general direction of Agner Fog's guide, which is accurate in my experience (and generally a great resource): https://www.agner.org/optimize/microarchitecture.pdf

From the multiple "Bottlenecks in AMD Zen" sections, a common point is

> The limiting fetch rate of up to 16 bytes per clock is a very likely bottleneck for CPU-intensive code with large loops

Although admittedly, if your small hot loops fit in the uop cache, that does largely mitigate the fetch/decode problem.
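The arithmetic behind that fetch bottleneck is simple (the ~4-byte average instruction length and the 6-wide backend are rough illustrative assumptions, not measured figures):

```python
fetch_bytes_per_clock = 16   # Zen fetch limit quoted above
avg_insn_bytes = 4           # rough x86-64 average (assumption)
backend_width = 6            # illustrative dispatch width

# Sustained decode from fetch alone caps out below the backend width.
sustained_ipc_cap = fetch_bytes_per_clock / avg_insn_bytes
print(sustained_ipc_cap, sustained_ipc_cap < backend_width)  # 4.0 True
```

Hence the reliance on the uop cache to feed the backend past what raw fetch bandwidth allows.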


Performant risc-v implementations for desktop/mobile/server, fast!


How fast do you need them to be?

e.g. we know about Ventana Veyron, TBA before end of year, and Tenstorrent's Ascalon, TBA 2024, competitive with Zen5 which is also 2024.

Ascalon is 8-wide, but has smaller siblings at lesser decoder width, to cover a range of uses.

There's also Rivos, MIPS and SiFive working on very high performance cores, but we know less about these efforts.


On desktop: fast enough to run state-of-the-art games and decode 4K AV1/H.265 video at 60fps (per-thread performance) without a 1kW power supply and water cooling. Since there will be a transition period before heavy assembly optimization starts to kick in (similar to the dav1d decoder), it would start with gcc/clang-generated machine code. Oh, and it should compile a full-blown Linux kernel and those horrible gcc/llvm as fast as their arm/x86_64 counterparts. Compression/decompression would be a good benchmark too.

Mobile: a power-consumption/performance ratio good enough to run the CPU-wise worst Android apps, off AC, as smoothly as the latest Samsung Galaxy flagship does.

Server: erf, I guess that would be tons of cores with tons of RAM and cache, and then back to full-blown Linux kernel/gcc/llvm compilation speed, compression/decompression, etc.

We are talking about world domination of a worldwide royalty free/ultra stable in time ISA (oxymoron).


Your post used "performant" and "fast".

I understood fast as to do with roadmaps, and used it that way.

Before year end for Veyron, next year for Ascalon. Potentially, there'll be further announcements for performant microarchitectures next year.

As for performance, with Ascalon promising similar to Zen5 performance at significantly reduced power consumption, I don't think there's much reason for concern: Very performant RISC-V microarchitectures are coming.

With all these companies and their strong teams working on new RISC-V microarchitectures, there'll be plenty of choice, soon.


Hope you will be right.

I am always cautious about this. Very. You have those horrible compilers in the way.

I still dunno on which risc-v microarchitecture I'll start coding assembly (I wanted to start on the MangoPi MQ-Pro, with a keyboard firmware, 100% 64-bit risc-v). Finally starting to get rid of those pesky compilers and system-language syntaxes which are unable to stay stable in the long run. At best, we'll get very high-level language interpreters directly written in risc-v assembly. Just need to be very cautious about macro processor usage.

I started to code some core functions directly in x86_64, and I know it is only a transition phase towards risc-v 64bits ISA (porting from 1 modern ISA to another is still much, much, less work than to code from scratch). I'll enjoy the additional register space (even with a more register space consuming load/store ISA).

BTW, we are still missing hardware instructions for direct memcpy/memset/memcmp in risc-v; that said, I don't know if that will be pertinent in the end.


>for a keyboard firmware

Very good to hear, especially if it is open source and performs well (low latency and low jitter).


I would need MangoPi MQ-Pros I can buy with a noscript/basic (x)html browser or from a local retail shop, and then time, a significant amount of it. I am currently on a big project which will take me months to wrap up (but it is starting to get on my nerves... I might send it to hell).

Ofc, this firmware would be bare metal. Not joking around. For the moment, I am a bit scared of porting the SoC and board init code to human-written assembly.

The only non-open source programs I know of are elf/linux video games and the steam client.


What does matter is standardization, for example the boot process. When I have an x86 image of Windows/Linux, I can boot it on any x86 processor. When I have an ARM image, well, then I can boot it on the SoC it is built for, and even that's a big maybe, because if the outside peripherals are different (e.g. a different LCD driver) or live on different pins of the SoC, then I am screwed and will have at best a partially working system.

Standardization is something that will carry x86 very far into the future despite its ineffectiveness on low-power devices.


> When I have ARM image, well, then I can boot it on a SoC it is built for ... or lives on different pins of SoC, then I am screwed and will have at best partially working system.

That's not even the half of it either... what firmware does the board run? U-boot is nice, but sometimes you aren't lucky and you're stuck with something proprietary. Although if you're extremely lucky, you'll have a firmware that supports efi kernels.


> When I have x86 image of Windows/Linux, I can boot it on any x86 processor.

That is largely due to IBM choosing x86, and the PC taking off in a huge way with its de-facto "open" design that ended up being successful and kept backwards compatibility. One could easily imagine a world in which IBM chose ARM (and it was invented earlier) for its first PC, and proprietary x86 SoCs based on Intel's cores are everywhere instead. A world in which CISC is the new fad.

Note: Intel has non-PC-compatible x86 SoCs too.


I'm curious what those would be in the modern era?

The "non-PC" x86 chips I can think of:

80186/188 - sort of predates the IBM hegemony; can be hammered into shape by adding replacements for onboard peripherals

80376 - a failed, long-gone experiment

386CX/EX - not entirely sure how incompatible they are

Xeon Phi/Larrabee/Knights Corner designs - not really SoCs so much as special-purpose accelerators


The PS4 (and probably PS5 as well) use x86 chips but are not PC-compatible.


I think the Moorefield and Merrifield SOC platforms are not PC compatible?


Also these: https://en.wikipedia.org/wiki/Intel_Quark

They are basically a 486 pipeline with some Pentium instructions.

Unfortunately it seems Intel didn't realise that x86 without the PC legacy is worth little, so their attempts at non-PC x86 have mostly failed. On the other hand, "PC-on-a-chip" SoCs like https://en.wikipedia.org/wiki/Vortex86 have enjoyed more popularity.


This might be true, but in the world we live in, x86 is the open platform and ARM is a mess of incompatibility.

You want to install Ubuntu on your laptop? Download this ISO and you're good.

You want to install LineageOS on your phone? You have to download the exact binary for your phone (which means LineageOS needs to maintain those hundreds of versions) and hope your phone is supported.


This is because your PC has a software compatibility layer called a BIOS.


The BIOS (or UEFI) is not used by most OSs which aren't DOS. The compatibility comes from the standard peripherals (DMA, PIT, PIC, FDC, 8042) and PnP interfaces like PCI (as well as standardised interfaces located behind them, e.g. USB OHCI/UHCI/EHCI/XHCI, SATA BMIDE, LPT, VGA, etc.)


The BIOS (or UEFI) provides a standard interface to load an operating system which can then discover which hardware is installed.

It's true that another factor is the software discoverability of hardware. A lot of stuff on ARM platforms is not discoverable because those platforms are intended to run specific software.


> When I have x86 image of Windows/Linux, I can boot it on any x86 processor.

Where this gets absurd is modern Debian supports i686 and up. You should be able to get a 27-year-old Pentium Pro to boot the same image as a Raptor Lake CPU.


My memory is that the pentium pro had a common memory config of 256MB, way below what modern installers are going to expect. I'm sure you can get linux to install on 256MB, but I doubt it's going to work on the current RHEL/SUSE/Ubuntu installers.

Not to mention various drivers have gone without maintainers and pulled from the upstream linux kernel.

Not to mention dropping IA32 and related PAE support.


With the push for ARM in the datacenter, ACPI adoption is on the upswing. In theory, ACPI could be used on consumer devices as well, there's just little incentive to do that right now.


The flip side of this is that you can almost always get exactly the SoC you need with an ARM, which makes it great for embedded applications. But yeah, lots of custom board bring up…


> When I have x86 image of Windows/Linux, I can boot it on any x86 processor.

This has come with a stack of caveats since the advent of Secure Boot.


Yeah, but on the other hand these ARM platforms have different capabilities BECAUSE they are not standardized.

All PCs have some kind of fast-updating display output (often HDMI) and a BIOS to emulate a CGA card from 199whatever on this interface. My PineNote has an e-ink driver connected by something called EBC. Is that compatible? Maybe, if someone writes a BIOS to make it work. And that wouldn't be amiss, actually, even though e-ink display isn't optimal for software designed for HDMI, it would at least make it possible to get something working quickly.

It also has a touch stylus and a battery - among other peripherals. Can you tell me which BIOS function code gets the X/Y position of a touch stylus? There isn't one - it would be a non-standard extension and we're back to square one. Or should the BIOS implement an on-screen keyboard? Every PC has a keyboard.


>What does matter is standardization. For example a booting process.

Truth.

This is why RISC-V put a lot of effort on this, and put it early.

Relevant specs include, but aren't limited to, SBI[0], the UEFI protocol[1] and the ongoing platform specification[2].

0. https://github.com/riscv-non-isa/riscv-sbi-doc/releases

1. https://github.com/riscv-non-isa/riscv-uefi/releases/tag/1.0...

2. https://github.com/riscv/riscv-platform-specs


That was mostly IBM's doing and goes well beyond the purview of the ISA specifically. x86 doesn't really define that you "must use a BIOS/UEFI/etc." You can't boot, for instance, a standard copy of Windows or Linux on a Sony PS4 (which is not actually a PC compatible, even though it almost seems like one) without ten billion asterisks and hacks.


No you can't, especially on game consoles or in embedded deployments, which, although they might have x86 CPUs, have motherboard designs incompatible with standard PC expectations.


This RISC/CISC debate made a lot of sense when the "guts" of the CPU (decoder, registers, ALU, etc) occupied almost the whole chip (or box of transistors), and when memory and CPU had similar speeds.

The situation now is very different. The vast majority of the chip is cache. The trade-off for a much bigger decoder (or register stack, or whatever) is now just a fractionally smaller cache.

And on current systems the CPU and memory operate at enormously different speeds - memory is 10+ times slower than the CPU. So to keep the CPU even vaguely busy, we have resorted to enormous caches. And we use large numbers of cores & threads, so that when one thread is waiting for memory, others may be able to run.

The old game was max performance from a limited number of gates. The current situation makes such enormous numbers of gates available (even on tiny, cheap chips) that we're playing an altogether different game.


The interesting part of RISC vs CISC now isn't the complexity of the instructions or whether they're variable or fixed length. Both ARM and RISC-V have aspects that could be considered CISC-like in that regard.

The interesting difference is the memory model of the instructions: what kinds of dependencies between instructions you have to keep track of when decoding them. If I understand correctly, that's what makes it very hard to scale x86 wider, as discussed in another top comment in this thread. Decoder cost doesn't necessarily scale linearly with decode width.

Since ARM and RISC-V haven't traditionally competed in the very high-end, ultra-wide superscalar CPU space until recently, it's possible we won't see x86 held back until more advanced ARM/RISC-V designs manage to scale to a wider architecture than x86 is reasonably capable of.


While area is now very cheap, power and thermal is more of an issue than ever. Huge d/i caches are doable since they save power, huge arrays of decoders not so much. The CISC-RISC debate wasn't just based on area; everyone back then also saw the direction that scaling was taking us.


> memory is 10+ times slower than the CPU

The difference is much greater than that - one uncached RAM access can take hundreds of CPU cycles.


I wonder where we’d be if the idea of CPU-independent bytecode had ever really taken off - for example, the TIMI bytecode of IBM System/38 and AS/400, TenDRA TDF (aka OSF ANDF), WebAssembly. You could have an AOT compiler in the system firmware which the OS invokes, when a program is installed the OS uses AOT to convert it to the actual machine code, about which the OS might know nothing, and could vary incompatibly from CPU to CPU, even among different CPU models in the same family. (Maybe a JIT mode too.)

I guess JVM/CIL are somewhat similar, but at a much higher level - I’m not talking about garbage collection or type safety.

In some ways that is true of TIMI too - it is designed to support a capability-based operating system, and hence has some rather high-level instructions, although still not as high level as JVM/CIL - it was generally used as a compilation target for non-garbage collected languages such as RPG, COBOL, C/C++, PL/I, Fortran, BASIC, Pascal, etc - and hence lacks a garbage collector.


> I wonder where we’d be if the idea of CPU-independent bytecode had ever really taken off [...]. You could have an AOT compiler in the system firmware which the OS invokes [...].

You could say modern CPUs are kind of like tracing JITs. On one hand, a normal tracing JIT has much more memory to save its work than a CPU’s trace cache, but on the other, the superscalar reordering and renaming stuff is even more aggressive than a trace recorder about looking at how the code actually executes and deriving assumptions from that instead of attempting to prove them statically.

Why not AOT instead? In part because they can’t, of course—a tracing JIT requires about the least amount of heavyweight compiler tech out of all the possibilities, which is an advantage if you’re trying to fit the compiler into silicon. (That’s not to say a tracing JIT is easy—the cost of a simple compiler is that you need to make it hella fast for the result to be any good.)

But in part I suspect it’s because a standard assembly-level bytecode kind of sucks to compile ahead of time. About the most useful assumptions such a compiler can make is which things don’t interfere with others, usually memory operations, or perhaps which writes can be forwarded to reads. A tracing JIT can see some of this, a superscalar even more so; an AOT or function-at-a-time JIT, in the absence of any aliasing information or even knowing when one object ends and another begins (boo WebAssembly), can’t.

Ironically, memory segmentation as in the Intel 432 or 286 (or the IBM dinosaurs) feels like it could help with that (or are we calling this idea “capability-based” once again?). Does anyone who isn’t just a speculating dilettante (unlike me) think that’s a reasonable thought?

(Wait, is a selector table just a Smalltalk-style object table with a fake moustache?)

Of course, even then we’d still have the problem that VLIW microcode wide enough to require no decoding and engage the entirety of a modern CPU’s physical register file and execution units would be cripplingly slow to fetch from DRAM, and the “legacy” ISAs partly serve as a compression format.


Transmeta Crusoe? They chose X86 machine code as their CPU-independent bytecode.


In demos the Transmeta processors were shown to support multiple instruction sets - per https://en.wikipedia.org/wiki/Transmeta#Code_Morphing_Softwa... , they demoed pico-Java, and also there were rumors of PowerPC compatibility.

Although you're probably right - none of those options made it into a shipping product, only x86.


I guess today's equivalent VLIW chip would be Tachyum Prodigy, not super confident about it.. https://www.tachyum.com/products/#products-prodigy


Nvidia's denver2 cores work this way. Shipped on an android tablet about 10 years ago. Not sure what happened to them after that.


Mill uses something similar as well. That being said, I have <1% confidence in Mill ever moving past the slideware stage, so...


"slideware": Hat tip. I never saw that term before. I usually see vaporware, e.g., Duke Nukem Forever.


> I wonder where we’d be if the idea of CPU-independent bytecode had ever really taken off [...] convert it to the actual machine code, about which the OS might know nothing, and could vary incompatibly from CPU to CPU, even among different CPU models in the same family

There are various examples I can think of.

Nvidia's CUDA platform compiles C++-like source code to PTX binary code which is GPU-independent. At run time, PTX is compiled and specialized for the specific GPU model you are running on. I can imagine that PTX is compiled differently depending on the number of registers in the GPU as well as its instruction set capabilities. https://en.wikipedia.org/wiki/Parallel_Thread_Execution

Mainstream virtual machine languages like Java, .NET, and JavaScript are obvious examples.


Given that everything is just microcode anyway, it would be really interesting if someone (e.g. Intel) took their design and only switched out the instruction decode to decode ARM (or whatever) instead.

Sure it wouldn’t be perfect since the chip is optimized based on x86-64 workloads, and they’d never publish it anyway. Plus it may only be simulated instead of spending the money on manufacturing the one-offs.

But boy would it be interesting to see how it performed in various dimensions, just as an exercise.


> Given that everything is just microcode anyway, it would be really interesting if someone (e.g. Intel) took their design and only switched out the instruction decode to decode ARM (or whatever) instead.

You probably mean uops, but that thought has also crossed my mind in the past --- a multi-ISA CPU. They could add the decoders for other ISAs, along with extra GDT descriptor types for "ARM mode", "RISC-V mode", etc. segments like they did with V86. It's not a new idea either, https://en.wikipedia.org/wiki/NEC_V30#ISA_extensions could execute both x86 and 8080 code and of course ARM has cores with the triple-mode ARM32/Thumb/Aarch64 ISAs.


Yes I did, thanks. It’s also kind of reminiscent of the Transmeta Crusoe.

The problem I think multi-ISA would run into is the “master of none” issue. Intel can tune for how x86-64 works, Apple and Samsung for ARM.

But if one chip runs it all, it can’t tune for anything too specific.

It must not be worth it. I wonder if Apple would have done something like that for the M series to let it keep running Intel software. They must have tried to figure out if it was worth it right? I know they added a few instructions or an addressing mode or something to help. But they must have determined it wasn’t worth it and it could be done well enough in software.


Exactly, they added a flag to enable total store ordering to help x86 instructions map cleanly to ARM instructions.

https://twitter.com/ErrataRob/status/1331735383193903104

Considering how fast Apple M series can emulate x86 it's clearly not worth adding much more hardware than what they have now.


Back when we were digging into microcode we found a mention of this as a PoC/toy example [1]. Sadly we never found more than an overview, would have liked to know more about it, especially how the update was accepted.

[1] https://troopers.de/events/troopers16/655_the_chimaera_proce... by https://twitter.com/cynicalsecurity


Seems irrelevant though - the internet exists. I'll never have a problem getting the code I need, provided it exists - i.e. if all I have to do to support ARM is use the ARM compiler, then I'll support ARM.

Docker with ARM-specific Linux distributions solves this, as do things like Go with its "just set an environment variable and don't even worry about needing a cross-compiler" toolchain.
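For instance, cross-compiling in Go really is just environment variables (output names here are illustrative):

```shell
# Build the same package for three Linux targets from one machine,
# with no cross-toolchain installation needed.
GOOS=linux GOARCH=amd64   go build -o app-amd64   .
GOOS=linux GOARCH=arm64   go build -o app-arm64   .
GOOS=linux GOARCH=riscv64 go build -o app-riscv64 .
```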


> Seems irrelevant though - the internet exists. I'll never have a problem getting the code I need, provided it exists

That assumes all code is open source, or else proprietary code shipped with source. That's not the world we live in. Most businesses run at least some closed source on-premise software. Open source is great at providing solutions to problems most people have. But when you start looking at specialised software which is highly industry-specific, suddenly open source starts to look a lot more patchy, and a closed source solution is often the only realistic option.

For example, at many engineering firms (whatever type of engineering they may be doing), you will find heaps of closed source software being used every day. For much of it, there simply is no open source solution available – or if there is, it is missing major features, or is clunky/buggy/poorly-designed, and the amount of extra cost in adopting it will be a lot more than just continuing to pay for the closed source alternative.


I agree with this - but my point is that I think the difficulty of compiling to alternative architectures is more of an impediment. If it's easy, then companies will just do it, give or take "we don't want to support that platform".


> Seems irrelevant though - the internet exists.

... but it's not always useful. MSI motherboards doing "secure boot" can't check for key revocation until after they've booted :( Sometimes you just have to rely on what you've got.

https://arstechnica.com/information-technology/2023/05/leak-...


We kind of have it, except it does not seem to have taken off and has been relegated to a second-class citizen at best.

I am talking about LLVM Bitcode[0] – when the binary product (an executable or .o/.a files) is shipped in the LLVM IR representation and then is «AOT»'d into the final product (a final executable) that can, for instance, take advantage of the latest ISA features (armv9, Zen23, POWER18 or new RISC-V extensions) with zero effort on the end user's part. For a while, Apple even encouraged iOS devs to upload their apps into the App Store in the Bitcode format. That all but ceased to exist, for non-obvious reasons, about a year or two ago. Technically, if Apple chose to transition onto an alternative ISA again (say, RISC-V), at least iOS apps would not require recompilation and would get statically converted to the new ISA at download time.

Imagine a world where there would be a single Linux distribution for a given architecture shipped in the Bitcode format (sans the small arch specific boot area and the AOT engine), for instance.

[0] https://lowlevelbits.org/bitcode-demystified/

[1] https://www.highcaffeinecontent.com/blog/20190518-Translatin...


LLVM has multiple issues that makes LLVM-Bitcode as a modern ANDF unsuitable.

It is a moving target. SPIR (OpenCL/Vulkan) used to be based on LLVM-IR, but each version had to be locked to one specific version of LLVM and that wasn't viable in the long run. So SPIR-V got its own IR, and hasn't looked back.

There are many subtle differences between architectures and their ABIs. In some ways LLVM-IR is too low-level, so the compiler has to lower to a specific ABI even before emitting LLVM-IR code.

LLVM-IR was made for a C/C++ compiler, and retains many C-isms still. What is undefined behaviour in C is often undefined behaviour in LLVM-IR, and therefore bugs have different effects on different hardware. A large software vendor would therefore still need to keep a farm of different machines to test its code on, and that is the total opposite of what one would want to accomplish.

A truly hardware-agnostic platform would need to both have its own virtual CPU with defined semantics, and its own ABI, so as to provide an abstraction around the specifics of each hardware platform. But if it has its own ABI, then it wouldn't be 100% interoperable with existing Linux libraries on each hardware platform either.


I don't know about the JVM, but CIL's type safety isn't enforced at runtime; it's checked by the compiler/verifier instead.


Hotspot verifies the bytecode when it's loaded, but after that, it's completely unsafe too. This verification can currently be disabled, but the flag is deprecated for removal at some point.


(Turns to phone booth, ripping off tie) "Sounds like a job for super Forth!"


Seriously - something similar to the threaded interpreted code of Forth might be a nice way to implement low level byte-code on a modern machine.


x86 is already CPU-independent bytecode.


It is interesting to me how both instruction sets have converged on splitting operations into simpler micro ops. The author briefly mentions RISC-V as having "better" core instructions, but it makes me wonder if having the best possible instructions would even help that much.

If you made a CPU that directly ran off of some convergent microcode, would you then lose because of bandwidth of getting those instructions to the chip? Or is compressing instruction streams already a pretty-well-solved problem if you're able to do it from a clean slate, instead of being tied to what instruction representations a chip happened to be using many years ago?


> If you made a CPU that directly ran off of some convergent microcode

I think that’s the original idea behind RISC


Yes, and obviously ARM didn't choose the instructions in its reduced set optimally, if the best implementations require those instructions to be split into smaller ones. But that doesn't really speak to whether it's just better to pack instructions that way, or whether these implementations of ARM and x86 just need to do it to be performant in spite of deficiencies in their instruction sets.


ARM is a weird beast across the spectrum of RISC designs. The original ISA design was inspired by Berkeley RISC (which had only a two-stage pipeline) and then optimized for what could be reasonably cheaply done in the silicon process used, with the hardware implementation bearing striking similarity to traditional non-pipelined "CISC" designs. This allowed cheap implementation of various instructions, like four-operand ALU operations or instructions that do multiple memory accesses, which are more or less unthinkable in other RISC designs built for pipelining and 1 IPC.


Microcode is often attributed to [1] from 1952.

[1] M. V. Wilkes, J. B. Stringer, Micro-programming and the design of the control circuits in an electronic digital computer.


It's arguable that Babbage's design for the Analytical Engine included microcode.

See, i.a., https://www.fourmilab.ch/babbage/glossary.html


Interesting, I had no idea. Wilkes doesn't cite Babbage, oh dear!


It would be interesting if a program could define some new instructions by specifying the microcode :)

Though that might make context switches rather expensive.

And if you're going down that route, perhaps time to think about FPGAs.


edit: The emphasis is wrong here. The main benefit of a relaxed memory model is that it reduces the amount of work the memory subsystem HW needs to do.

Another ISA difference not discussed by the article is the memory model. I expect that the memory model and the degree of reordering permitted by the architecture could have a significant impact on performance.

x86 imposes stricter ordering constraints on memory accesses. The choice of this memory model was made long ago, when, I guess, it was felt that this made the behaviour easier to understand for the programmer.

In contrast, ARM's memory model permits more reordering of memory accesses, which can lead to potential data races and inconsistent program behavior. However, this greater flexibility can also lead to higher performance, as the processor can execute memory accesses in a more efficient manner by overlapping and reordering them.

I can't find any studies that measure the impact of this, but I'd be surprised if it wasn't a significant win for the ARM ISA in many programs.


For languages like C++ and Rust the memory model is part of the language and so is taken into consideration by the compiler backend too. Code that's explicitly asking for Relaxed (the fastest ordering, which has no benefit on x86) is fairly rare.


Hmmm. Fair point. I got this argument from a HW engineer I know. You forced me to think more about what he was saying. His point was actually that the TSO model requires the HW to keep track of more stuff. More stuff uses more energy and adds latency to every memory access. It also limits how many accesses the memory subsystem can handle in parallel. He points out that ARM, RISC-V and even ia64 (Itanium) chose the relaxed model because it is better.


I have wondered something for a while. x86 (32- and 64-bit) has a much smaller average instruction size than RISC ISAs. Given that modern compute performance is heavily affected by cache performance and memory latency, it seems like the smaller code size could really help with caching performance. Is this actually a factor? Does ARM need bigger caches to combat this? Or does the unpredictability of unguessable instruction offsets ruin this in some way? It's obviously only one factor among many, and RAM is often filled with data, not just code, but I've wondered about it for a while.


If memory serves, the net code density is on par between x86 and ARM.

Of course if you take some small code samples, yeah, it's an easy win for x86, but for real world programs, there end up being no measurable difference.


> the net code density is on par between x86 and ARM

It was similar with Thumb, but now that there's ARMv8, ARM code density has probably decreased.


> In fact, several workloads saw less power draw with the op cache was disabled. Decoder power draw was drowned out by power draw from other core components, especially if the op cache kept them better fed. That lines up with Jim Keller’s comment.

There's a hidden problem here: disabling the opcache couldn't possibly improve the power draw of other components. The system is simply doing less work per unit time because it issues fewer instructions in the same time.

This is a common benchmarking problem when the workload isn't fixed, e.g. when a benchmark tries to run as many iterations it can in N seconds. By increasing execution speed (or in this case, decreasing it), you also increase the total amount of work done per unit time.


Decoder complexity matters. ARM with its single instruction width allows arbitrarily parallel decoders with only linear growth in transistor count. X86 with its many widths and formats requires decoders that grow exponentially in complexity with parallelism, consuming more silicon and power to achieve higher levels of instruction level parallelism. It requires a degree of brute force with many possible size branches being explored at once among other expensive tricks.

This is one of the major areas where the instruction sets are not equal. ARM has a distinct efficiency advantage.


It's not exponential; it's not even quadratic (it is superlinear), if you put any thought into the design. I worked on an x86 part with 15 decoders/fetch unit. The area was annoying, but unimportant. (We didn't commit 15 ops/cycle; just pc-boundary determination.)

I've also worked on ARM/custom GPU ISAs. The limiting rate is the total complexity of the ISA, not the encoding density.

In fact, from an I$ point-of-view, the tighter x86 encodings are a pretty good win — at least a few % on very long fetch sequences.


> In fact, from an I$ point-of-view, the tighter x86 encodings are a pretty good win — at least a few % on very long fetch sequences.

This is still what baffles me about RISC-V's G profile not requiring the compressed ISA. The performance benefits are quite dramatic for the (comparatively) limited complexity it adds, considering G already includes atomics and floating point. I think Linux is eventually going to force the issue since its 64-bit ABI is GC.


At this point the G subset is just a historical notation shorthand, not really a target.

The new thing is RISC-V Architecture Profiles like RVA23, which comes with the base C extension plus extra compressed instructions specifically intended to reduce code size (Zcb), bitmanip instructions, the V vector extension now being mandatory, etc etc.

This is what big application cores are expected to target in the future, so if successful you could expect Linux distros to start taking advantage of those at some point


Does anyone actually care about this? The x86 decoders are not large on modern implementations, and putting more transistors on dice is a well-solved problem.


It uses more power. The decoder is like another ALU that is always screaming at 100%. It means you can easily keep up with ARM in speed but not power efficiency.


It doesn't seem to matter in practice. Current generation Intel CPUs and the Apple M2 have very similar performance at the same power levels.


How did you arrive at that conclusion? Comparing the M2 Max vs the 13650HX, it's very obvious the M2 Max uses a LOT less power. It's not even close; it's less than HALF the power. The M2 Max has slightly worse performance, but it manages to beat the Intel in some benchmarks.


You don't have to let the Intel chips scale up the power like that. You can lock them to whatever power level suits you. An i7-1370P configured at 20W has broadly similar performance to an M2.


Mind linking me some power measurements at the same wattage? I didn't even know you could set a power target on Intel or Apple M-series. Well, you can disable turbo boost on Intel, but even then Intel blows past their own marketed TDP by a lot.


Intel introduced the "running average power limit" over ten years ago. https://lkml.indiana.edu/hypermail/linux/kernel/1304.0/01322...


RAPL doesn't allow setting a limit which is always obeyed. You can set PL1 and PL2 limits, but Intel CPUs will gladly go over those limits in the short term, for example when running a benchmark. That's why I asked for specific benchmarks which include power measurements.

For example: https://www.notebookcheck.net/i7-1360P-vs-M2_14731_14521.247...

this shows the M2 has a little worse performance compared to the 1360P. But the 1360P requires 2.5x the power to achieve that.


Apple’s chips are on better process nodes, which confuses the issue. That being said, you really have to test chips at the same power level to get an idea of performance per watt in a comparison.

You can easily double CPU power for only a few hundred MHz or 10-20% extra performance.

See https://www.pcworld.com/article/1359352/cool-down-a-deep-div..., which benchmarks chips at different power limits for an example.


Yes I agree with you. That doesn't mean this is easy to achieve. With the exception of AMD chips it's unfortunately very hard to simply "benchmark with a fixed power budget".


Your model of RAPL's abilities is too limited. "PL1/PL2" is a thing that youtube reviewers have figured out, but it is a part of a larger picture. It does not make sense to discuss them without also discussing the time parameters. RAPL is able to hard-cap the (estimated) peak power consumption. If it lacked this feature, operating them at warehouse scale would be impossible.


This article states otherwise, so which is it?


How much power does it (roughly) use? Are we talking about 1% of the overall usage? 10%? 50%?


One could argue that one of the reasons why SIMD instructions, and indeed GPUs, are popular, is because they amortise the (transistor and power) cost of decoding over more compute units, in the case of GPUs over many more.

There are also other considerations, like rolling back state in OOO machines, or precise exceptions. All this becomes more complex with an x86-style instruction set.


There are also other considerations, like rolling back state in OOO machines, or precise exceptions. All this becomes more complex with an x86-style instruction set.

It's not really more complex, because those are backend concerns and work on the uop level, after the instructions have already been decoded into uops.


I was under the impression SIMD exists because clock speed stopped scaling. Instruction-level parallelism is hard, so there's a lot to gain from just making instructions wider.


SIMD is orthogonal to clock-speed.

There is indeed a lot to gain from having a single instruction trigger more complex behaviour, for example better instruction density, less instruction decoding needed, but all of this is independent of clock frequency.

I think, but am not sure, that the Thinking Machines CM-2 from 1988 was a 4096- or 8192-wide SIMD machine. Surely, at the time, clock speeds were low.


How is it exponential? It's only a multiplicative increase in decode positions.


x86 is not a very CISCy CISC, compared to VAX or 68020. x86 has relatively simple addressing modes and often limits instructions to one memory operand.

It would be interesting to see how well a modern VAX or 68k could compete with x86 and ARM.


I came here to say much the same. The CISC/RISC labels were defined in a time of different costs and CPUs. Roll forward, and RISC has become more complex while CISC has adopted many features of RISC.

I would have loved to see more clear signs of how much L1/L2 cache plays here, and the interconnect between cores. I suspect we're now well down a path where writing code to fit into L1 and writing code to balance load across cores has more importance than anything else.

(not a VLSI or ISA person btw)


I look forward to the day when ISA doesn't matter.

Picking an ARM laptop because right now the technology is better suited to mobile applications while having an x86 desktop because that's the best technology for that right now seems sensible.

Translation layers are constantly improving - maybe soon we will be able to play games and run software with the same level of performance/reliability regardless of the CPU architecture.


Very interesting article. If its assertions and conclusions are correct then it's very useful information.

I've been somewhat suspicious of CISC processors with large microcode real estate ever since the days of the Pentium bug, which was, if I recall correctly, when microcode began to quickly increase in complexity. (I'm not a processor designer but it seems to me that with increased size and complexity bugs would be more likely.)

Alternatively, RISC seemed too limited, but then I'm viewing this from a programming perspective—not that of processor design.

Probably the most interesting aspect of the article was Jim Keller's point about the 'cleanness' of the newer RISC-V architecture, as that may settle the ongoing comparisons and arguments over it and ARM. For people like me who want to see more independent and open hardware architecture available, it'll be interesting to see how that plays out over time.


The chart of how the A64FX supercomputer chip uses uops is telling: https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2021...

We're targeting hardware compilers. At this point, why not just do the Transmeta route again and abstract away the last bits of specialty hardware? The Crusoe emulated things like the x86 MMU on its VLIW hardware.

Example: LDADD is actually 4 operations, the chip emulates an atomic operation for one of the simplest instructions.

Things I'd like to happen that are kind of opposites:

- To be able to run uop programs. Sure, Intel says the decoder is great, and they gave up on making Itanium performant through compilation. Give us the choice. There's no way the hardware decoding of instructions is always the most performant option. GPU manufacturers acknowledged their fallibility with lower-level frameworks like DX12 and Vulkan, and I hope we'll continue getting lower and lower level access to their hardware as well.

- Maybe the chip manufacturers could all support a dual operation mode: their legacy ISAs, and a compatible ISA. Look at all these options we have on the way to a convergent ISA:

https://en.wikipedia.org/wiki/Alternate_Instruction_Set

https://hackaday.com/2021/03/26/undocumented-x86-instruction...

The above-mentioned Transmeta Crusoe implements an x86 VM. The Zilog Z80 is an Intel 8080 with renamed instructions (plus extensions). The WebAssembly route is too high level; it's still running on top of the OS, even. Java is cool, as is .NET IR, but they also just compile down to native assembly in the end. Give me the flexibility of the Gigatron to target two different ISAs: https://hackaday.com/2019/07/02/emulating-a-6502-in-rom/


Yeah, no. Tooling support, driver support, and general optimization matter. So does platform maturity.

You don't want your phone to run x86 (and it won't for a while), and though possible, it's a pain to deal with an ARM server at the moment because some random library you use just won't be compatible. And if single-threaded performance matters, ARM is behind by a decade.


This isn't the point of the article.

The article is commenting on CPU design: area efficiency, power efficiency, design cost, etc. They're proposing that the reason x86 CPUs have historically beaten ARM CPUs in performance, and the reason ARM CPUs have historically beaten x86 CPUs in power efficiency, has nothing to do with the design of the ISA itself. You could build an ARM CPU to beat an x86 CPU in high performance computing, or vice versa. They're saying that the format of the instructions and the particular way the operations are structured isn't the driving factor. Instead, it's just a historical artifact of how the ISAs were used.

In other words, yes, there are plenty of ecosystem reasons that these two (and potentially, more) families of chips are better for some things vs. others, but if the two companies swapped their ISAs 30 years ago we might see exactly the same ecosystem just with different instruction formats.


Yeah, ARM's big advantage is that they are willing to make absolutely zero margin on chips, which lets them play both ends of the spectrum. They can make Celeron-tier chips that don't take any power and are practically given away for free, and use that as evidence of ARM's superiority in things like the M1 and supercomputers.

Even as Intel is declining, it still makes way, way more money than ARM does. If Intel wanted to play in the lose-money business, it could make something to kick the pants off of ARM's chips. It just would rather not.


I'm not sure I buy that. x86 chips have targeted laptops for decades, yet they were absolutely walloped by Apple's ARM chips. IIRC a big part of that is that 64-bit ARM has fixed 4-byte instructions whereas x86 instructions can be crazy sizes, which made it easier for Apple to build 8 instruction decoders. At least that's the best theory I've heard for why it is so fast.


We don't know, is the only good answer. We haven't done much trying in the past 10 years.

Intel's Lakefield was doing quite well in the tablet/MID space. It also had the disadvantages of both comparatively ancient Atom-esque cores (far worse than Intel's new E cores) and a massive Skylake core. Oh, and a recycled decade-old desktop iGPU too.

ARM is no longer behind by all that much on single-threaded. On Geekbench an M2 can do 1916 points, a 7950X 2300 points. Slightly bigger gap on Cinebench, 1580 vs 2050. A big part of the gap here is almost certainly the clock speeds being so different.

We just don't know. There are old beliefs we have held, but we had so little evidence for those biases. x86 rarely tried to be really tiny, and had so much to learn if it was to succeed there. ARM rarely tried to be big, and has been learning. There's scant evidence of real limiting factors for either.


These benchmarks suggest ARM has been at single-threaded performance parity on servers since 2020:

https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...

(Apple Silicon blows them away on laptops, of course.)


I imagine some of the remaining gap might be in places where inline ASM or things like SIMD, AVX, etc. exist, where there have been more years and a larger set of people optimizing that ASM for x86-64 servers.


> the remaining gap might be places where inline ASM or things like SIMD, AVX, etc, exist

This was one of the takeaways back in 2017 when Cloudflare sampled Qualcomm's very early Centriq ARM server chips.

> At Cloudflare we use an improved version of the library, optimized for 64-bit Intel processors, and although it is written mostly in C, it does use some Intel specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry, with little effort I adapted the library to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. In the process it is twice as fast as the generic library for some files.

https://blog.cloudflare.com/arm-takes-wing/


> You don't want your phone to run x86 (and it won't for a while)

It's easily forgotten but there were Android phones which used Intel x86 processors, such as the early Asus Zenfones. They didn't stick though.


I think that's the point, though: why didn't they stick? Was it a price/performance/power issue, which would point to ARM being somehow "better" on those dimensions, or was it something incidental, like production availability, or even just perhaps a lack of desire at Intel to aggressively pursue that market?


At least some of the attempts in question had, ahem, interesting ideas for how to combine baseband and AP on one chip.

(x86 has a truly atrocious thing called SMM (“system management mode”). It is not directly exposed to anything except firmware and malware, so it could be changed in backwards-incompatible ways without breaking existing software, but for some reason neither Intel nor AMD seems willing to do so. For desktop and server workloads, this isn’t really a big deal. For something like an iPhone, it would be a major non-selling-point.

Also, x86 interrupt latency is awful, and this is an ISA issue. x86 cannot currently compete in any application where this matters. FRED may improve this if the implementation is good enough.)


You could even get an AMD chip in your XPPhone https://www.umpcportal.com/products/XPPhone/XPPhone :p


The biggest issue is that ARM was already king, so x86 was an exception that wasn't always accounted for.

So there were occasional incompatibilities and no real benefit to balance them.


Having an Oracle Cloud Free Tier ARM VPS, it's surprising how much just works. I think the only thing I couldn't run was Chrome Remote Desktop (yes, I want to remote into my VPS sometimes; for example, it's the easiest way to leave a GUI program running in the background without leaving my PC on), and only a few other things needed extra steps. But it's probably a lot different on desktop, or if you're running different types of programs.


Some libraries have x86 SIMD code but no ARM SIMD code, so when benchmarking real-world use cases you end up comparing SIMD vs scalar code, and x86 is much faster. Server-side libraries for ARM are in a less mature state than for x86.


It sounds like Atom could have found a home in phones.


This would explain part of why Apple hasn't been pushing M2 for the data center. Its chips are a better fit for bursty human workloads, not server workloads.


Apple Silicon chips aren’t for sale outside Apple, and Apple hasn’t made any products relevant to data centers since they terminated the Xserve line as part of the PowerPC -> Intel transition.


The Xserve had Intel Xeon versions, so I'm not sure why the termination of that line would have anything to do with the Intel transition, since it was discontinued three years after the transition was complete.


IIRC one problem was that MacOS was not competitive. I recall some benchmarks of applications like MySQL, comparing MacOS Server to Linux on similar or even identical hardware, with results that could only be described as abysmal on MacOS Server.


They have a CPU that's been labeled some version of fastest or most efficient, they're hungry for more revenue, but somehow have no interest in the data center market? There must be a reason.


I think this link posted above is the precise reason - there are already other ARM vendors in the server market with ballpark-similar performance, with better market placement and better business relationships. Nobody trusts Apple not to dump server again after XServe and nobody thinks they're a super great partner to work with.

https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...

Why would Amazon choose to pay more for an Apple branded product when they could make a higher margin just doing it themselves? Hyperscalers are almost definitionally at the limit where scaling works and they have the volume to amortize some basic uarch work. And it'll be optimized for their exact PPA targets to make the lowest TCO/etc.

Beyond hyperscalers, why would anyone else choose Apple over Ampere, given Apple's general lack of commitment to the server market/open ecosystem/etc? What does the cost look like on this now/in the future? Consider what might happen after the relationship ends, friendly or otherwise - can you keep your business operations going in terms of procurement vs committed instances etc? What if they won't send you any more spares or won't sign some key structure for you (drivers/firmware, UEFI keys, whatever, idk)? Surely anyone involved would want to be super duper sure about this. But Apple has gotten bored and left this market before, and they're always the hot one in the relationship. They can find someone else today if they need.

I just don't think there's much of a market for this that wouldn't rather buy an Ampere or build their own clone like Graviton.


Apple sells its products to its markets, and is mostly uninterested in the data center and embedded market. The support costs are large, and the margins are small-ish.

Which is too bad for the industry, but also lucky for the industry.

I mean, they were the first 64-bit ARM chip, period. In fact, everyone mocked it, but they also trembled in fear because 64-bit.


Data centres care about different things in CPUs. You also don't see consumer Intel CPUs (e.g. Core i7, or even the high-end gamer CPUs / workstation Xeons) in normal data centres, even though they run x86. You do see ARM in data centres, cf. Graviton on AWS, but those are CPUs designed for data centres (lots of cores, lots of memory, etc.).

The big difference is that consumer CPUs care less about virtualisation, high core counts, or having lots of memory, and more about single-core performance and not getting hot in your lap.


> consumer cpus care less about virtualisation, high core counts, or having lots of memory, and more about single core performance and potentially not getting hot in your lap.

This goes with what I said in my original comment.


There's probably a lot of reasons.

The biggest thing is probably that data centers wouldn't want an M2, M2 Pro, M2 Max, or M2 Ultra. Apple would need a specialized chip. Right now, Apple Silicon tops out at 12 cores (8 performance and 4 efficiency). If I'm a data center, I'm going to likely prefer an AMD EPYC with 64 cores (and 128 threads) over an Apple M2 with 8 performance cores. I can slice that AMD EPYC into a lot more VMs.

Apple would realistically need to make a speciality part for the data center. That would mean taking people off its regular products and tasking them on opening up a new product line that Apple has historically been terrible at. Now Apple has fewer people working on iPhone, Mac, etc. and those products suffer in order to try and enter a market that probably isn't a good fit for them. Heck, data centers aren't going to want CPUs soldered to the motherboard.

Not only that, it would mean tasking software people toward the project. How much of your software staff are now trying to get Linux stable on M2 - taking staff away from iOS/macOS?

Apple doesn't want to do it because it would be a big undertaking. It's not just "print some M2s and sell them to Amazon."

Beyond that, it'd probably be a low margin market compared to what they usually go for. With iPhone and Mac, they have huge product differentiation giving them great margins, but they wouldn't for the datacenter. Even if they're the fastest and most efficient, data center customers are looking for performance-per-dollar. Performance per watt matters in the data center, but not nearly as much as it matters in laptops and phones.

The ARM ISA would mean that they'd need to sell at a discount compared to x64 chips (even if they have better performance) because x64 is the path of least resistance for customers. So Apple would need margins lower than Intel/AMD.

Plus, it would mean taking fab capacity away from iPhones and Macs. We've already seen how constrained that capacity can be. It took AMD 2 years to get to 5nm after Apple launched their 5nm processors. If Apple were to become a large data center player, they'd need to figure out how to prioritize that. For example, only the iPhone 14 Pro got the 4nm A16 processor last year - presumably because TSMC's capacity was really limited. All the rumors on 3nm seem to be similarly constrained. Apple isn't going to risk their cash-cow businesses (iPhone, Mac) for a low margin data center business so that would likely mean shipping data center CPUs that were older nodes. Heck, one of the reasons that AMD hasn't taken over the data center is that they've been a bit supply constrained - and Apple would be too.

There are lots of reasons, but it boils down to the fact that Apple would need to build something they don't currently make - a data center CPU, motherboard, case, open boot system so people can run other operating systems, drivers specs and docs for those other operating systems, etc. Apple would be facing a market where the ARM ISA is a negative, margins aren't as good, and customers would be skeptical of a company whose commitment to enterprise and data centers has been terrible. Plus, Apple's performance supremacy wouldn't even be a total positive in the data center since they're going to be looking at performance per dollar and there would be other companies who would accept low margins all competing in that space.

EDIT: I'd also note that Intel's total revenue is $63B and AMD's is $24B and Apple's is $388B. Let's say Apple is wildly successful and gets a server business as large as AMD's. Apple maybe increases its revenue by 3% (assuming that half of AMD's revenue comes from the data center). So when you say that Apple wants revenue, a new server business wouldn't get them that. More likely, Apple's data center business would be 10% the size of AMD and increase Apple's revenue by 0.3%.


> Right now, Apple Silicon tops out at 12 cores (8 performance and 4 efficiency). If I'm a data center, I'm going to likely prefer an AMD EPYC with 64 cores (and 128 threads) over an Apple M2 with 8 performance cores. I can slice that AMD EPYC into a lot more VMs.

I think it's worth calling out how important this is. Once you get past a certain die size and core count, interconnect or "fabric" latency and bandwidth start to have a much bigger impact than core speed and throughput on code not optimized for that processor. At the sizes the M2 sits at, Apple doesn't have to deal with that at all. AMD, on the other hand, has gone all in, hence the chiplet designs. But yeah, Apple wants nothing to do with that, hence they seem to be going for very wide but limited-core-count designs.


Sure, but Apple's doing pretty well there. Think of the Apple silicon as a chiplet. Said chiplet has a 400GB/sec memory bus, 8 performance cores, 4 efficiency cores, and 38 GPU cores (not to mention acceleration for video encoding, matrix multiplication, and AI).

Said chiplet is sold in 1- and 2-chiplet configurations today (the Max and Ultra flavors of the chip) with a very healthy 2.5TB/sec chip-to-chip connection. There's no reason Apple couldn't add some glue to allow more than 2 chiplets in a package.


And they don't support ECC. Wholly unsuitable chips for this very different market.


The chips don't have ECC and have very low maximum memory and aren't suitable for the datacenter.


Apple doesn't see value in going after the server market.


A consumer might replace their laptop as often as once a year, but they don't wake up one day and say "I actually need 3 laptops".

Whereas datacenters are now being built constantly, and any future projection would estimate that we're going to keep building more - and if you were unsure before, then the sudden popularity of training AIs and their massive demand for compute should've convinced you.

Apple aren't going to leave money on the table (otherwise they'd still be shipping iPhones with chargers and including dongles with laptops): if they're not targeting server markets, it's because their internal modelling is telling them that what they've got is at best a peer capability, rather than some vast advantage (or one that would be wiped out easily by another generation of server-grade chip releases).


Apple already failed twice on the server market, A/UX and Xserve, and decided it wasn't for them.

They would have to offer a top option with macOS to make it relevant, as they certainly aren't going to be offering their hardware to run GNU/Linux on top of it.

It happened once, with MkLinux, and that is also something that management won't be keen in repeating.


For what? To run GNU/Linux instead of macOS? Keep wishing for it.

https://en.wikipedia.org/wiki/MkLinux

If it is to have Xserve again, no one cares about it other than companies in the Apple ecosystem; it's not worth the trouble, and they already have Xcode Cloud for that.



