Zen 2 has 8-wide issue in many places, and Ice Lake moves up to 6-wide. Intel/AMD have had 4-wide decode and issue for 10 years, and I'm glad they're moving to wider machines.
Could you explain what you mean by "8-wide decode in many places"? How is that possible? Isn't instruction decoding kinda always the same width, i.e. always 4-wide or always 8-wide, but not sometimes this and sometimes that?
All sources I could find say it is 4-wide, so I'd also be interested if you could perhaps give a link to a source?
The actual instruction decoder is 4-wide. However, the micro-op cache has 8-wide issue, and the dispatch unit can issue 6 instructions per cycle (and can retire 8 per cycle to avoid ever being retire-bound). In practice, Zen 2 generally acts like a 6-wide machine.
Oh, on this terminology: x86 instructions are 1-15 bytes wide (averaging around 3-4 bytes in most code). n-wide decode refers to decoding n instructions at a time.
Thanks for the link! Yeah, those are basically the numbers I also found -- although the number of instructions decoded per clock cycle is a different metric from the number of µops that can be issued, so that feels a bit like moving the goalposts.
But, fair enough, for practical applications the latter may matter more. For an apples-to-apples comparison (pun not intended) it'd be interesting to know the corresponding number for the M1; while it is ARM and thus RISC, one might still expect that there can be more than one µop per instruction, at least in some cases?
Of course then we might also want to talk about how certain complex instructions on x86 can actually require more than one cycle to decode (at least that was the case for Zen 1) ;-). But I think those are not that common.
Ah well, this is just intellectual curiosity, at the end of the day most of us don't really care, we just want our computers to be as fast as possible ;-).
I have usually heard the top-line number as the issue width, not the decode width (so Zen 2 is a 6-wide issue machine). Most instructions run in loops, so the uop cache actually gives you full benefit on most instructions.
On the Apple chip: I believe the entire M1 decode path is 8-wide, including the dispatch unit, to get the performance it gets. ARM instructions are 4 bytes wide, and don't generally need the same type of micro-op splitting that x86 instructions need, so the frontend on the M1 is probably significantly simpler than the Zen 2 frontend.
Some of the more complex ops may have separate micro-ops, but I don't think they publish that. One thing to note is that ARM cores often do op fusion (x86 cores also do op fusion), but with a fixed issue width, there are very few places where this would move the needle. The textbook example is fusing DIV and MOD into one two-input, two-output instruction (the x86 DIV instruction computes both, but the ARM DIV instruction does not).
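To make the fusion idea concrete, here's a toy sketch (the uop tuple format and the fuse_divmod pass are invented for illustration, not any real core's logic) of a pass that merges an adjacent divide/remainder pair with the same operands into one two-output uop:

    # Toy illustration of op fusion: merge adjacent DIV/MOD uops that share
    # source operands into a single two-output "DIVMOD" uop. The uop format
    # and the pass are made up, not any real core's design.
    def fuse_divmod(uops):
        fused, i = [], 0
        while i < len(uops):
            cur = uops[i]
            nxt = uops[i + 1] if i + 1 < len(uops) else None
            if (nxt is not None
                    and {cur[0], nxt[0]} == {"div", "mod"}
                    and cur[2:] == nxt[2:]):          # same source operands
                # one uop now produces both the quotient and the remainder
                fused.append(("divmod", (cur[1], nxt[1])) + cur[2:])
                i += 2
            else:
                fused.append(cur)
                i += 1
        return fused

    # (op, dest, src1, src2)
    uops = [("div", "r3", "r1", "r2"), ("mod", "r4", "r1", "r2"), ("add", "r5", "r3", "r4")]
    print(fuse_divmod(uops))
    # -> [('divmod', ('r3', 'r4'), 'r1', 'r2'), ('add', 'r5', 'r3', 'r4')]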
x86 isn't a fixed-width instruction set. Depending on the mix, you may be able to decode more instructions. And if you target common instructions, you can get a lot of benefit in real-world programs.
Arm is different but probably easier to decode. So you can widen the decoder.
This I think is the real answer; for a long time people were saying that "CISC is just compression for RISC, making a virtue of necessity", but it seems like the M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts (and given exclusive access to the world's best manufacturing, TSMC 5nm).
Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial to decode ISA does better.
You have a source for that? The first google result I found for research on that shows it as denser than almost every RISC ISA [1]. It’s just one study and it predates ARM64 fwiw though.
That paper uses no actual benchmarks, but rather grabbed a single system utility and then hand-optimized it; SPEC and geekbench show x86-64 comes in well over 4 bytes on average.
Sure, I never claimed it to be the be-all-end-all, just the only real source I could find. Adding "SPEC" or "geekbench" didn't really help.
Doing a little more digging, I have also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.
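If you want a rough number for your own binaries, here's a quick sketch (assuming you have the capstone disassembler installed and a raw blob of x86-64 code bytes to feed it; note this is a static average, not weighted by how often each instruction executes):

    # Rough sketch: average static x86-64 instruction length of a raw code blob.
    # Assumes `pip install capstone` and that code_bytes holds the .text bytes
    # of some binary; not weighted by dynamic execution counts.
    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    def average_length(code_bytes, base_addr=0x1000):
        md = Cs(CS_ARCH_X86, CS_MODE_64)
        sizes = [insn.size for insn in md.disasm(code_bytes, base_addr)]
        return sum(sizes) / len(sizes) if sizes else 0.0

    # tiny example function: push rbp; mov rbp, rsp; mov eax, 42; leave; ret
    code_bytes = bytes.fromhex("554889e5b82a000000c9c3")
    print(average_length(code_bytes))  # 2.2 bytes over 5 instructions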
Bytes per instruction doesn't really say anything useful for code density when talking about RISC vs. CISC though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three instruction CISC routine at five bytes each is still a win over a four instruction RISC routine at four bytes each. Overall code size is what actually matters.
OK, I could see how one could implement a variable-width instruction decoder (e.g. "if there are 8 one-byte instructions in a row, handle them, otherwise fall back to 4-way decoding" -- of course a much more sophisticated approach could be taken).
But is this actually done? I honestly would be interested in a source for that; I just searched again and could find no source supporting this (but of course I may simply not have used the right search terms, I would not be surprised by that in the least). E.g. https://www.agner.org/optimize/microarchitecture.pdf#page216 makes no mention of this and only covers AMD Zen (version 1; it doesn't say anything about Zen 2/3).
I did find various sources which talk about how many instructions / µops can be scheduled at a time, and there it may be 8-way, but that's a completely different metric, isn't it?
As a historical note, the Pentium P6 uses an interesting approach. It has three decoders but only one of them can handle "complex macroinstructions" that require micro-operations from the ROM. If a limited-functionality decoder got the complex instruction, the instruction gets redirected to another decoder the next cycle.
As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.
> As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.
And how fast is that able to run on x86? How many instructions can that process at once, compared to an alternate universe where that circuit has the same transistor and time budget but only has to look at the first four bits of an instruction?
So Intel and AMD are capable of building a chip like this, but the ambitious size meant it was more economically feasible for Apple to build it themselves?
Neither Intel nor AMD is capable of doing so, for a very basic reason: there is no market for it. You can't just release a CPU for which there is no operating system.
Apple can pull it off because they already own the entire stack, from hardware to operating system to cloud services, and they can swap out a component like the CPU for a different architecture and release a new version of the OS that supports it.
By creating a new CPU, Apple replaces a part of the stack that was owned by Intel with their own, which only strengthens their position even if it didn't improve performance at all.
Apple is invulnerable to other companies copying the CPU and creating their own, because they are not really competition here. Apple sells an integrated product of which the CPU is just one component.
That's not entirely true. Windows ARM64 can execute natively on the M1 (through QEMU for hardware emulation, but the instructions execute natively). Intel/AMD could produce an ARM processor that could find a market. They also have a close partnership with Microsoft and I have to believe there would be a path forward there. They could also target Linux.
I haven't seen enough evidence yet, though, that ARM is the reason the M1 performs so efficiently. It may just be the fact that it is on a cutting-edge 5nm process with a SoC design. I'm not even sure the PC/Windows market would adopt such a chip, since it lacks upgradability. It's really nice to be able to swap out RAM and/or the GPU. Heck, AMD has even been retaining backwards compatibility with older motherboards, since it's been on one socket for a while.
I think for laptops/mobile this makes a lot of sense. For a desktop though I honestly prefer AMD's latest Ryzen Zen 3 chips.
> It may just be the fact that it is on a cutting edge 5nm process with a SoC design.
Yup. It's fast because it's got short distances to memory and everything else. Shorten the wire to the memory cells and not only can you make signaling faster and run the memory at a faster clock speed, but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power. Using shorter paths to memory also lets you use lower voltages, which means less waste heat, less need to spend effort on cooling, and overall power savings for the chip.
Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.
There's a reason that manufacturers used to be able to "speed" up a chip by just doing a die shrink - photographically reducing chip masks to make them smaller, which also made them faster with relatively small amounts of work.
As the late Adm. Grace Hopper put it, there are ever so many picoseconds between the two ends of a wire.
> Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.
A maximum of a few nanoseconds. Not much in comparison to an overall memory system latency.
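Back-of-envelope, with numbers I'm just assuming for scale (~15 cm/ns propagation in a PCB trace, ~80 ns for a full DRAM access):

    # Rough back-of-envelope: how much latency does shortening the trace save?
    # Assumed ballpark figures: signal propagation in FR-4 is roughly 15 cm/ns,
    # and a full DRAM access is on the order of 80 ns.
    trace_shortened_cm = 5            # say the DRAM moves 5 cm closer
    prop_speed_cm_per_ns = 15
    dram_latency_ns = 80

    saved_ns = 2 * trace_shortened_cm / prop_speed_cm_per_ns  # round trip
    print(f"saved ~{saved_ns:.2f} ns, ~{100 * saved_ns / dram_latency_ns:.1f}% of DRAM latency")
    # saved ~0.67 ns, ~0.8% of DRAM latency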
> Shorten the wire to memory cells and not only can you make signaling faster and run the memory at faster clock speed but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power.
You cannot run away from that with just shorter PCB distances. The circuitry for link training is mandated by the standard.
You will need a redesigned memory standard for that.
Until the late 90s, on-chip wire delays were something we just didn't care much about; speed was limited by gate capacitance, and we got speedups when we shrunk the gate sizes on transistors. After the mid 90s, RC delays in wires started to matter (not speed-of-light delays, but how fast you can shuffle electrons in there to fill up the C). Soon after it got worse, because wire RC delays don't scale perfectly with shrinks due to edge effects. This was addressed in a number of ways: high-speed systems reduced the R by switching from Al wires to Cu, and tools got better able to model those delays and synthesize and do layout at (almost) the same time.
> Intel/AMD could produce an ARM processor that could find a market.
Intel did have an ARM processor line, and it did have a market. They acquired the StrongARM line from Digital Equipment and evolved it into the XScale line. What Intel didn't want was for something to eat into its x86 market, and Windows/ARM didn't exist. So they evolved ARM in a different direction than Apple later did. It was very successful in the high-performance embedded market.
> "Apple can pull it off because they already own entire stack from hardware to operating system to cloud services and the can swap out a component like CPU for a different architecture and release new version of OS that supports it."
Note that this is the same model that Sun Microsystems, DEC, HP, etc. had and it didn't work out for them.
I'd venture to say that it currently only works out for Apple because Intel has stumbled very, very badly and TSMC has pulled ahead in fabbing process technology. If Intel manages to get back on its feet with both process enhancements and processor architecture (and there's no doubt they've had a wake up call), this strategic move could come back to bite Apple.
Without Linux, they would've lasted longer but still would've lost out on price/performance against x86 and Intel's tick-tock cadence well before Intel's current stumble. We might all have wound up running Windows Server in our datacenters.
I don't understand: how do these low-level changes impact the OS exactly, assuming the ISA remains the same? It doesn't seem much more impactful than SSE/AVX and friends, i.e. code not specifically optimized for these features won't benefit, but it'll still work, and people can incrementally recompile/reoptimize for the new features.
After all that's pretty much how Intel has operated all the way since the 8086.
It's not like Itanium where everything had to be redone from scratch basically.
Are you referring to Apple's laptop x86 -> ARM change? Entertaining the idea that the ISA is significant here: Surely there would be a big market for ARM chips in the Android and server sides too, so this shouldn't be the only reason why other vendors aren't making competitive ARM chips. Apple's laptop volumes aren't that big compared to those markets.
And of course you have to factor in the large amount of pain that Apple is imposing on its user and ISV base in addition to the inhouse cost of switching out the OS and supporting two architectures for a long time in parallel. A vendor making chips for Android or servers wouldn't have to bear that.
Donald Knuth said "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."[82]
Of course there were Itanic-targeting compilers; they worked, just not well enough to deliver on the marketing promise (edit: and on what the hardware was theoretically capable of).
Compilers existed just fine to do the porting, and solved that problem.
Intel's failure is that they were unable to solve a different problem because that compiler didn't exist, one that went well beyond merely porting.
In other words, "That's what compilers are for." is a perfectly fine attitude when those compilers exist, and a bad attitude when they don't exist. Porting is the former, making VLIW efficient is the latter.
It's not that it was more economical, but that at least some of these things AMD and Intel would not benefit from due to the ISA: x64 instructions can be up to 15 bytes, so just finding 8 instructions to decode would be costly, and I assume Intel and AMD think it would cost more than the gains from more decoders (you couldn't keep them fed enough to be worth it, basically).
I can't comment on the economics of it but I can comment on the technical difficulties. The issue for x86 cores is keeping the ROB fed with instructions - no point in building a huge OoO if you can't keep it fed with instructions.
Keeping the ROB full falls on the engineering of the front-end, and here is where CISC v RISC plays a role. The variable length of x86 has implications beyond decode. The BTB design becomes simpler with a RISC ISA, since a branch can only lie at certain fixed offsets within a fetched instruction cache line (not so with CISC). RISC also makes other aspects of BPU design simpler - but I digress. Bottom line: Intel and AMD might not have a large ROB due to inherent differences in the front-end which prevent larger ROBs from being fed with instructions.
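A toy way to see the BTB point (the 64-byte line and the 4-byte instruction width are just assumed for illustration):

    # Toy illustration: how many byte offsets in a fetched cache line could be
    # the start of a branch? Fewer possible slots means fewer offset bits per
    # BTB entry. Line size and instruction width are assumed values.
    import math

    line_bytes = 64
    for name, granularity in [("fixed 4-byte RISC", 4), ("x86, any byte", 1)]:
        slots = line_bytes // granularity
        print(f"{name}: {slots} possible branch-start slots, "
              f"{int(math.log2(slots))} offset bits per BTB entry")
    # fixed 4-byte RISC: 16 possible branch-start slots, 4 offset bits per BTB entry
    # x86, any byte: 64 possible branch-start slots, 6 offset bits per BTB entry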
(Note that CISC definitely does have its advantages - especially in large code-footprint server workloads, where the dense packing of instructions helps - but it might be hindered in typical desktop workloads.)
Source: I've worked in front-end CPU micro-architecture research for ~5 years
How do you feel about RISC-V compact instructions? The resulting code seems to be 10-15% smaller than x86 in practice (25-30% smaller than aarch64) while not requiring the weirdness and mode-switching associated with thumb or MIPS16e.
Has there actually been much research into increasing instruction density without significantly complicating decode?
Given the move toward wide decoders, has there been any work on the idea of using fixed-size instruction blocks and Huffman encoding?
I can't really comment on the tradeoffs between specific ISAs since I've mainly worked on micro-arch research (which is ISA agnostic for most of the pipeline).
As for the questions on research into the decode complexity v instruction density tradeoff - I'm not aware of any recent work, but you've got me excited to go dig up some papers now. I suspect any work done would be fairly old - back in the days when ISA research was active. Similar to compiler front-end work (think lex, yacc, grammars, etc.), ISA research is not an active area currently. But maybe it's time to revisit it?
Also, I'm not sure if Huffman encoding is applicable to a fixed-size ISA. Wouldn't it be applicable only in a variable size ISA where you devote smaller size encoding to more frequent instructions?
Fixed instruction block was referring to the Huffman encoding. Something like 8 or 16kb per instruction block (perhaps set by a flag?). Compilers would have to optimize to stay within the block, but they optimize for sticking in L1 cache anyway.
Since we're going all-in on code density, let's go with a semi-accumulator 16-bit ISA: 8 bits for the instruction, 8 bits for registers (with 32 total registers). We'll split the register byte into 5 bits and 3 bits. 5 bits gives access to all 32 registers, since quite a few are either read-only (zero register, flag register) or written only occasionally (stack pointer, instruction counter). The remaining 3 bits specify the 8 registers that can be the write target. There will be slightly more moves, but that just means that moves compress better, and it seems like it should encourage certain register patterns being used more frequently, which is also better for compression.
We can take advantage of having 2 separate domains (one for each byte) to create 2 separate Huffman trees. In the worst case, it seems like we increase our code size, but in more typical cases where we're using just a few instructions a lot and using a few registers a lot, the output size should be smaller. While our worst-case lookup would be 8 deep, more domain-specific lookup would probably be more likely to keep the depth lower. In addition, two trees means we can process each instruction in parallel.
As a final optimization tradeoff, I believe you could do a modified Huffman that always encodes a fixed number of bits (e.g. 2, 4, 6, or 8), which would halve the theoretical decode time at the expense of an extra bit on some encodings. It would be +25% for a 3-bit encoding, but only 16% for a 5-bit encoding (perhaps step 2, 3, 4, 6, 8). For even wider decode, we could trade off a little more by forcing the compiler to ensure that each Huffman encoding breaks evenly every N bytes, so we can set up multiple encoders in advance. This would probably add quite a bit to compile time, but would be a huge performance and scaling advantage.
Immediates are where things get a little strange. The biggest problem is that immediate values are basically random, so inlining them messes up the encoding, while moving them out of line messes with data fetching instead. The best solution seems to be replacing the 5-bit register address with either 5 bits of data or 6 bits (one implied) of jump immediate.
Never gave it too much thought before now, but it's an interesting exercise.
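Here's a rough sketch of the two-tree idea (everything here is hypothetical and matches the made-up ISA above: one Huffman code for the opcode byte, one for the register byte, built from whatever static frequency counts the compiler gathers):

    # Hypothetical sketch of the "two Huffman trees" idea: one code for the
    # opcode byte, one for the register byte. The frequencies are made-up
    # stand-ins for static counts a compiler would gather.
    import heapq

    def huffman_code(freqs):
        """Return {symbol: bitstring} for a {symbol: count} table."""
        heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)
            c2, _, right = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in left.items()}
            merged.update({s: "1" + code for s, code in right.items()})
            heapq.heappush(heap, (c1 + c2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    opcode_freqs = {"mov": 40, "add": 20, "ld": 15, "st": 10, "br": 10, "mul": 5}
    reg_freqs    = {"r0": 35, "r1": 25, "r2": 20, "r3": 10, "sp": 6, "r5": 4}

    op_code  = huffman_code(opcode_freqs)
    reg_code = huffman_code(reg_freqs)

    def encode(instrs):
        # two separate bit streams, one per tree, so the opcode half and the
        # register half of each instruction can be decoded in parallel
        op_bits  = "".join(op_code[op] for op, _ in instrs)
        reg_bits = "".join(reg_code[reg] for _, reg in instrs)
        return op_bits, reg_bits

    op_bits, reg_bits = encode([("mov", "r0"), ("add", "r1"), ("ld", "r2")])
    print(op_bits, reg_bits, f"-> {len(op_bits) + len(reg_bits)} bits vs {3 * 16} uncompressed")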
Not necessarily. Samsung used to make custom cores that were just as large if not larger than Apple’s (amusingly the first of these was called M1).
Unfortunately, Samsung’s cores always performed worse and used significantly more power than the contemporary Apple cores.
Apple’s chip team has proven capable of making the most of their transistor budget, and there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.
> there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.
From what I have seen, the only difference in efficiency is the manufacturing process. The M1 consumes about as much power per core as a Ryzen core. AMD also has a mobile chip with 8 homogeneous cores that has around the same TDP as the M1.
Apple’s efficiency is based on a very wide and deep core that operates at a somewhat lower clock speed. Frequent branches and latency for memory operations can make it difficult to keep a wide core fully utilized. Moreover, wider cores generally cannot clock as high. That’s why Intel and AMD have chosen to pursue narrower cores that can clock near 5 GHz.
The maximum ILP that can be extracted from code can be increased with better branch prediction accuracy, larger out of order window size, and more accurate memory disambiguation: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... The M1 appears to have made significant advances in all three areas, in order for it to be able to keep such a wide core utilized.
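A toy limit study along those lines (idealized machine: single-cycle latency, perfect branch prediction and memory disambiguation; the dependence trace and window sizes below are made up). Each cycle it issues every instruction that sits inside the window of the W oldest not-yet-retired instructions and whose producers have already finished:

    # Toy ILP limit study: each cycle, issue every instruction inside a window
    # of the W oldest not-yet-retired instructions whose producers have already
    # finished. Single-cycle latency, perfect prediction/disambiguation assumed.
    # The dependence trace is invented for illustration.
    def ilp(deps, window):
        n = len(deps)
        finish = [None] * n          # cycle in which instruction i completes
        retired = 0
        cycle = 0
        while retired < n:
            cycle += 1
            for i in range(retired, min(retired + window, n)):
                if finish[i] is None and all(
                        finish[d] is not None and finish[d] < cycle for d in deps[i]):
                    finish[i] = cycle
            while retired < n and finish[retired] is not None:
                retired += 1      # in-order retire frees window slots
        return n / cycle

    # deps[i] = indices of earlier instructions that instruction i depends on
    deps = [[], [0], [0], [1, 2], [], [4], [3, 5], [], [7], [6, 8]]
    for w in (2, 4, 8, 16):
        print(f"window {w:2}: IPC ~= {ilp(deps, w):.2f}")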
What you write makes sense but it does not address why AMD and Intel could not do the same "even if they had the same process, ISA, and area to work with."
Why will Apple always out-compete Intel and other non-vertically-integrated systems? Margins, future potential, customer relationship, and a compounding growth/virtuous cycle.
The margins logic is simple: iPhones and MacBooks make tons more money per unit than a CPU does. Imagine that improving the performance of a CPU by 15% increases demand by 1%. For a chip vendor that's 1% more demand for a CPU; for Apple, it's 1% more demand for the whole iPhone or MacBook. For this reason alone, Apple can invest 2-5x more R&D into their chips than everyone else.
The future potential logic is more nuanced:
1. Intel's/whoever's 10-year vision is to build a better CPU/GPU/RAM/Screen/Camera, because their customers are the companies buying CPUs/GPUs/Screens/Cameras/RAM. They are focused on the metrics the market has previously used to measure success and build to optimize for those metrics, e.g. performance per dollar. Intel doesn't pay for the electricity in the datacenter, nor does it field its customers' complaints about battery life. RAM manufacturers aren't looking at Apple's products and asking, "do consumers even still replace RAM?" i.e. they are focused on "micro"-trends.
2. Apple's vision is to build the best product for customers. They look at "macro"-trends into the future and apply their personal preferences at scale. For example, do people even still need replaceable RAM? Will they want 5G in the future, or can we improve the technology to replace it with direct connections to a LEO satellite cluster?
The customer relationship logic:
Let's take one such example of a macro-trend: VR and other wearables. Apple is tracking these trends and can "bet on" them because it's in full control, but Nvidia, Intel, etc. typically don't want to "bet on" them, because even if they are fully invested, their partners (which sell to consumers) might back out. Apple also isn't really "betting", because it has a healthy group of early adopters that trust Apple and will buy and try it even though a "better" product in the same market segment goes unpurchased. Creating/retaining that customer relationship lets Apple over-invest in keeping heat (i.e. power) low, because it's thinking about the whole market segment that Apple's VR headset can start to compete in and collect more revenue from.
Compounding growth/virtuous cycle logic is also relatively simple:
Improving the metrics in any of these 3 previous pillars in turn improves the other pillars, i.e. a better customer relationship increases cash flow, which increases R&D funding, which either 1. improves the product, improving the customer relationship, or 2. reduces costs, increasing margins, and loops back to increasing cash flow.
Windows only targets a single architecture, so they can't really deviate from that. Sure, Windows can switch (or, apparently, run on ARM), but because Windows applications are generally distributed as binaries, lots of apps wouldn't work.
Linux users would have far fewer issues, and would be a great clientele for a chip like this, but it's probably too niche a market, sadly.
Pragmatically, Windows runs on a single architecture.
Sure, there have been editions for other architectures, but they're more anecdotal experiments than something usable.
I can go out and buy several weird ARM or PPC devices and run Linux or OpenBSD on them, and run the same stuff I use on my desktop regularly (except Steam).
The fact that Windows relies on a stable ABI is its major anchor (while Linux only guarantees a stable API).
> they're more anecdotal experiments than something usable
Wrong. Microsoft explicitly set out multi-architecture support as a design goal for NT. MIPS was the original reference architecture for NT. Microsoft even designed their own MIPS based systems to do it (called 'Jazz'). There was a significant market for the Alpha port, especially in the database arena, and it was officially supported through Windows 2000. They were completely usable, production systems sold in large numbers.
In the end, the market didn't buy into the vision. The ISVs didn't want to support 3+ different archs. Intel didn't like competition. The history is all pretty well documented should one take the time to learn it.
Except they did, though apparently you missed it. MIPS was the original port. Alpha was supported from NT 3.1 through Windows 2000, and only died because DEC abandoned the Alpha, not because Microsoft abandoned Alpha (it was important to their 64-bit strategy). Itanium was supported from Windows Server 2003 to 2008 R2. Support for Itanium only ended at the beginning of this year, once again because the manufacturer abandoned the chip.
I'm sure you can redefine "achieve" to exclude almost 17 years of support (for Itanium), if you're that committed to being right. Heck, x86-64 support has "only" been around for 20 years or so. Doesn't make it right.
Well, for some things. FX!32 for the apps people wanted, though, was deficient. The NT 3.1-era Alphas didn't have byte-level memory operations, so things like Excel, Word, etc. all ran terribly, as did Emacs and X. I supported a lab of Alphas running Ultrix, and they were dogs for anything interactive and fantastic for anything that was a floating-point application.
x86 instructions are variable length with byte granularity, and the length isn’t known until you’ve mostly decoded an instruction. So, to decode 4 instructions in parallel, AIUI you end up simultaneously decoding at maybe 20 byte offsets and then discarding all the offsets that turn out to be in the middle of an instruction.
So the Intel and AMD decoders may well be bigger and more power hungry than Apple’s.
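As a conceptual sketch of that brute-force approach (the ISA here is invented so the example stays short: the first byte alone gives a 1/2/3-byte length; real x86 has to chew through prefixes, opcode, and modrm before the length is known):

    # Conceptual sketch of speculative parallel length decode. The toy ISA is
    # made up (first byte alone determines a 1/2/3-byte length); real x86 is
    # far messier, which is exactly the point.
    def toy_length(byte):
        return 1 if byte < 0x80 else (2 if byte < 0xC0 else 3)

    def decode_window(code, width=8):
        # step 1: "in parallel", compute a tentative length at EVERY byte offset
        lengths = [toy_length(b) for b in code]
        # step 2: starting from the one known instruction boundary (offset 0),
        # chain the lengths and keep only the offsets that are real starts
        starts, pos = [], 0
        while pos < len(code) and len(starts) < width:
            starts.append(pos)
            pos += lengths[pos]
        return starts   # work done at every other offset gets thrown away

    code = bytes([0x01, 0x90, 0x11, 0xC5, 0x22, 0x33, 0x44, 0x02, 0xB0, 0x07])
    print(decode_window(code))   # [0, 1, 3, 6, 7, 8]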
But in one x86 instruction you often have more complex operations. Isn't that part of the reason why Sunny Cove has only 4 wide decode but still the decoders can yield 6 micro-ops per cycle? That single stat makes it look worse than it is in reality, I think.
The whole principle of CISC (v RISC) is that you have more information density in your instruction stream. This means that each register, cache, decode unit, etc. is more effective per unit area & time. Presumably, this is how the x86 chips have been keeping up with fewer elements in terms of absolute # of instructions optimized for. The obvious trade-off being the decode complexity and all the extra area that requires. One may argue that this is a worthwhile trade-off, considering the aggregate die layout (i.e. one big complicated area vs thousands of distributed & semi-complicated areas) and economics of semiconductor manufacturing (defect density wrt aggregate die size).
Except that the RISC-V ISA manages to reach information density on par with x86 via a simple, backwards-compatible instruction compression scheme. It eats up a lot of coding space, but they've managed to make it work quite nicely. ARM64 has nothing like that; even the old Thumb mode is dead.
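And crucially, the compressed encoding keeps length determination trivial: ignoring the reserved longer formats, the bottom two bits of the first 16-bit parcel tell you whether an instruction is 16 or 32 bits, so a RISC-V length decoder is basically one line (contrast with the x86 case above):

    # RISC-V with the C extension: if the low two bits of the first 16-bit
    # parcel are not 0b11, it's a 16-bit compressed instruction; otherwise
    # it's a standard 32-bit one (reserved longer encodings ignored here).
    def rv_insn_length(first_halfword):
        return 2 if (first_halfword & 0b11) != 0b11 else 4

    print(rv_insn_length(0x4501))  # c.li a0, 0                 -> 2 bytes
    print(rv_insn_length(0x0513))  # low half of addi a0, x0, 0 -> 4 bytes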
> The M1 is really wide (8 wide decode)
In contrast to x86 CPUs which are 4 wide decode.
> It has a huge 630 deep reorder buffer
By comparison, Intel Sunny/Willow has 352.