The two extreme cases still existed in that era. One extreme was the DEC VAX. The instruction set is complex, convenient, high level, and slow. The other extreme was the original IBM 801, which led to the IBM POWER architecture. In its pure form, it was one instruction per clock, had lots of registers, and was quite simple. MIPS went down that road in a big way.
Then CISC microprocessors became superscalar, and started executing more than one instruction per clock. Now RISC machines were behind in speed. So they had to become superscalar. That killed the simplicity. There was no longer any real point to pure RISC instruction sets.
(The author mentions the Itanium. That existed mostly because Intel wanted a patentable technology others couldn't clone. It was very original, and not in a useful way.)
The Itanium was an interesting design and it wasn't even Intel's originally; Intel got in on HP's design, and HP was trying to leapfrog the superscalar designs by making parallelism the responsibility of software, akin to how the MIPS had made handling aggressive pipelining the responsibility of software: The Itanium design was to encode multiple instructions in very long instruction words, where all of the instructions in a given word can be executed at once. This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.
A problem with that was, apparently, that figuring out good parallelism statically wasn't enough to get the performance gains Itanium needed to be competitive.
One of the big headaches with x86 superscalar machines is that finding the instruction boundaries is hard. The decoders, looking ahead, may guess at alignment and decode bytes which represent a totally bogus instruction, which causes work that gets discarded further downstream. Intel and AMD did this very differently in the early days of superscalar.
If you're going to have a variable length instruction set, it would be nice if it worked like UTF-8, where you can start anywhere in a stream and get aligned within a few bytes. x86 is not like that.
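To make that concrete, here's a rough C sketch of the UTF-8 property being described (purely illustrative, not real decoder code): continuation bytes always look like 10xxxxxx, so from any starting offset you skip at most three of them before landing on a character boundary. x86 machine code carries no such marker, so a decoder that starts mid-instruction can stay misaligned indefinitely.

```c
#include <stdio.h>

/* Skip UTF-8 continuation bytes (0b10xxxxxx) until we land on a lead
 * byte.  Since no encoded character is longer than 4 bytes, this
 * resynchronizes within at most 3 bytes from any starting offset. */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0) == 0x80)  /* 10xxxxxx => continuation */
        pos++;
    return pos;  /* now at a lead byte (or the end of the buffer) */
}

int main(void)
{
    /* "héllo, wörld" as raw UTF-8 bytes: é = C3 A9, ö = C3 B6 */
    const unsigned char text[] = "h\xC3\xA9llo, w\xC3\xB6rld";
    for (size_t start = 0; start < 4; start++)
        printf("start at byte %zu -> next boundary at byte %zu\n",
               start, utf8_resync(text, sizeof text - 1, start));
    return 0;
}
```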
Except none of the shipping big.LITTLE implementations are "RISC" under the framework of the article, though. ARM is very definitely post-RISC and has very CISC-y features. Apple's versions go even farther and include x86 compatibility modes, very CISC-y. And of course Intel's x86 is never accused of being RISC, either.
And really the entire reason big.LITTLE even exists at all is because the cores got too large and complex. Or, alternatively, they got too "CISC-y". So big.LITTLE is therefore about shipping a "CISC" and "RISC" core on the same piece of silicon really.
> So big.LITTLE is therefore about shipping a "CISC" and "RISC" core on the same piece of silicon really.
You're mixing up ISA and implementation. To the extent that RISC / CISC has any meaning at all anymore, it's a property of the instruction set. RISC-V, which makes a big deal out of being RISC, is after all just an ISA. And I take issue with the original article on this too.
These things have nothing to do with CISC vs. RISC. Apple's special x86 compatibility is about the TSO memory model, which, though quite strong and far from typical RISC-like approaches to concurrency (the Alpha memory model was even abandoned in more recent architectures as too weak), is not especially 'CISC' either.
>"The Itanium design was to encode multiple instructions in very long instruction words, where all of the instructions in a given word can be executed at once.
I believe you are describing VLIW architecture here? Is that correct?
>"This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can."
Interesting. How exactly does a VLIW architecture remove the need for reordering? Is it just that any instructions in a word automatically mean there are no dependencies in that long instruction? Was that the original intention of VLIW?
I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
The answer is: yes
Itanium failed because getting instruction parallelism is actually incredibly hard to do and compilers didn't catch up in time to make it matter. But it also failed because of AMD64 which had backwards compat, was cheaper by a lot, and was a lot easier to get decent perf out of.
Opinion: I think Itanium was always destined to fail. The idea seems good on paper but fails in practice because it limits what the hardware can do to achieve speed without breaking backwards compat with itself. You can't have an out-of-order Itanium by definition, because instruction scheduling, speculation, and the like are by design delegated to the compiler. This means that the chips can really only go up via clock speed, which as we now know doesn't scale forever.
> and compilers didn't catch up in time to make it matter.
Yes. I once saw an EE380 talk at Stanford by the HP team working on the Itanium compiler. They had to solve a complicated minimization problem for each block of instructions. It wasn't going well. Branch prediction decisions have to be made at compile time. Intel has built compilers where you feed tracing data back into the compiler to improve prediction, but that never caught on.
> MIPS
MIPS compilers at one time had lots of flags for telling the compiler what specific MIPS model to target. All models had the same instruction set, but different numbers of functional units, which affected the optimal code order for each model. Software vendors were supposed to provide different executables for each model. That did not go over well.
> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
It's closer to the truth to say that Itanium caused the world to become largely x86. In the 90s, while the personal computing market may have been dominated by x86, the workstation and server markets (which were the segment Itanium was targeting) were definitely a lot broader and more competitive. HP canceled its own architecture line to focus on Itanium, and a couple of other architectures saw their roadmaps dwindle because of it.
Wikipedia lists these modifications EPIC makes to the basic VLIW concept:
> Each group of multiple software instructions is called a bundle. Each of the bundles has a stop bit indicating if this set of operations is depended upon by the subsequent bundle. With this capability, future implementations can be built to issue multiple bundles in parallel. The dependency information is calculated by the compiler, so the hardware does not have to perform operand dependency checking.
> A software prefetch instruction is used as a type of data prefetch. This prefetch increases the chances for a cache hit for loads, and can indicate the degree of temporal locality needed in various levels of the cache.
> A speculative load instruction is used to speculatively load data before it is known whether it will be used (bypassing control dependencies), or whether it will be modified before it is used (bypassing data dependencies).
> A check load instruction aids speculative loads by checking whether a speculative load was dependent on a later store, and thus must be reloaded.
also:
> How exactly does a VLIW architecture remove the need for reordering? Is it just that any instructions in a word automatically mean there are no dependencies in that long instruction? Was that the original intention of VLIW?
That's exactly right: By putting the opcodes in the same word (and/or, in the case of EPIC, in a word subsequent to a previous word without the stop bit set) the entity generating the instruction stream is guaranteeing to the hardware that those opcodes can run in parallel with no problems.
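Here's a toy C model of that contract (this only captures the idea of issue groups and stop bits; it is nothing like the actual Itanium bundle encoding): the "hardware" executes every op up to a stop bit as one group, with no dependency checking at all, because the "compiler" that built the groups promised they are independent.

```c
#include <stdio.h>

typedef struct {
    int dst, src1, src2;   /* register numbers for a made-up ADD op    */
    int stop;              /* 1 = this op ends the current issue group */
} op_t;

static int regs[8] = {0, 1, 2, 3, 4, 5, 6, 7};

/* "Parallel" execution of one issue group: read all sources first, then
 * write all destinations.  This is only correct because whoever built
 * the group guaranteed the ops are independent -- no checking here. */
static void issue_group(const op_t *ops, int n)
{
    int results[8];                      /* assumes groups of at most 8 ops */
    for (int i = 0; i < n; i++)
        results[i] = regs[ops[i].src1] + regs[ops[i].src2];
    for (int i = 0; i < n; i++)
        regs[ops[i].dst] = results[i];
}

int main(void)
{
    op_t code[] = {
        {3, 1, 2, 0},   /* r3 = r1 + r2  } same issue group            */
        {4, 5, 6, 1},   /* r4 = r5 + r6  } stop bit ends the group     */
        {7, 3, 4, 1},   /* r7 = r3 + r4  -- depends on the group above */
    };
    int n = sizeof code / sizeof code[0], start = 0;
    for (int i = 0; i < n; i++)
        if (code[i].stop) { issue_group(&code[start], i - start + 1); start = i + 1; }
    printf("r7 = %d\n", regs[7]);  /* (1+2) + (5+6) = 14 */
    return 0;
}
```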
> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
As others said, it was a mix, and, interestingly, the first few Itanium processor generations had a hardware x86 unit to provide compatibility, albeit one that executed x86 code at the speed of a 100 MHz Pentium on a 667 MHz Itanium part. Intel later commissioned software translation, which was actually faster.
> shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.
I think the big lesson since, well, the LISP machine is that this doesn't really work. What does work is letting the programming languages evolve driven by user needs and back-filling technology to achieve the desired performance.
The average user performance experience on a webpage is standing on the shoulders of a lot of giants, and any performance increase that relies on climbing all the way down and then back up a different stack simply isn't going to get delivered. What people want is faster sequential execution at almost any cost.
There's also an important detail that the language can only know about instruction ordering statically, while the CPU can reorder instructions at runtime depending on exactly what's ready.
> This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.
I think this was proved wrong - the opposite is true; it is incredibly difficult for software to predict the internal state of a CPU at compile time. The silicon is truly the only thing with an accurate account of its internal state (cache, register renaming, etc.).
> Then CISC microprocessors became superscalar, and started executing more than one instruction per clock. Now RISC machines were behind in speed. So they had to become superscalar. That killed the simplicity. There was no longer any real point to pure RISC instruction sets.
I thought DEC Alpha was always ahead in speed. At least until it got bought and mostly abandoned by Compaq and then HP. Or is Alpha insufficiently RISC?
The Alpha was RISC, and, as seen on page 14 of this very long, very interesting set of slides, it eventually got passed in raw MHz by x86 chips (specifically, the 1 GHz AMD Athlon vs the 667 MHz Alpha 21164 around the year 2000), but you can see it was a close race.
>"The instruction set is complex, convenient, high level, and slow."
Could you elaborate on how the DEC VAX ISA was complex yet also convenient? I feel like those two characteristics are at odds with each other. Or do complexity and convenience refer to different aspects, i.e. implementation vs. use?
Convenient to assembly code authors and compiler writers. It had a lot of addressing modes and they were all almost completely orthogonal, which meant you could use them on any register, source or destination, and the hardware and microcode had to figure out how to not only execute the instruction, but back out all of the state if a fault occurred and the CPU had to handle it. In addition to its addressing modes, the VAX also had complicated opcodes, such as POLY, which evaluated a polynomial of arbitrary degree by taking an X value and a pointer to an array of coefficients.
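To give a feel for how much work a single POLY packed in, here is roughly its C equivalent (a sketch based on the description above, not a cycle-accurate or operand-exact rendition): one instruction that loops over a coefficient table doing Horner's rule, and the microcode also had to be able to fault and restart in the middle of it.

```c
#include <stdio.h>

/* Roughly what one VAX POLY instruction did: evaluate a polynomial of
 * the given degree at x, using Horner's rule over a coefficient table.
 * On the VAX this entire loop was a single instruction in microcode. */
static double poly(double x, int degree, const double *coeff)
{
    double result = coeff[0];            /* highest-order coefficient first */
    for (int i = 1; i <= degree; i++)
        result = result * x + coeff[i];
    return result;
}

int main(void)
{
    const double c[] = {2.0, -3.0, 1.0};   /* 2x^2 - 3x + 1 */
    printf("%f\n", poly(2.0, 2, c));       /* 2*4 - 3*2 + 1 = 3.000000 */
    return 0;
}
```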
However, in this case it doesn't make much sense - he was changing the file anyway by putting in the "editor's note", with a lot of fancy formatting to boot.
My guess is the author was being cute in writing an acknowledgement of the mistake rather than fixing it silently.
Dave Ditzel (mentioned in the article) is still involved with RISC architectures. His current company Esperanto is shipping a 1088 core RISC-V processor.
To be clear, I understand that the chip is built from 4 general-purpose OOO RISC-V cores and one service (?) processor, while the rest of them (1088) are intended for GPU-like compute.
Then they build the compute system by using an Ice Lake Xeon to drive 8 such SoCs, where each SoC is hosted on its own dedicated PCIe 4.0 slot. And then they scale the whole thing up to a dual-socket system, which translates to 17408 vector/tensor cores and 80 GP cores, or ~17.5K cores in total.
Interesting to see John conclude, as I did after reading Patterson, that RISC vs CISC was always about a philosophical difference in how you approach chip design.
This is what I argue in this article as well, but I reach a different conclusion from him, and I think that is in large part because the rise of RISC-V has made the RISC and CISC distinction more relevant again.
When John wrote that article, RISC processors had gotten very complex. Back then the complexity was largely about complex addressing modes, but over the last decades the complexity has come more from lots of SIMD instructions.
RISC-V has aimed to reverse that trend and create a significantly smaller ISA. The RISC counter-trend to complexity is thus still alive and kicking.
Does it really though? Isn't RISC-V continually adding new complexity as it attempts to scale up from an ISA only useful for microcontrollers to one more competitive beyond that?
The entire extension system seems pretty "CISC-y" does it not?
No, that is not the case. RISC-V was designed to work in embedded systems, workstations, supercomputers and specialized hardware alike. That is why the instruction set is made modular. It allows you to tailor the chip to very different types of hardware.
The extension system is exactly why I would call RISC-V the return of RISC. It is what allows you to keep the CPU significantly simpler because you only add what you need for the system you are designing.
For instance, if you want really strong vector processing capability you can design very small cores with only vector processing instructions and the most necessary scalar operations. All the stuff you typically need to run a multi-user OS (handling privilege levels) can be thrown out.
That is exactly what Esperanto Technologies is doing. They have four fat out-of-order cores with all the instructions you typically would want in a modern CPU running Linux, while there are 1088 small in-order cores with support for the RISC-V vector extension. Vector processing actually adds very few transistors if the core is in-order rather than out-of-order.
I would say this is all quite RISCy in that you are making simple tailor-made chips rather than making huge complex monoliths to do everything, which is the CISC way IMHO.
Intel, btw, is realizing their approach was kind of dumb when they tried making their big-little core design. To keep the small cores small they had to throw out the complex AVX-512 instructions.
> Isn't RISC-V continually adding new complexity as it attempts to scale up
On the contrary, some extensions are pretty clearly designed for simplicity. For example, the original 'M' extension implemented both multiply and divide insns, but it was found that the latter were not always useful and required large area, so a multiply-only extension was created. The basic set of integer instructions is the one thing that's anywhere close to immutable about "RISC-V"; anything else is potentially open to replacement with something better, though of course with the cost of some incompatibility.
And actually, even the base set is not totally unchangeable as shown by the RV-E variant, halving the number of integer registers to 16.
Reducing the register count is not a "RISC" move. It probably makes sense for the target die size of a given product, but it's definitely not something "RISC" related which was, at the time, about doing the exact opposite - increasing the register count relative to CISC CPUs.
Similarly divide wasn't removed. It's still there. Instead a second extension was added that introduced multiply-only variants. Total complexity was increased, not reduced.
I feel that this is a common understanding of RISC but not a particularly useful one because not all complexity is equal.
Things like matrix extensions are "complicated" from the perspective of "there are a lot of instructions that do weird specific things" but, critically, none of these operations are particularly hard to do in hardware. For example, a fixed-size matrix multiply is something that the highly parallel nature of hardware is especially well suited for. This greatly contrasts with truly CISC instructions like the VAX's polynomial-evaluation instruction (POLY), which effectively just became a massive microcode subroutine because hardware couldn't implement it well either.
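A quick C sketch of that contrast (illustrative only, not any particular ISA's matrix extension): in a fixed-size matrix multiply every output element is an independent dot product, so hardware can compute all of them at once, whereas each step of a POLY-style Horner loop needs the result of the previous step.

```c
#include <stdio.h>

#define N 4

/* Fixed 4x4 multiply: C[i][j] depends only on row i of A and column j
 * of B, so all 16 (i, j) iterations are independent of one another and
 * map naturally onto parallel hardware. */
static void matmul4(float A[N][N], float B[N][N], float C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

int main(void)
{
    float I[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
    float B[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    float C[N][N];
    matmul4(I, B, C);            /* identity times B leaves B unchanged */
    printf("%g\n", C[2][3]);     /* prints 12 */
    return 0;
}
```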
During a recent conference I attended, a keynote speaker discussed their idea of having CPUs with at least 10,000 RISC-V cores and servers with one million RISC-V cores [1].
The idea is appealing, but how feasible is it and how does RISC make this possible or useful? I'm just curious what people think about this.
The core count of Tilera CPUs is totally normal today, so it's not exactly a failed idea. AMD's Ryzen Threadripper has 64 cores, and if you count the number of hardware threads you get 128. That is more than Tilera had. AMD is a very successful company. You've also got Ampere Computing with the Altra Max and its 128 cores. Anyway, these are all cases of CPUs for general-purpose computing, while the 1000-core RISC-V CPU was really a specialized system for AI acceleration and scientific computing, not general-purpose computing.
The IO and RAM for that many cores would be a bottleneck. If you manage to keep everything local to a core or a group of cores then you'll severely constrain the kind of programs you can run.
Right. However, we've had GPUs as compute engines for a while now. People have gotten better at getting massively parallel architectures to do something useful. Both machine learning and graphics fit that model.
There have been many dead ends in that space, though. Thinking Machines and the Cell processor come to mind.
Indeed, a GPU is effectively a massive grid of RISC cores (and accessing global memory is a major bottleneck). I do think that "a GPU but RISC-V" is an interesting proposal, but I doubt the major players would be willing to abandon their existing instruction sets.
It’s clear that coherent shared RAM has no chance of working in such a setting, but how plausible would it be to do some sort of explicitly networked grid interconnect like on the Epiphany or GreenArrays chips?
(My layman’s feeling always was that the GA model of a large network of small and stupid CPUs was underappreciated and hampered by GA’s merciless pricing as a possible FPGA replacement, given how painfully bad the latter are at utilizing the capabilities of modern IC technology—single layer, really?—but I’m not sure how true that is either.)
> It’s clear that coherent shared RAM has no chance of working in such a setting
It's not clear to me actually. You can run cache coherence protocols over a network, they'll just be slow. (But not any slower than message passing would be anyway.)
The Esperanto system with lots of RISC-V cores is built with localized memory for small clusters of cores. It is programmed akin to a graphics card and plenty of workloads are done on graphics cards. That means it screams on machine learning, crypto, scientific computing and many other number crunching tasks. Sure it will not make your MS Word run 1 million times faster but who needs that?
It's pretty much the same idea as "compute in memory", which is generally considered good wrt. memory bandwidth per performed instruction. Keep everything as local as possible, don't go through a costly Von Neumann bottleneck between a powerful CPU and a large, sparsely accessed RAM.
>RISC architecture is gonna change everything. -- Acid Burn
That line from the movie Hackers elicited many "that didn't age well" comments in the years after the movie came out, but it ultimately proved to be correct. It just needed a long enough timeframe to happen.
I am curious. As I understand it, both RISC and CISC processors are implemented in microcode for the actual hardware gates on the chip. And perhaps different hardware generations make it easier to build different micro-code machine architectures.
So why not just offer an instruction set that matches the base hardware architecture? Rather than have all that decoding done on the chip, why not do it in a compiler? I understand that branch prediction can only be done at run time, but presumably there is some "closer to the metal" instruction set that would be faster because it needs little or no instruction decoding. Or is instruction decoding very low cost?
> So why not just offer an instruction set that matches the base hardware architecture?
That's pretty much what a VLIW is. The problem is, well, it matches the base hardware architecture. Any change at all in μarch means a full rewrite of any binary code you might ever want to run on the chip. The Mill folks (who are essentially doing a VLIW with lots of clever heuristics/tricks to try and extract more parallelism from the code, on par with mainstream out-of-order chips) are unusually clear about that. Part of the point of a generic ISA like RISC-V is to act as a layer of abstraction, and the "close to the metal" approach you're describing doesn't do that.
To answer the last question first: yes, instruction decoding is dirt cheap (generally speaking); the things that slow down a CPU are the memory model, register retirement, etc.
In theory RISC was intended to have basically no microcode, with the CPU exposing the hardware instructions directly. In reality that is done to an extent, but superscalar execution made whatever advantage that gave mostly obsolete. This left RISC with the main advantage of having a ton of registers, which does simplify other things when designing a superscalar CPU. But at the end of the day most architectures are falling into the "FISC" singularity, where speed is all that matters. This is why AArch64 was deliberately designed to be insanely superscalar from the ground up.
I think that compatibility is usually pretty important. You want a CPU that your existing compiled programs will run on. Also compiler support: if every new revision required a new instruction set, then all the compilers out there would need updating each time. I believe instruction decoding is pretty damn fast (and pipelined too!)
It's a great submission, but please don't post archive links to HN when there's a live version of the original available. You're welcome to put the archive link in a comment; lots of people do that.