The two extreme cases still existed in that era. One extreme was the DEC VAX. The instruction set is complex, convenient, high level, and slow. The other extreme was the original IBM 801, which led to the IBM POWER architecture. In its pure form, it was one instruction per clock, had lots of registers, and was quite simple. MIPS went down that road in a big way.
Then CISC microprocessors became superscalar, and started executing more than one instruction per clock. Now RISC machines were behind in speed. So they had to become superscalar. That killed the simplicity. There was no longer any real point to pure RISC instruction sets.
(The author mentions the Itanium. That existed mostly because Intel wanted a patentable technology others couldn't clone. It was very original, and not in a useful way.)
The Itanium was an interesting design and it wasn't even Intel's originally; Intel got in on HP's design, and HP was trying to leapfrog the superscalar designs by making parallelism the responsibility of software, akin to how the MIPS had made handling aggressive pipelining the responsibility of software: The Itanium design was to encode multiple instructions in very long instruction words, where all of the instructions in a given word can be executed at once. This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.
A problem with that was, apparently, that figuring out good parallelism statically wasn't enough to get the performance gains Itanium needed to be competitive.
One of the big headaches with x86 superscalar machines is that finding the instruction boundaries is hard. The decoders, looking ahead, may guess at alignment and decode bytes which represent a totally bogus instruction, which causes work that gets discarded further downstream. Intel and AMD did this very differently in the early days of superscalar.
If you're going to have a variable length instruction set, it would be nice if it worked like UTF-8, where you can start anywhere in a stream and get aligned within a few bytes. x86 is not like that.
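To make that concrete, here's a rough C sketch of the UTF-8 property being described (purely illustrative, not real decoder code): continuation bytes always look like 10xxxxxx, so from any starting offset you skip at most three of them before landing on a character boundary. x86 machine code carries no such marker, so a decoder that starts mid-instruction can stay misaligned indefinitely.

```c
#include <stdio.h>

/* Skip UTF-8 continuation bytes (0b10xxxxxx) until we land on a lead
 * byte.  Since no encoded character is longer than 4 bytes, this
 * resynchronizes within at most 3 bytes from any starting offset. */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0) == 0x80)  /* 10xxxxxx => continuation */
        pos++;
    return pos;  /* now at a lead byte (or the end of the buffer) */
}

int main(void)
{
    /* "héllo, wörld" as raw UTF-8 bytes: é = C3 A9, ö = C3 B6 */
    const unsigned char text[] = "h\xC3\xA9llo, w\xC3\xB6rld";
    for (size_t start = 0; start < 4; start++)
        printf("start at byte %zu -> next boundary at byte %zu\n",
               start, utf8_resync(text, sizeof text - 1, start));
    return 0;
}
```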
Except none of the shipping big.LITTLE implementations are "RISC" under the framework of the article, though. ARM is very definitely post-RISC and has very CISC-y features. Apple's versions go even farther and include x86 compatibility modes, very CISC-y. And of course Intel's x86 is never accused of being RISC, either.
And really the entire reason big.LITTLE even exists at all is because the cores got too large and complex. Or, alternatively, they got too "CISC-y". So big.LITTLE is therefore about shipping a "CISC" and "RISC" core on the same piece of silicon really.
> So big.LITTLE is therefore about shipping a "CISC" and "RISC" core on the same piece of silicon really.
You're mixing up ISA and implementation. To the extent that RISC / CISC has any meaning at all anymore, it's a property of the instruction set. RISC-V, which makes a big deal out of being RISC, is after all just an ISA. And I take issue with the original article on this too.
These things have nothing to do with CISC vs. RISC. Apple's special x86 compatibility is about the TSO memory model, which, though quite strong and far from typical RISC-like approaches to concurrency (the Alpha memory model was even abandoned in more recent architectures as too weak), is not especially 'CISC' either.
>"The Itanium design was to encode multiple instructions in very long instruction words, where all of the instructions in a given word can be executed at once.
I believe you are describing VLIW architecture here? Is that correct?
>"This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can."
Interesting. How exactly does a VLIW architecture remove the need for reordering? Is it just that any instructions in a word automatically mean there are no dependencies in that long instruction? Was that the original intention of VLIW?
I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
The answer is: yes
Itanium failed because getting instruction parallelism is actually incredibly hard to do and compilers didn't catch up in time to make it matter. But it also failed because of AMD64 which had backwards compat, was cheaper by a lot, and was a lot easier to get decent perf out of.
Opinion: I think Itanium was always destined to fail. The idea seems good on paper but fails in practice because it limits what the hardware can do to achieve speed without breaking backwards compat with itself. You can't have an out-of-order Itanium by definition, because instruction scheduling, speculation, and the like are by design delegated to the compiler. This means that the chips can really only go up via clock speed, which as we now know doesn't scale forever.
> and compilers didn't catch up in time to make it matter.
Yes. I once saw an EE380 talk at Stanford by the HP team working on the Itanium compiler. They had to solve a complicated minimization problem for each block of instructions. It wasn't going well. Branch prediction decisions have to be made at compile time. Intel has built compilers where you feed tracing data back into the compiler to improve prediction, but that never caught on.
> MIPS
MIPS compilers at one time had lots of flags for telling the compiler what specific MIPS model to target. All models had the same instruction set, but different numbers of functional units, which affected the optimal code order for each model. Software vendors were supposed to provide different executables for each model. That did not go over well.
> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
It's closer to the truth to say that Itanium caused the world to become largely x86. In the 90s, while the personal computing market may have been dominated by x86, the workstation and server markets (which were the segment Itanium was targeting) were definitely a lot broader and more competitive. HP canceled its own architecture line to focus on Itanium, and a couple of other architectures saw their roadmaps dwindle because of it.
Wikipedia lists these modifications EPIC makes to the basic VLIW concept:
> Each group of multiple software instructions is called a bundle. Each of the bundles has a stop bit indicating if this set of operations is depended upon by the subsequent bundle. With this capability, future implementations can be built to issue multiple bundles in parallel. The dependency information is calculated by the compiler, so the hardware does not have to perform operand dependency checking.
> A software prefetch instruction is used as a type of data prefetch. This prefetch increases the chances for a cache hit for loads, and can indicate the degree of temporal locality needed in various levels of the cache.
> A speculative load instruction is used to speculatively load data before it is known whether it will be used (bypassing control dependencies), or whether it will be modified before it is used (bypassing data dependencies).
> A check load instruction aids speculative loads by checking whether a speculative load was dependent on a later store, and thus must be reloaded.
also:
> How exactly does a VLIW architecture remove the need for reordering? Is it just that any instructions in a word automatically mean there are no dependencies in that long instruction? Was that the original intention of VLIW?
That's exactly right: By putting the opcodes in the same word (and/or, in the case of EPIC, in a word subsequent to a previous word without the stop bit set) the entity generating the instruction stream is guaranteeing to the hardware that those opcodes can run in parallel with no problems.
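Here's a toy C model of that contract (this only captures the idea of issue groups and stop bits; it is nothing like the actual Itanium bundle encoding): the "hardware" executes every op up to a stop bit as one group, with no dependency checking at all, because the "compiler" that built the groups promised they are independent.

```c
#include <stdio.h>

typedef struct {
    int dst, src1, src2;   /* register numbers for a made-up ADD op    */
    int stop;              /* 1 = this op ends the current issue group */
} op_t;

static int regs[8] = {0, 1, 2, 3, 4, 5, 6, 7};

/* "Parallel" execution of one issue group: read all sources first, then
 * write all destinations.  This is only correct because whoever built
 * the group guaranteed the ops are independent -- no checking here. */
static void issue_group(const op_t *ops, int n)
{
    int results[8];                      /* assumes groups of at most 8 ops */
    for (int i = 0; i < n; i++)
        results[i] = regs[ops[i].src1] + regs[ops[i].src2];
    for (int i = 0; i < n; i++)
        regs[ops[i].dst] = results[i];
}

int main(void)
{
    op_t code[] = {
        {3, 1, 2, 0},   /* r3 = r1 + r2  } same issue group            */
        {4, 5, 6, 1},   /* r4 = r5 + r6  } stop bit ends the group     */
        {7, 3, 4, 1},   /* r7 = r3 + r4  -- depends on the group above */
    };
    int n = sizeof code / sizeof code[0], start = 0;
    for (int i = 0; i < n; i++)
        if (code[i].stop) { issue_group(&code[start], i - start + 1); start = i + 1; }
    printf("r7 = %d\n", regs[7]);  /* (1+2) + (5+6) = 14 */
    return 0;
}
```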
> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility for a world that was largely x86 at that point?
As others said, it was a mix, and, interestingly, the first few Itanium processor generations had a hardware x86 unit to provide compatibility, albeit one that executed x86 code at the speed of a 100 MHz Pentium on a 667 MHz Itanium part. Intel later commissioned software translation, which was actually faster.
> shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.
I think the big lesson since, well, the LISP machine is that this doesn't really work. What does work is letting the programming languages evolve driven by user needs and back-filling technology to achieve the desired performance.
The average user performance experience on a webpage is standing on the shoulders of a lot of giants, and any performance increase that relies on climbing all the way down and then back up a different stack simply isn't going to get delivered. What people want is faster sequential execution at almost any cost.
There's also an important detail that the language can only know about instruction ordering statically, while the CPU can reorder instructions at runtime depending on exactly what's ready.
> This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.
I think this was proved wrong - the opposite is true; it is incredibly difficult for software to predict the internal state of a CPU at compile time. The silicon is truly the only thing with an accurate account of its internal state (cache, register renaming, etc.).
> Then CISC microprocessors became superscalar, and started executing more than one instruction per clock. Now RISC machines were behind in speed. So they had to become superscalar. That killed the simplicity. There was no longer any real point to pure RISC instruction sets.
I thought DEC Alpha was always ahead in speed. At least until it got bought and mostly abandoned by Compaq and then HP. Or is Alpha insufficiently RISC?
The Alpha was RISC, and, as seen on page 14 of this very long, very interesting set of slides, it eventually got passed in raw MHz by x86 chips (specifically, the 1 GHz AMD Athlon vs the 667 MHz Alpha 21164 around the year 2000), but you can see it was a close race.
>"The instruction set is complex, convenient, high level, and slow."
Could you elaborate on how the DEC VAX ISA was complex yet also convenient? I feel like those two characteristics are at odds with each other. Or do complexity and convenience refer to different aspects, i.e. implementation vs. use?
Convenient to assembly code authors and compiler writers. It had a lot of addressing modes and they were all almost completely orthogonal, which meant you could use them on any register, source or destination, and the hardware and microcode had to figure out how to not only execute the instruction, but back out all of the state if a fault occurred and the CPU had to handle it. In addition to its addressing modes, the VAX also had complicated opcodes, such as POLY, which evaluated a polynomial of arbitrary degree by taking an X value and a pointer to an array of coefficients.
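To give a feel for how much work a single POLY packed in, here is roughly its C equivalent (a sketch based on the description above, not a cycle-accurate or operand-exact rendition): one instruction that loops over a coefficient table doing Horner's rule, and the microcode also had to be able to fault and restart in the middle of it.

```c
#include <stdio.h>

/* Roughly what one VAX POLY instruction did: evaluate a polynomial of
 * the given degree at x, using Horner's rule over a coefficient table.
 * On the VAX this entire loop was a single instruction in microcode. */
static double poly(double x, int degree, const double *coeff)
{
    double result = coeff[0];            /* highest-order coefficient first */
    for (int i = 1; i <= degree; i++)
        result = result * x + coeff[i];
    return result;
}

int main(void)
{
    const double c[] = {2.0, -3.0, 1.0};   /* 2x^2 - 3x + 1 */
    printf("%f\n", poly(2.0, 2, c));       /* 2*4 - 3*2 + 1 = 3.000000 */
    return 0;
}
```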
However, in this case it doesn't make much sense - he was changing the file anyway by putting in the "editor's note", with a lot of fancy formatting to boot.
My guess is the author was being cute in writing an acknowledgement of the mistake rather than fixing it silently.
Dave Ditzel (mentioned in the article) is still involved with RISC architectures. His current company Esperanto is shipping a 1088 core RISC-V processor.
To be clear, I understand that the chip is built from 4 general-purpose OOO RISC-V cores and one service (?) processor, while the rest of them (1088) are intended for GPU-like compute.
Then they build the compute system by using an Ice Lake Xeon to drive 8 such SoCs, where each SoC is hosted on its own dedicated PCIe 4.0 slot. And then they scale the whole thing up to a dual-socket system, which translates to 17408 vector/tensor cores and 80 GP cores, or ~17.5K cores in total.
Interesting to see John conclude, as I did after reading Patterson, that RISC vs CISC was always about a philosophical difference in how you approach chip design.
This is what I argue in this article as well, but I reach a different conclusion from him, and I think that is in large part because the rise of RISC-V has made the RISC and CISC distinction more relevant again.
When John wrote that article, RISC processors had gotten very complex. Back then the complexity was largely about complex addressing modes, but over the last decades the complexity has come more from lots of SIMD instructions.
RISC-V has aimed to reverse that trend and create a significantly smaller ISA. The RISC counter-trend to complexity is thus still alive and kicking.
Does it really though? Isn't RISC-V continually adding new complexity as it attempts to scale up from an ISA only useful for microcontrollers to one more competitive beyond that?
The entire extension system seems pretty "CISC-y" does it not?
No, that is not the case. RISC-V was designed to work in embedded systems, workstations, supercomputers and specialized hardware alike. That is why the instruction set is made modular. It allows you to tailor the chip to very different types of hardware.
The extension system is exactly why I would call RISC-V the return of RISC. It is what allows you to keep the CPU significantly simpler because you only add what you need for the system you are designing.
For instance, if you want really strong vector processing capability you can design very small cores with only vector processing instructions and the most necessary scalar operations. All the stuff you typically need to run a multi-user OS (handling privilege levels) can be thrown out.
That is exactly what Esperanto Technologies is doing. They have four fat out-of-order cores with all the instructions you typically would want in a modern CPU running Linux, while there are 1088 small in-order cores with support for the RISC-V vector extension. Vector processing actually adds very few transistors if the core is in-order rather than out-of-order.
I would say this is all quite RISCy in that you are making simple tailor-made chips rather than making huge complex monoliths to do everything, which is the CISC way IMHO.
Intel, btw, is realizing their approach was kind of dumb when they tried making their big-little core design. To keep the small cores small they had to throw out the complex AVX-512 instructions.
> Isn't RISC-V continually adding new complexity as it attempts to scale up
On the contrary, some extensions are pretty clearly designed for simplicity. For example, the original 'M' extension implemented both multiply and divide insns, but it was found that the latter were not always useful and required large area, so a multiply-only extension was created. The basic set of integer instructions is the one thing that's anywhere close to immutable about "RISC-V"; anything else is potentially open to replacement with something better, though of course with the cost of some incompatibility.
And actually, even the base set is not totally unchangeable as shown by the RV-E variant, halving the number of integer registers to 16.
Reducing the register count is not a "RISC" move. It probably makes sense for the target die size of a given product, but it's definitely not something "RISC" related which was, at the time, about doing the exact opposite - increasing the register count relative to CISC CPUs.
Similarly divide wasn't removed. It's still there. Instead a second extension was added that introduced multiply-only variants. Total complexity was increased, not reduced.
I feel that this is a common understanding of RISC but not a particularly useful one because not all complexity is equal.
Things like matrix extensions are "complicated" from the perspective of "there are a lot of instructions that do weird specific things" but, critically, none of these operations are particularly hard to do in hardware. For example, a fixed-size matrix multiply is something that the highly parallel nature of hardware is especially well suited for. This greatly contrasts with truly CISC instructions like the VAX's polynomial-evaluation instruction (POLY), which effectively just became a massive microcode subroutine because hardware couldn't implement it well either.
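A quick C sketch of that contrast (illustrative only, not any particular ISA's matrix extension): in a fixed-size matrix multiply every output element is an independent dot product, so hardware can compute all of them at once, whereas each step of a POLY-style Horner loop needs the result of the previous step.

```c
#include <stdio.h>

#define N 4

/* Fixed 4x4 multiply: C[i][j] depends only on row i of A and column j
 * of B, so all 16 (i, j) iterations are independent of one another and
 * map naturally onto parallel hardware. */
static void matmul4(float A[N][N], float B[N][N], float C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

int main(void)
{
    float I[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
    float B[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    float C[N][N];
    matmul4(I, B, C);            /* identity times B leaves B unchanged */
    printf("%g\n", C[2][3]);     /* prints 12 */
    return 0;
}
```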
During a recent conference I attended, a keynote speaker discussed their idea of having CPUs with at least 10,000 RISC-V cores and servers with one million RISC-V cores [1].
The idea is appealing, but how feasible is it and how does RISC make this possible or useful? I'm just curious what people think about this.
The core count of Tilera CPUs is totally normal today, so it's not exactly a failed idea. AMD's Ryzen Threadripper has 64 cores, and if you count the number of hardware threads you get 128. That is more than Tilera had. AMD is a very successful company. You've also got Ampere Computing with the Altra Max and its 128 cores. Anyway, these are all cases of CPUs for general-purpose computing, while the 1000-core RISC-V CPU was really a specialized system for AI acceleration and scientific computing, not general-purpose computing.
The IO and RAM for that many cores would be a bottleneck. If you manage to keep everything local to a core or a group of cores then you'll severely constrain the kind of programs you can run.
Right. However, we've had GPUs as compute engines for a while now. People have gotten better at getting massively parallel architectures to do something useful. Both machine learning and graphics fit that model.
There have been many dead ends in that space, though. Thinking Machines and the Cell processor come to mind.
Indeed, a GPU is effectively a massive grid of RISC cores (and accessing global memory is a major bottleneck). I do think that "a GPU but RISC-V" is an interesting proposal, but I doubt the major players would be willing to abandon their existing instruction sets.
It’s clear that coherent shared RAM has no chance of working in such a setting, but how plausible would it be to do some sort of explicitly networked grid interconnect like on the Epiphany or GreenArrays chips?
(My layman’s feeling always was that the GA model of a large network of small and stupid CPUs was underappreciated and hampered by GA’s merciless pricing as a possible FPGA replacement, given how painfully bad the latter are at utilizing the capabilities of modern IC technology—single layer, really?—but I’m not sure how true that is either.)
> It’s clear that coherent shared RAM has no chance of working in such a setting
It's not clear to me actually. You can run cache coherence protocols over a network, they'll just be slow. (But not any slower than message passing would be anyway.)
The Esperanto system with lots of RISC-V cores is built with localized memory for small clusters of cores. It is programmed akin to a graphics card and plenty of workloads are done on graphics cards. That means it screams on machine learning, crypto, scientific computing and many other number crunching tasks. Sure it will not make your MS Word run 1 million times faster but who needs that?
It's pretty much the same idea as "compute in memory", which is generally considered good wrt. memory bandwidth per performed instruction. Keep everything as local as possible, don't go through a costly Von Neumann bottleneck between a powerful CPU and a large, sparsely accessed RAM.
>RISC architecture is gonna change everything. -- Acid Burn
That line from the movie Hackers elicited many "that didn't age well" comments in the years after the movie came out, but it ultimately proved to be correct. It just needed a long enough timeframe to happen.
I am curious. As I understand it, both RISC and CISC processors are implemented in microcode for the actual hardware gates on the chip. And perhaps different hardware generations make it easier to build different micro-code machine architectures.
So why not just offer an instruction set that matches the base hardware architecture? Rather than have all that decoding done on the chip, why not do it in a compiler? I understand that branch prediction can only be done at run time, but presumably there is some "closer to the metal" instruction set that would be faster because it needs little or no instruction decoding. Or is instruction decoding very low cost?
> So why not just offer an instruction set that matches the base hardware architecture?
That's pretty much what a VLIW is. The problem is, well, it matches the base hardware architecture. Any change at all in μarch means a full rewrite of any binary code you might ever want to run on the chip. The Mill folks (who are essentially doing a VLIW with lots of clever heuristics/tricks to try and extract more parallelism from the code, on par with mainstream out-of-order chips) are unusually clear about that. Part of the point of a generic ISA like RISC-V is to act as a layer of abstraction, and the "close to the metal" approach you're describing doesn't do that.
To answer the last question first: yes, instruction decoding is dirt cheap (generally speaking); the things that slow down a CPU are the memory model, register retirement, etc.
In theory RISC was intended to have basically no microcode, with the CPU exposing the hardware instructions directly. In reality that is done to an extent, but superscalar execution made whatever advantage that gave mostly obsolete. This left RISC with the main advantage of having a ton of registers, which does simplify other things when designing a superscalar CPU. But at the end of the day most architectures are falling into the "FISC" singularity, where speed is all that matters. This is why AArch64 was deliberately designed to be insanely superscalar from the ground up.
I think that compatibility is usually pretty important. You want a CPU that your existing compiled programs will run on. Also compiler support: if every new revision required a new instruction set, then all the compilers out there would need updating each time. I believe instruction decoding is pretty damn fast (and pipelined too!)
It's a great submission, but please don't post archive links to HN when there's a live version of the original available. You're welcome to put the archive link in a comment; lots of people do that.