The two extreme cases still existed in that era. One extreme was the DEC VAX. The instruction set is complex, convenient, high level, and slow. The other extreme was the original IBM 801, which led to the IBM POWER architecture. In its pure form, it was one instruction per clock, had lots of registers, and was quite simple. MIPS went down that road in a big way.

Then CISC microprocessors became superscalar, and started executing more than one instruction per clock. Now RISC machines were behind in speed. So they had to become superscalar. That killed the simplicity. There was no longer any real point to pure RISC instruction sets.

(The author mentions the Itanium. That existed mostly because Intel wanted a patentable technology others couldn't clone. It was very original, and not in a useful way.)




The Itanium was an interesting design, and it wasn't even Intel's originally; Intel got in on HP's design. HP was trying to leapfrog the superscalar designs by making parallelism the responsibility of software, akin to how the MIPS had made handling aggressive pipelining the responsibility of software: The Itanium design was to encode multiple instructions in very long instruction words, where all of the instructions in a given word can be executed at once. This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.

A problem with that was that, apparently, figuring out good parallelism statically wasn't enough to get the performance gains Itanium needed to be competitive:

https://stackoverflow.com/questions/1011760/what-are-the-tec...

For example, Itanium struggled with the non-deterministic nature of memory latency:

https://softwareengineering.stackexchange.com/questions/2793...

AMD iterating on x86 to produce AMD64 and giving x86 code a compatible path to the 64-bit world certainly didn't help Itanium's prospects, either.

Also:

> There was no longer any real point to pure RISC instruction sets.

They are simpler to decode and execute, which is nice if you're making a small, cheap core aiming at low power consumption.


> They are simpler to decode and execute, which is nice if you're making a small, cheap core aiming at low power consumption.

Yes, and there is an interesting corollary: it makes a big.LITTLE-type arrangement more effective.


> They are simpler to decode

One of the big headaches with x86 superscalar machines is that finding the instruction boundaries is hard. The decoders, looking ahead, may guess at alignment and decode bytes which represent a totally bogus instruction, which causes work that gets discarded further downstream. Intel and AMD did this very differently in the early days of superscalar.

If you're going to have a variable length instruction set, it would be nice if it worked like UTF-8, where you can start anywhere in a stream and get aligned within a few bytes. x86 is not like that.
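
To make the UTF-8 comparison concrete, here's a small C sketch (the function name and framing are mine, not from any spec): continuation bytes always look like 10xxxxxx and lead bytes never do, so a decoder dropped at an arbitrary byte offset can resynchronize by skipping at most three bytes. x86 reserves no such "this is the middle of an instruction" byte pattern, which is why the decoders have to guess and sometimes throw work away.

    #include <stddef.h>
    #include <stdint.h>

    /* Return the offset of the next UTF-8 character boundary at or after pos. */
    size_t utf8_resync(const uint8_t *buf, size_t len, size_t pos)
    {
        while (pos < len && (buf[pos] & 0xC0) == 0x80)  /* 10xxxxxx? */
            pos++;                                      /* skip continuation byte */
        return pos;  /* buf[pos] is now a lead byte (or we ran off the end) */
    }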


Except none of the shipping big.LITTLE implementations are "RISC" under the framework of the article, though. ARM is very definitely post-RISC and has very CISC-y features. Apple's versions go even farther and include x86 compatibility modes, very CISC-y. And of course Intel's x86 is never accused of being RISC, either.

And really, the entire reason big.LITTLE even exists at all is that the cores got too large and complex. Or, alternatively, they got too "CISC-y". So big.LITTLE is really about shipping a "CISC" core and a "RISC" core on the same piece of silicon.


> So big.LITTLE is really about shipping a "CISC" core and a "RISC" core on the same piece of silicon.

You're mixing up ISA and implementation. To the extent that RISC/CISC has any meaning at all anymore, it's a property of the instruction set. RISC-V, which makes a big deal out of being RISC, is after all just an ISA. And I take issue with the original article on this too.


These things have nothing to do with CISC vs. RISC. Apple's special x86 compatibility is about the TSO memory-ordering model, which, though quite strong and far from the weaker memory models typical of RISC designs (the Alpha memory model was even abandoned in more recent architectures as too weak), is not especially 'CISC' either.
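
To make that concrete, the classic "message passing" litmus test below (a sketch of mine, using C11 relaxed atomics as stand-ins for the plain loads and stores the hardware sees) separates the two models. Assuming the compiler emits the four accesses in program order, TSO forbids the outcome r1 == 1 && r2 == 0, since stores aren't reordered with stores and loads aren't reordered with loads; ARM's weaker model allows it unless barriers are added, which is exactly why translated x86 code benefits from a hardware TSO mode.

    #include <stdatomic.h>

    atomic_int data = 0, flag = 0;

    void producer(void) {
        atomic_store_explicit(&data, 1, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* "data is ready" */
    }

    void consumer(int *r1, int *r2) {
        *r1 = atomic_load_explicit(&flag, memory_order_relaxed);
        *r2 = atomic_load_explicit(&data, memory_order_relaxed);
        /* r1 == 1 && r2 == 0: forbidden under TSO, observable on weaker models. */
    }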


>"The Itanium design was to encode multiple instructions in very long instruction words, where all of the instructions in a given word can be executed at once.

I believe you are describing VLIW architecture here? Is that correct?

>"This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can."

Interesting. How exactly does a VLIW architecture remove the need for reordering? Is it just that any instructions packed into the same long word are automatically guaranteed to have no dependencies between them? Was that the original intention of VLIW?

I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility in a world that was largely x86 at that point?


> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility in a world that was largely x86 at that point?

The answer is: yes

Itanium failed because extracting instruction-level parallelism statically is actually incredibly hard, and compilers didn't catch up in time to make it matter. But it also failed because of AMD64, which had backwards compatibility, was a lot cheaper, and was a lot easier to get decent performance out of.

Opinion: I think Itanium was always destined to fail. The idea seems good on paper but fails in practice because it limits what the hardware can do to achieve speed without breaking backwards compatibility with itself. You can't have an out-of-order Itanium, by definition, because instruction scheduling, branch prediction, etc. are by design delegated to the compiler. This means the chips can really only get faster via clock speed, which, as we now know, doesn't scale forever.


> and compilers didn't catch up in time to make it matter.

Yes. I once saw an EE380 talk at Stanford by the HP team working on the Itanium compiler. They had to solve a complicated minimization problem for each block of instructions. It wasn't going well. Branch prediction decisions have to be made at compile time. Intel has built compilers where you feed tracing data back into the compiler to improve prediction, but it never caught on.

> MIPS

MIPS compilers at one time had lots of flags for telling the compiler what specific MIPS model to target. All models had the same instruction set, but different numbers of functional units, which affected the optimal code order for each model. Software vendors were supposed to provide different executables for each model. That did not go over well.


> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility in a world that was largely x86 at that point?

It's closer to the truth to say that Itanium caused the world to become largely x86. In the 90s, while the personal computing market may have been dominated by x86, the workstation and server markets (which were the segment Itanium was targeting) were definitely a lot broader and more competitive. HP canceled its own architecture line to focus on Itanium, and a couple of other architectures saw their roadmaps dwindle because of it.


RISC Architectures confirmed murdered by Itanium: Alpha AXP, MIPS (on the desktop), and HP PA-RISC

Architectures set back significantly by Itanium: POWER, SPARC


> I believe you are describing VLIW architecture here? Is that correct?

Yes. The Itanium was a specific form of VLIW called EPIC, for Explicitly Parallel Instruction Computing:

https://en.wikipedia.org/wiki/Explicitly_parallel_instructio...

Wikipedia lists these modifications EPIC makes to the basic VLIW concept:

> Each group of multiple software instructions is called a bundle. Each of the bundles has a stop bit indicating if this set of operations is depended upon by the subsequent bundle. With this capability, future implementations can be built to issue multiple bundles in parallel. The dependency information is calculated by the compiler, so the hardware does not have to perform operand dependency checking.

> A software prefetch instruction is used as a type of data prefetch. This prefetch increases the chances for a cache hit for loads, and can indicate the degree of temporal locality needed in various levels of the cache.

> A speculative load instruction is used to speculatively load data before it is known whether it will be used (bypassing control dependencies), or whether it will be modified before it is used (bypassing data dependencies).

> A check load instruction aids speculative loads by checking whether a speculative load was dependent on a later store, and thus must be reloaded.

also:

> How exactly does a VLIW architecture remove the need for reordering? Is it just that any instructions packed into the same long word are automatically guaranteed to have no dependencies between them? Was that the original intention of VLIW?

That's exactly right: by putting the opcodes in the same word (and/or, in the case of EPIC, in a word that follows a previous word without the stop bit set), the entity generating the instruction stream is guaranteeing to the hardware that those opcodes can run in parallel with no problems.
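
For a rough picture of what those bundles look like on Itanium (a sketch based on the description above; the bit positions follow my reading of the IA-64 format, and the real template encodings live in the manuals): each 128-bit bundle carries a 5-bit template plus three 41-bit instruction slots, and the template says which unit types the slots go to and where the stops fall.

    #include <stdint.h>

    struct ia64_bundle {
        uint8_t  template_bits;  /* 5 bits: slot-to-unit mapping + stop positions */
        uint64_t slot[3];        /* three 41-bit instruction slots */
    };

    /* Unpack a raw 128-bit bundle, given as two little-endian 64-bit halves. */
    static struct ia64_bundle unpack_bundle(uint64_t lo, uint64_t hi)
    {
        const uint64_t mask41 = (1ULL << 41) - 1;
        struct ia64_bundle b;
        b.template_bits = (uint8_t)(lo & 0x1F);          /* bits 0..4    */
        b.slot[0] = (lo >> 5) & mask41;                  /* bits 5..45   */
        b.slot[1] = ((lo >> 46) | (hi << 18)) & mask41;  /* bits 46..86  */
        b.slot[2] = (hi >> 23) & mask41;                 /* bits 87..127 */
        return b;
    }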

> I'm curious: did Itanium fail because the model of pushing the complexity onto the software and a human failed, or did it fail because of a lack of backward compatibility in a world that was largely x86 at that point?

As others said, it was a mix. Interestingly, the first few Itanium processor generations had a hardware x86 unit to provide compatibility, albeit one that executed x86 code at the speed of a 100 MHz Pentium on a 667 MHz Itanium part. Intel later commissioned software translation, which was actually faster:

https://www.informationweek.com/it-life/intel-sees-a-32-bit-...

Here's a very informative (but long) bunch of slides about Itanium in theory and practice:

https://users.nik.uni-obuda.hu/sima/letoltes/Processor_famil...


> shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.

I think the big lesson since, well, the LISP machine is that this doesn't really work. What does work is letting programming languages evolve, driven by user needs, and back-filling technology to achieve the desired performance.

The average user performance experience on a webpage is standing on the shoulders of a lot of giants, and any performance increase that relies on climbing all the way down and then back up a different stack simply isn't going to get delivered. What people want is faster sequential execution at almost any cost.

There's also an important detail: the language and its compiler can only reason about instruction ordering statically, while the CPU can reorder instructions at runtime depending on exactly what's ready.
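
A tiny C sketch of that point (example mine): the latency of the load below depends on whether the node happens to be in cache, which is unknowable at compile time, so an out-of-order core can keep independent work in flight while a static schedule has to bake in a guess.

    #include <stddef.h>

    struct node { struct node *next; long value; };

    long sum_list(const struct node *n)
    {
        long total = 0;
        while (n) {
            /* n->value may hit L1 (a few cycles) or miss to DRAM (hundreds).
               An out-of-order core discovers which at runtime and schedules
               around it; a compile-time schedule cannot. */
            total += n->value;
            n = n->next;
        }
        return total;
    }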


The Connection Machine showed otherwise, as do HPC languages such as Chapel.


> This removes the need for the hardware to do reordering, and shoves the responsibility for finding parallelism onto the human or compiler, both of which can, presumably, take a more global view of the problem than a piece of silicon can.

I think this was proved wrong; the opposite is true. It is incredibly difficult for software to predict the internal state of a CPU at compile time. The silicon is truly the only thing with an accurate account of its internal state (cache, register renaming, etc.).


> Then CISC microprocessors became superscalar, and started executing more than one instruction per clock. Now RISC machines were behind in speed. So they had to become superscalar. That killed the simplicity. There was no longer any real point to pure RISC instruction sets.

I thought DEC Alpha was always ahead in speed. At least until it got bought and mostly abandoned by Compaq and then HP. Or is Alpha insufficiently RISC?


The Alpha was RISC, and, as seen on page 14 of this very long, very interesting set of slides, it eventually got passed in raw MHz by x86 chips (specifically, the 1 GHz AMD Athlon vs. the 667 MHz Alpha 21164 around the year 2000), but you can see it was a close race:

https://users.nik.uni-obuda.hu/sima/letoltes/Processor_famil...

This PDF shows the Alpha 21164 was superscalar:

https://acg.cis.upenn.edu/milom/cis501-Fall09/papers/Alpha21...


While that's true, it still retained a performance advantage in floating point[1]. And that's after 2 years of neglect after being sold to Compaq.

---

1. https://www.realworldtech.com/battle64/


>"The instruction set is complex, convenient, high level, and slow."

Could you elaborate on how the DEC VAX ISA was complex yet also convenient? I feel like those two characteristics are at odds with each other. Or do complexity and convenience refer to different aspects, i.e., implementation vs. use?


Convenient to assembly-code authors and compiler writers. It had a lot of addressing modes, and they were almost completely orthogonal, which meant you could use them on any operand, source or destination. The hardware and microcode had to figure out not only how to execute the instruction, but how to back out all of the state if a fault occurred and the CPU had to handle it. In addition to its addressing modes, the VAX also had complicated opcodes, such as POLY, which evaluated a polynomial of arbitrary degree by taking an X value and a pointer to an array of coefficients.
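
For a sense of how high-level that is, here is roughly what a single POLY instruction computes, written out as C (a sketch based on the description above, not the exact VAX semantics; the real instruction also pins down coefficient ordering, rounding, and fault behavior):

    /* Horner's method: result = c[0]*x^degree + c[1]*x^(degree-1) + ... + c[degree] */
    double vax_poly(double x, int degree, const double *coeff)
    {
        double result = coeff[0];
        for (int i = 1; i <= degree; i++)
            result = result * x + coeff[i];
        return result;
    }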



