Was at Multiflow (Yale spinoff with Josh Fisher and John O'Donnell) '85-90 and saw the VLIW problem up close (was in the OS group, eventually running it).
The main problem was compiler complexity -- the hoped-for "junk parallelism" gains really never panned out (maybe 2-3X?), so the compiler was best when it could discover, or be fed, vector operations.
But Convex (main competitor at the time) already had the "minisupercomputer vector" market locked up.
So Multiflow folded in early '90 (I had already bailed, seeing the handwriting mural) after burning through $60M in VC, which was a record at the time, I believe.
Multiflow folded in early '90, but then in 1994 HP and Intel bet their futures on this? Convex (using HP PA-RISC, no less) defeated Multiflow, yet HP still thought they could make it work? What was the nature of HP's misjudgement? They must have had a theory for why Multiflow failed but they would succeed, no?
I recall there was new research coming out of the University of Illinois at Urbana-Champaign that breathed new hope into this, but I would be curious why HP and Intel pursued it if Multiflow failed. I do recall HP brought a compiler team to the table, and per Wikipedia they had a research team working on VLIW since 1989. HP held an internal bake-off of RISC vs. VLIW, but did it not capture the compiler challenges or workloads properly?
Strategy-wise, for a vendor (HP) looking to phase out their RISC investments while maintaining some degree of backwards compatibility, and to become first among equals of their RISC peers by partnering with the market-volume leader Intel, it made great sense... but that only works if the technology pans out. Did HP management get seduced by a competitive strategic play and make a poor engineering/technical call? Or did the internal research team oversell itself even knowing about Multiflow? Or was it something else? Hindsight is 20/20, but what key assumption(s) went wrong that we can all learn from?
I worked at Chromatic, where we built a series of 2-wide VLIWs. Writing a compiler (actually just the assembler) that could extract that parallelism was pretty easy, just some low-level register flow analysis. I can imagine getting something like 6-way would be a lot harder, though.
Not that it undermines your point much, but $60M (1990 resp. 1985) = $140M resp. $170M today. (I’m not good enough at this to correct for the interest rate differences as well.)
And hilariously*, Convex was eventually eaten by HP. Though the PowerPC Altivec/Velocity engine always looked a lot like Parsec to me. The past lives on in weird places.
Was at Convex and then HP (and then Convey) and worked quite hard porting/optimizing numerical/scientific apps for the I2. Eventually, I think performance for some apps was OK, I mean considering a 900 MHz clock and all.
I actually have a short book on the Itanic/Itanium done and planned to have it released as a free download by now. But various schedule-related stuff happened and it just hasn't happened yet.
I was a mostly hardware-focused industry analyst during Itanium's heyday so I find the topic really interesting. From a technical perspective, compilers (and dependency on them) certainly played a role but there were a bunch of other lessons too around market timing, partner strategies, fighting the last war, etc.
I worked on Merced post-silicon and McKinley pre-silicon. I wasn't an architect on the project; I just worked on keeping the power grid alive and thermals under control. It reminded me of working on the 486: the team was small and engaged, even though HP was problematic for parts of it. Pentium Pro was sucking up all the marketing air, so we were kind of left alone to do our own thing since the part wasn't making money yet.

This was also during the corporate-wide transition to Linux, removing AIX/SunOS/HP-UX. I had a Merced in my office, but sadly it was running Linux in 32-bit compatibility mode, which is where we spent a lot of time fixing bugs because we knew lots of people weren't going to port to IA-64 right away, and that ate up a ton of debug resources. The world was still migrating to Windows NT 3.5 and Windows 95, so migrating to 64-bit was way too soon. I don't remember when the Linux kernel finally got ported to IA-64, but it seemed odd to have a platform without an OS (or an OS running in 32-bit mode). We had plenty of emulators; there's no reason why pre-silicon kernel development couldn't have happened faster (which was what HP was supposed to be doing). Kind of a bummer, but it was a fun time, before the race to 1 GHz became the next $$$ sink / pissing contest.
I was at HP pre-Merced tape-out and HP did have a number of simulators available. I worked on a compiler-related team so we were downstream.
As for running Linux in 32-bit compatibility mode, wasn't that the worst of all worlds on Merced? When I was there, which was pre-Merced tape-out, a tiny bit of the chip was devoted to the IVE (Intel Value Engine), which the docs stated was supposed to be just good enough to boot the firmware and then jump into IA-64 mode. I figured at the time that this was the goal: boot in 32-bit x86 and then jump to 64-bit mode.
Yes, yes it was! It ended up playing a much larger role for marketing transition efforts, larger than it should have. But the Catch-22 has been analyzed to death.
Well, what actually killed it historically was AMD64. AMD64 could easily not have happened, AMD has a very inconsistent track record; other contemporary CPUs like Alpha were never serious competitors for mainstream computing, and ARM was nowhere near being a contender yet. In that scenario, obviously mainstream PC users would have stuck with x86-32 for much longer than they actually did, but I think in the end they wouldn't have had any real choice but to be dragged kicking and screaming to Itanium.
PowerPC is the one I’d have bet on - Apple provided baseline volume, IBM’s fabs were competitive enough to be viable, and Windows NT had support. If you had the same Itanium stumble without the unexpectedly-strong x86 options, it’s not hard to imagine that having gotten traction. One other what-if game is asking what would’ve happened if Rick Belluzzo had either not been swayed by the Itanium/Windows pitch or been less effective advocating for it: he took PA-RISC and MIPS out, and really helped boost the idea that the combination was inevitable.
I also wouldn’t have ruled out Alpha. That’s another what-if scenario but they had 2-3 times Intel’s top performance and a clean 64-bit system a decade earlier. The main barrier was the staggering managerial incompetence at DEC: it was almost impossible to buy one unless you were a large existing customer. If they’d had a single competent executive, they could have been far more competitive.
Interesting to note that all state of the art video game consoles of the era (xbox 360, PS3 and Wii) used PowerPC CPUs (in the preceding generation the xbox used a Pentium III, the PS2 used MIPS and the GameCube was already PPC).
Address space pressure was immense back in the day, and simply doubling the width of everything while retaining compatibility was the obvious choice.
> Address space pressure was immense back in the day, and simply doubling the width of everything while retaining compatibility was the obvious choice.
PAE (https://en.wikipedia.org/wiki/Physical_Address_Extension) existed for quite some time to enable x86-32 processors to access more than 4 GiB of RAM. Thus, I would argue that the OS providing some functionality to move allocated pages in and out of the 32-bit address space of a process, to enable the process to use more than 4 GiB of memory, would have been a much more obvious choice.
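A minimal sketch of that windowing idea (my own illustration, assuming a Unix-like OS with large-file support; the backing-store path is made up):

```
/* Sketch: a 32-bit process "windowing" a larger-than-4GiB backing object by
 * remapping views, roughly the OS-assisted model argued for above. */
#define _FILE_OFFSET_BITS 64          /* 64-bit off_t even on a 32-bit build */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE (256u * 1024 * 1024)   /* 256 MiB view at a time */

int main(void) {
    /* Hypothetical backing object larger than the 32-bit address space. */
    int fd = open("/big/backing-store", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map window N, touch it, unmap it, then slide to window N+1. */
    for (uint64_t off = 0; off < (uint64_t)6 * 1024 * 1024 * 1024; off += WINDOW_SIZE) {
        char *view = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, (off_t)off);
        if (view == MAP_FAILED) { perror("mmap"); return 1; }
        view[0] = 1;                   /* use the current window */
        munmap(view, WINDOW_SIZE);     /* slide the window forward */
    }
    close(fd);
    return 0;
}
```

This is roughly what Windows exposed as AWE for 32-bit processes; it works, but every window slide costs a syscall plus TLB churn, which is part of why the flat 64-bit address space won out.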
> Thus, I would argue that the OS providing some functionality to move allocated pages in and out of the 32-bit address space of a process, to enable the process to use more than 4 GiB of memory ...
Oh, no. Back then the segmented memory model was still remembered and no one wanted a return to that. PAE wasn't seen as anything but a bandaid.
Everyone wanted big flat address space. And we got it. Because it was the obvious choice, and the silicon could support it, Intel or no.
PAE got some use - for that “each process gets 4GB” model you mentioned in Darwin and Linux - but it was slower and didn’t allow individual processes to easily use more than 2-3GB in practice.
In what way? Their track record is pretty consistent actually, which is what partially led to them fumbling the Athlon lead (along with Intel's shady business practices).
During the AMD64 days, AMD was pretty reliable with their technical advancements.
Yes, but AMD was only able to push AMD64 as an Itanium alternative for servers because they were having something of a renaissance with Opteron (2003 launch). In 2000/2001, AMD was absolutely not seen as something any serious server maker would choose over Intel.
You're right, there were ebbs and flows in their influence...but they were consistent in those trends. Releasing an extension during their strong period was almost certain to be picked up, especially if Intel wasn't offering an alternative (which Itanium wasn't considered as it was server only).
My uninformed opinion: lots of speculative execution is good for single core performance, but terrible for power efficiency.
Have data centres always been limited by power/cooling costs, or did that only become a major consideration during the move to more commodity hardware?
Seeing the direction Intel is going with heterogeneous compute (P vs. E cores) and their patent to replace hyperthreading with the concept of "rentable" units, it seems the trend now is toward exposing the innards of the CPU (via the thread director) and making it more flexible to OS control, so that the OS can use better algorithms to decide where, when, and for how long work runs.
My bad - I misread a doc recently which implied otherwise, albeit that GCN used shorter ones. Just checked AMD docs straight and it was indeed normal scalar instructions.
VLIW reminded me of Transmeta, but unfortunately...
"For Sun, however, their VLIW project was abandoned. David Ditzel left Sun and founded Transmeta along with Bob Cmelik, Colin Hunter, Ed Kelly, Doug Laird, Malcolm Wing and Greg Zyner in 1995. Their new company was focused on VLIW chips, but that company is a story for another day."
> In 1997 Intel was the king of the hill; in that year it first announced the Itanium or IA-64 processor. That same year, research company IDC predicted that the Itanium would take over the world, racking up $38 billion in sales in 2001.
> What we heard was that HP, IBM, Dell, and even Sun Microsystems would use these chips and discontinue anything else they were developing. This included Sun making noise about dropping the SPARC chip for this thing—sight unseen. I say "sight unseen" because it would be years before the chip was even prototyped. The entire industry just took Intel at its word that Itanium would work as advertised in a PowerPoint presentation.
And then the original article has an Intel leader saying "Everything was new. When you do that, you're going to stumble". Yeah, much as Intel stumbled with the Pentium 4 and basically everything since Skylake in 2015 (which was late). Let's emphasize this: for nearly ten years now, Intel hasn't been able to deliver on time and on target. Just last year, Sapphire Rapids, after being late by two years, shipped in March 2023 and needed to pause in June because of a bug. Meteor Lake was also two years late. In 2020: https://www.zdnet.com/article/intels-7nm-product-transition-...
> Intel's first 7nm product, a client CPU, is now expected to start shipping in late 2022 or early 2023, CEO Bob Swan said on a conference call Thursday.
> The yield of Intel's 7nm process is now trending approximately 12 months behind the company's internal target.
Well then the internal target must've been late 2021 and it came out late 2023.
I’d be interested in understanding why the compilers never panned out but have never seen a good writeup on that. Or why people thought the compilers would be able to succeed in the first place at the mission.
> I’d be interested in understanding why the compilers never panned out but have never seen a good writeup on that. Or why people thought the compilers would be able to succeed in the first place at the mission.
It's a fundamentally impossible ask.
Compilers are being asked to look at a program (perhaps watch it run a sample set) and guess the bias of each branch to construct a most-likely 'trace' path through the program, and then generate STATIC code for that path.
But programs (and their branches) are not statically biased! So it simply doesn't work out for general-purpose codes.
However, programs are fairly predictable, which means a branch predictor can dynamically learn the program path and regurgitate it on command. And if the program changes phases, the branch predictor can re-learn the new program path very quickly.
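A tiny C example (mine, not from any of the original compilers) of why committing to a static trace is fragile: the "likely" side of this branch is a property of the data, not the code.

```
#include <stddef.h>

/* With one input set this branch is taken almost always, with another almost
 * never. A trace scheduler must pick one winner at compile time; a branch
 * predictor just learns whichever bias the current data exhibits. */
long sum_positives(const long *v, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > 0)
            sum += v[i];
    }
    return sum;
}
```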
Now if you wanted to couple a VLIW design with a dynamically re-executing compiler (dynamic binary translation), then sure, that can be made to work.
> Now if you wanted to couple a VLIW design with a dynamically re-executing compiler (dynamic binary translation), then sure, that can be made to work.
Transmeta lived on in Nvidia's Project Denver but Denver was optimized for x86 and the Intel settlement precluded that. It ended up being too buggy/inefficient to compete in the market and effectively abandoned after the second generation.
This makes a lot of sense to me, thanks for boiling it down. Compilers can predict the upcoming code instructions decently, but not really the upcoming data, so on branching-heavy commercial/database server workloads VLIW doesn't work that well compared to the branch prediction, speculative execution, and out-of-order complexity it tried to simplify away. Does that sound right?
I think it could have worked if the IDE had performance instrumentation (some kind of tracing) which would have been fed in to the next build. (And perhaps several iterations of this.)
Another way to leverage the Itanium power would have been to make a Java Virtual Machine go really fast, with dynamic binary translation. This way you'd sidestep all the C UB optimization caveats.
One big reason is that it was 20 years ago. At that time, gcc only did rudimentary data-flow analysis, and full SSA dataflow was at best an experimental feature. Also, the market would not really accept a C compiler that does the kind of aggressive UB exploitation needed to extract the parallelism from C code (instead, people mostly tended to pass -Wno-strict-aliasing and friends in order to reduce "warning noise").
This issue is somewhat C-specific, and Fortran compilers produced decidedly better IA-64 code than C compilers, which, together with the Itanium's respectable FP performance, made it somewhat popular for HPC.
There are a number of reasons for the Itanium's poor performance, and it's the combination of these various factors that did it in. I wasn't present back in the Itanium's heyday, but this is what I gathered.
As a quick recap, superscalar processors have multiple execution units, each of which can execute one instruction each cycle. So if you have three execution units, your CPU can execute up to three instructions every cycle. The conventional way to make use of the power of more than one execution unit is to have an out-of-order design, where a complicated mechanism (Tomasulo algorithm) decodes multiple instructions in parallel, tracks their dependencies and dispatches them onto execution units as they can be executed. Dependencies are resolved by having a large physical register file, which is dynamically mapped onto the programmer-visible logical register file (register renaming). This works well, but is notoriously complex to implement and requires a couple of extra pipeline stages before decode and execution, increasing the latency of mispredicted branches.
The idea of VLIW architectures was to improve on this idea by moving the decision which instruction to execute on which port to the compiler. The compiler, having prescient knowledge about what your code is going to do next, can compute the optimal assignment of instructions to execution units. Each instruction word is a pack of multiple instructions, one for each port, that are executed simultaneously (these words become very wide, hence VLIW for Very Long Instruction Word). In essence, all the bits of the out-of-order mechanism between decoding and execution ports can be done away with and the decoder is much simpler, too.
However, things fail in practice:
* The whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
* This issue was exacerbated by the Itanium's dreadful model for fast memory loads. You see, loads can take a long time to finish, especially if cache misses or page faults occur. To fix that, the Itanium has the option to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has been validated you make use of the result. This allows you to hide the latency of the load, significantly speeding up typical business logic. However, the load can still fail (e.g. due to a page fault), in which case your code has to roll back to where the load should be performed and then do a conventional load as a back-up (see the sketch after this list). Understandably, few, if any, compilers ever made use of this feature and load latency was dealt with rather poorly.
* Relatedly, the latency of some instructions like loads and division is variable and cannot easily be predicted. So there usually isn't even the one perfect schedule the compiler could find. Turns out the schedule is much better when you leave it to the Tomasulo mechanism, which has accurate knowledge of the latency of already executing long-latency instructions.
* By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For example, Itanium is designed for a machine with three execution units, and each instruction pack has up to three instructions, one for each of them. But what if you want to put more execution units into the CPU in a future iteration of the design? Well, it's not straightforward. One approach is to ship executables in a bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding, and thus the number of ports, to vary. Intel had chosen a different approach and instead implemented later Itanium CPUs as out-of-order designs, combining the worst of both worlds.
* Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs can cheat by having a comparably small programmer-visible state, with most of the state hidden in the bowels of the processor and consequently not in need of saving or restoring.
* Branch prediction rapidly grew more and more accurate shortly after the Itanium's release, reducing the importance of fast recovery from mispredictions. These days, branch prediction is up to 99% accurate, and out-of-order CPUs can evaluate multiple branches per cycle using speculative execution, a feature that is not possible with a straightforward VLIW design due to the lack of register renaming. So Intel locked itself out of one of the most crucial strategies for better performance with this approach.
* Another engineering issue was that x86 simulation on the Itanium performed quite poorly, giving existing customers no incentive to switch. And those that did decide to switch found that if they invested in porting their software, they might as well make it fully portable and be independent of the architecture. This is the same problem that led to the death of DEC: by forcing their customers to rewrite all their VAX software for the Alpha, they created a bunch of customers that were no longer locked into their ecosystem and could now buy whatever UNIX box was cheapest on the free market.
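To make the speculative-load point above concrete, here is a rough C-level model of the transformation the compiler was expected to do. This is my own sketch, not real IA-64 code: the struct's nat flag and the helper stand in for what the hardware does with ld.s (a non-faulting load that defers errors) and chk.s (the later check that branches to recovery code).

```
#include <stddef.h>

/* Plain C model of a speculatively loaded value: "nat" plays the role of
 * IA-64's NaT (Not a Thing) bit, set when the load had to defer a fault. */
struct spec_val { long value; int nat; };

/* Hypothetical stand-in for ld.s: load early, defer any fault. Here we only
 * model the NULL case; real hardware defers page faults and the like. */
static struct spec_val speculative_load(const long *p) {
    struct spec_val v = { 0, 1 };
    if (p != NULL) { v.value = *p; v.nat = 0; }
    return v;
}

/* The guarded load as written in the source: the load waits behind the
 * branch, so its full latency sits on the critical path. */
long get(const long *p) {
    if (p != NULL)
        return *p;
    return 0;
}

/* What control speculation asks the compiler to do: hoist the load above the
 * check, schedule independent work in the gap, and only redo the load the
 * normal way if the speculative one deferred a fault (the chk.s recovery). */
long get_speculative(const long *p) {
    struct spec_val v = speculative_load(p);   /* hoisted "ld.s" */
    /* ... independent instructions could be scheduled here ... */
    if (p != NULL) {
        if (v.nat)          /* "chk.s": deferred fault detected */
            return *p;      /* recovery: conventional load as back-up */
        return v.value;
    }
    return 0;
}
```

The catch described in the bullet is that the compiler has to emit both the hoisted load and a recovery path for every load it wants to speculate, which few toolchains outside Intel's and HP's ever bothered to do.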
> To fix that, the Itanium has the option to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has been validated you make use of the result.
Way back in the day, as a fairly young engineer, I was assigned to a project to get a bunch of legacy code migrated from Alpha to Itanium. The assignment was to "make it compile, run, and pass the tests. Do nothing else. At all."
We were using the Intel C compiler on OpenVMS and every once in a while would encounter a crash in a block of code that looked something like this:
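(The actual snippet wasn't preserved; the following is my reconstruction of the kind of guarded dereference being described, not the real code.)

```
/* Reconstruction only -- not the original code. A guarded dereference where
 * the right-hand side of the && must not be evaluated when rec is NULL. */
typedef struct record { int status; } record;

static int is_active(const record *rec) {
    if (rec != NULL && rec->status == 1)   /* the crash was on this read */
        return 1;
    return 0;
}
```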
It was evaluating both parts of the if statement simultaneously and crashing on the second. Not being allowed to spend too much time debugging or investigating the compiler options, we did the following:
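(Again a reconstruction rather than the real diff; per the EDIT below, the actual change was not a strictly equivalent transformation. The shape of the workaround was to force the evaluation order so the second test could not be hoisted past the first.)

```
/* Continues the sketch above: nest the conditions instead of relying on the
 * (miscompiled) short-circuit && to guard the dereference. */
static int is_active_workaround(const record *rec) {
    if (rec != NULL) {
        if (rec->status == 1)
            return 1;
    }
    return 0;
}
```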
EDIT - I recognize that the above change introduces a potential bug in the program ;) Obviously I wasn't copying code verbatim - it was 10-15 years ago! But you get the picture - the compiler was wonky, even the one you paid money for.
The main case I ever found was implementing missing language features, e.g.:
break 3; // Break 3 levels up
break LABEL; // Break to a named label - safer-ish than goto
goto LABEL; // When you have no other option.
Usually for breaking out of a really deep set of loops to an outer loop, such as for a data stream reset, end of data, or an error so bad that a different language might, e.g., throw and usually die.
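A minimal C sketch of that pattern (names made up): since C has no multi-level or labeled break, a goto past the loop nest is the usual escape hatch.

```
#include <stdio.h>

/* Bail out of all three loops at once on bad input -- the "so bad you'd
 * otherwise throw and die" case described above. */
static int scan(int grid[8][8][8]) {
    for (int x = 0; x < 8; x++)
        for (int y = 0; y < 8; y++)
            for (int z = 0; z < 8; z++)
                if (grid[x][y][z] < 0)
                    goto bad_data;      /* e.g. stream reset or end of data */
    return 0;

bad_data:
    fprintf(stderr, "bad cell, resetting stream\n");
    return -1;
}
```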
> the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
I definitely think that keeping their compilers as an expensive license was a somewhat legendary bit of self-sabotage but I’m not sure it would’ve helped even if they’d given them away or merged everything into GCC. I worked for a commercial software vendor at the time before moving into web development, and it seemed like they basically over-focused on HPC benchmarks and a handful of other things like encryption. All of the business code we tried was usually slower even before you considered price, and nobody wanted to spend time hand-coding it hoping to make it less uneven. I do sometimes wonder if Intel’s compiler team would have been able to make it more competitive now with LLVM, WASM, etc. making the general problem of optimizing everything more realistic but I think the areas where the concept works best are increasingly sewn up by GPUs.
Your comment with DEC was spot-on. A lot of people I met had memories of the microcomputer era and were not keen on locking themselves in. The company I worked for had a pretty large support matrix because we had customers running most of the “open systems” platforms to ensure they could switch easily if one vendor got greedy.
> By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For example, Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them. But what if you want to put more execution units into the CPU in a future iteration of the design? Well, it's not straightforward. One approach is to ship executables in a bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding and thus the number of ports to vary.
This was how Sun's MAJC[0] worked -
" For instance, if a particular implementation took three cycles to complete a floating-point multiplication, MAJC compilers would attempt to schedule in other instructions that took three cycles to complete and were not currently stalled. A change in the actual implementation might reduce this delay to only two instructions, however, and the compiler would need to be aware of this change.
This means that the compiler was not tied to MAJC as a whole, but a particular implementation of MAJC, each individual CPU based on the MAJC design.
...
The developer ships only a single bytecode version of their program, and the user's machine compiles that to the underlying platform. "[0]
> Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them.
The design was that each bundle had some extra bits including a stop which was a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit and the last one would set it. That meant the whole series could be safely scheduled on a future wide IA64 machine. Of course that meant the compiler had to be explicit about that parallelism (hence EPIC) but future machines would be able to schedule on the extra execution units.
This also addressed the problem where VLIW traditionally would require re-compilation to run/run more efficiently on newer hardware.
> Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs can cheat by having a comparably small programmer-visible state, with most of the state hidden in the bowels of the processor and consequently not in need of saving or restoring.
Itanium borrowed the register windows from SPARC. It was effectively a hardware stack that had a minimum of 128 physical registers but were referenced in instructions by 6 bits — e.g. 64 virtual registers, iirc.
So you could make a function call and the stack would push, and a return would pop. Just like SPARC, except they weren't fixed-size windows.
That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) for, say, an OS context switch was quite heavy, since you'd have to write the whole RSE state to memory.
It was pretty cool reading about this stuff as a new grad.
> Another engineering issue was that x86 simulation on the Itanium performed quite poorly, giving existing customers no incentive to switch.
As I mentioned in my previous comment, Merced had a tiny corner of the chip devoted to the IVE (Intel Value Engine), which was meant to be a very simple 32-bit x86 core used mainly for booting the system. The intent was (and the docs also had sample code) to boot, do some setup of system state, and then jump into IA-64 mode, where you would actually get a fast system.
I think they did devote more silicon to x86 support but I had already served my very short time at HP and Merced still took 2+ years to tape out.
> The design was that each bundle had some extra bits including a stop which was a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit and the last one would set it. That meant the whole series could be safely scheduled on a future wide IA64 machine. Of course that meant the compiler had to be explicit about that parallelism (hence EPIC) but future machines would be able to schedule on the extra execution units. This also addressed the problem where VLIW traditionally would require re-compilation to run/run more efficiently on newer hardware.
Thanks, that makes sense. I did not understand the intent of the stop bits correctly.
However, it still seems like the design wouldn't scale super well: if you have fewer ports, you want to schedule dependent instructions on the critical path as early as possible, even if other independent (but not latency-critical) instructions could be scheduled earlier, incurring extra stop bits. So while some degree of performance portability is designed into the hardware, the compiler may have a hard time generating code that is scheduled well on both 3-port and possible future 6-port machines.
This reminds me of macro-fusion, where there's a similar contradiction: macro-fusion only triggers if the fusable instructions are issued back to back. But when optimising for a multi-issue in-order design, you usually want to interleave dependency chains (i.e. not issue dependent instructions back to back) so that all the pipelines are kept busy. So unless the pairs that fuse are the same on all of them, it's very hard to generate code that performs well on a variety of microarchitectures.
I don't remember if the parent article mentioned it but there were also a bunch of things like the predicate bits for predicated execution and I remember trying to gain an advantage using speculative loads was also very tricky. In the end it was pretty gnarly.
The other bit no one mentions is that it was an HP-Intel alliance. HP committed to PA-RISC compatibility with a combination of hardware and software whereas Intel just expected stuff to run.
From the instruction reference guide:
```
Binary compatibility between PA-RISC and IA-64 is handled through dynamic object code translation. This process is very efficient because there is such a high degree of correspondence between PA-RISC and IA-64 instructions. HP’s performance studies show that on average the dynamic translator only spends 1-2% of its time in translation with 98-99% of the time spent executing native code. The dynamic translator actually performs optimizations on the translated code to take advantage of IA-64’s wider instructions, and performance features such as predication, speculation and large register sets
```
There was some hardware support for 32-bit userspace binaries. See the addp4 instruction.
> That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) for, say, an OS context switch was quite heavy, since you'd have to write the whole RSE state to memory.
I've read that the original intention for the RSE was that it would have saved its state in the background during spare bus cycles, which would have reduced the amount of data to save when a context switch happened.
Supposedly, this was not implemented in early models of the Itanium. Was it ever?
> * the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
Was Intel's compiler actually able to get good performance on Itanium? How much less screwed would Itanium have been if other toolchains matched the performance on Intel's compiler?
Also, I vaguely remember reading that Itanium also had a different page table structure (like a hash table?). Did that cause problems too?
Intel’s compiler was a bit better than some but still wasn’t great. Largely, Intel quickly lost interest in Itanium when AMD64 started selling well. HP had their own tooling, and HP was pretty much the only customer buying Itaniums. Intel quit investing in Itanium beyond what their contractual obligations to HP dictated.
I am curious about what could have been, but my assumption is that a mature and optimized software industry would be required. This was never going to happen after the launch of AMD64.
It's a long time ago but the thing I remember the most is that the binaries were huge, around 3x the size of x86 binaries. At the time we were very space constrained and that aspect alone was big concern. If the performance had been there it might still have been worth pursuing, but the performance never exceeded the fastest x86 processors at the time.
I didn't know these things, I don't think they are part of the meme-lore about Itanium:
- The problems with the fast load misses and compiler support
- I didn't understand the implications of a completely visible register file
- The trouble with "hard coding" three execution units. Very bad if you can't recompile your code and/or bytecode to a new binary when you get a new CPU.
Your last point about coding your way out of the ecosystem, I wonder if that might have been a reason for why Intel didn't go all-in to make Itanium the Java machine...
These were (intended for) Unix machines, not general purpose PCs… the assumption was that everyone was compiling anything that went on the system anyways, or were buying licenses for a specific hardware box. So at least it wasn’t considered to be a problem at the time.
One other unmet hope was that improvements in compiler technology would give a performance boost over time; they were hoping for up to 10x over the life of the program, which seemed wishful to me at the time (a lowly validation engineer just out of college). But if it had worked out, theoretically your old programs would have gotten faster over time just by recompiling, which would have been cool…
"Something of a tragedy: the Itanium was Bob Rau's design, and he died before he had a chance to do it right.
His original efforts wound up being taken over for commercial reasons and changed
into a machine that was rather different than what he had originally intended and the result was the Itanium.
While it was his machine in many ways, it did not reflect his vision."
The total lack of sources and references (other than to articles on this very blog) is annoying, to say the least. Is there anything at all to read on this alleged Elbrus influence on Itanium plans, in Russian or English?
HP partnered with Intel to bring HP's Playdoh vliw architecture to market, because HP could not afford to continue investing in new leading-edge fabs. Compaq/DEC similarly killed Alpha shortly before getting acquired by HP, because Compaq could not afford its own new leading edge fab either. SGI spun off its MIPS division and switched to Itanium for the same reason -- fabs were getting too expensive for low-volume parts. The business attraction wasn't Itanium's novel architecture. It was the prospect of using the high-volume most profitable fab lines in the world. But ironically, Itanium never worked well enough to sell in enough volumes to pay its way in either fab investments or in design teams.
The entire Itanium saga was based on the theory that dynamic instruction scheduling via OOO hardware could not be scaled up to high IPC with high clock rates. Lots of academic papers said so. VLIW was sold as a path to get high IPC with short pipelines and fast cycle times and less circuit area. But Intel's own x86 designers then showed that OOO would indeed work well in practice, better than the papers said. It just took huge design teams and very high circuit density, which the x86 product line could afford. That success doomed the Itanium product line, all by itself.
Intel did not want its future to lie with an extended x86 architecture shared with AMD. It wanted a monopoly. It wanted a proprietary, patented, complicated architecture that no one could copy, or even retarget its software. That x86-successor arch could not be yet another RISC, because those programs are too easy to retarget to another assembler language. So, way beyond RISC, and every extra gimmick like rotating register files was a good thing, not a hindrance to clock speeds and pipelines and compilers.
HP's Playdoh architecture came from its HP Labs, as had the very successful PARISC before it. But the people involved were all different. And they could make their own reputations only by doing something very different from PARISC. They sold HP management on this adventure without proving that it would work for business and other nonnumerical workloads.
VLIW had worked brilliantly in numerical applications like Floating Point Systems' vector coprocessor. Very long loop counts, very predictable latencies, and all software written by a very few people. VLIW continues to thrive today in the DSP units inside all cell phone SOCs. Josh Fisher thought his compiler techniques could extract reliable instruction-level parallelism from normal software with short-running loops, dynamically-changing branch probabilities, and unpredictable cache misses. Fisher was wrong. OOO was the technically best answer to all that, and upward compatible with massive amounts of existing software.
Intel planned to reserve the high-margin 64-bit server market for Itanium, so it deliberately held back its x86 team from going to market with their completed 64 bit extensions. AMD did not hold back, so Intel lost control of the market it intended for Itanium.
Itanium chips were targeted only for high-end systems needing lots of ILP concurrency. There was no economic way to make chips with less ILP (or much more ILP), so no Itanium chips cheap and low-power enough to be packaged as development boxes for individual open-source programmers like Torvalds. This was only going to market via top-down corporate edicts, not bottom-up improvements.
The first-gen Itanium chip, Merced, included a modest processor for directly executing x86 32-bit code. This ran much slower than Intel's contemporary cheap x86 chips, so no one wanted that migration route. It also ran slower than using static translation from x86 assembler code to Itanium native code. So HP dropped that x86 portion from future Itanium chips. Itanium had to make it on its own via its own native-built software. The large base of x86 software was of no help. In contrast, DEC designed Alpha and migration tools so that Alpha could efficiently run VAX object code at higher speeds than on any VAX.
> Intel planned to reserve the high-margin 64-bit server market for Itanium, so it deliberately held back its x86 team from going to market with their completed 64 bit extensions. AMD did not hold back, so Intel lost control of the market it intended for Itanium.
Is there anything I can read about what Intel planned for their x86 extension to 64 bits? I'm curious about this road not taken.
> Itanium chips were targeted only for high-end systems needing lots of ILP concurrency. There was no economic way to make chips with less ILP (or much more ILP), so no Itanium chips cheap and low-power enough to be packaged as development boxes for individual open-source programmers like Torvalds. This was only going to market via top-down corporate edicts, not bottom-up improvements.
One wonders why they have not learned from this mistake. They continue to make it again and again (AVX-512 and NVRAM are some more recent examples). If the ordinary joe can't get his hands on a box with the new stuff, he's not going to port his software to it or make use of its special features.
> The first-gen Itanium chip, Merced, included a modest processor for directly executing x86 32-bit code. This ran much slower than Intel's contemporary cheap x86 chips, so no one wanted that migration route. It also ran slower than using static translation from x86 assembler code to Itanium native code. So HP dropped that x86 portion from future Itanium chips. Itanium had to make it on its own via its own native-built software. The large base of x86 software was of no help. In contrast, DEC designed Alpha and migration tools so that Alpha could efficiently run VAX object code at higher speeds than on any VAX.
Seems like Apple learned from that. Both generations of Rosetta have top-notch performance. Hard to emulate bits were circumvented by just adding extra features to the CPU that directly implement missing functionality (e.g. there's an x86-like parity flag on the Apple M1).
Mill Computing thinks so, with their "The Mill" architecture. Its proponents have sometimes described it as "Itanium done right".
The Mill uses a variable length encoding at the bit level. They also avoid encoding destination registers. This alleviates one large drawback of VLIW: code density.
Itanium itself has ~42-bit instructions: combined with EPIC, even highly optimised, the code density was often less than half that of contemporary RISC architectures. Many Mill instructions are 16-24 bits wide.
Mill programs are also supposed to be distributed in an intermediate format (think LLVM-IR, WebAssembly or ANDF), to be compiled at install-time.
This is supposed to decouple the platform in the long run from the actual instruction set on that particular CPU, allowing the instruction encoding to change between models.
However, Mill Computing has worked on their design for many years now without any product announcements. Many are afraid that its patents will expire or be sold before they get anything released.
What I personally like most about the architecture is not the promised performance but features for program security and microkernels.
A VLIW has a lot of functional units that are used simultaneously.* Modern processors (out of order, superscalar, using speculative execution, pick your techniques and buzzwords) allow these units to be used simultaneously and dynamically depending on the instruction mix. With VLIW some instruction units won’t be used and so won’t contribute to performance.
On a modern CPU the compiler can give hints, but the CPU also decides what to do and when. With VLIW (EPIC is probably a better name in this regard) you have to guess right up front, without knowing what the data will be.
* So does SIMD, but in VLIW you don’t have to have a single instruction.
VLIW is pretty well represented in the TOP500 supercomputers and in various other performance niches.
What isn't well represented is not so much VLIW as EPIC, the explicit parallelism of Itanium, which among other things didn't really support out-of-order execution or branch prediction in any way other than through the compiler's code generation.
High levels of SMT (sometimes with just a barrel execution model) are used in GPUs to smooth out the performance characteristics involved, from my understanding.