AMD Killed the Itanium (2005)

FullyFunctional · on Feb 4, 2023

Yeah this isn't quite what happened. Firstly, Intel didn't start Itanium, HP did, as a successor to their HP Precision line. I forget how they got together, but it was a collaboration between Intel and HP, but HP started it and had largely the architecture defined before Intel got involved.

Secondly, it's true that AMD hammered the nails in the coffin, but AMD wouldn't have mattered if Itanic had been faster, cheap, and on time. Itanic was a disaster partly because of overly complicated design by committee and partly because of the fundamentally flawed assumption (that you don't need dynamic scheduling, AKA OoO processing).

I have an Itanium in the garage, a monument to hubris.

UPDATE: I forgot to mention that from the outside it might seem that Intel had a singular vision, but the reality is that there were massive political battles internally and the company was largely split into IA-64 and x86 camps.

UPDATE2: Itanium was massively successful in one thing: it killed off Alpha and a few other cpus, just based on what Intel claimed.

jasoneckert · on Feb 4, 2023

I've always thought that killing off Alpha in favour of pushing Itanium was one of the worst things Intel/HP could have done. Not only was Alpha more advanced architecturally, it was actively implemented and mature. With active development by HP, it could have easily snowballed into the standard cloud hardware platform.

jabl · on Feb 4, 2023

Alpha's fate, like the other proprietary RISC architectures that focused on the lucrative but in the end small workstation market, was sealed. With exponentially increasing R&D and manufacturing costs, massive industry consolidation was inevitable. That it was Itanium that delivered the coupe de grace to Alpha was but the final insult, but it would have happened anyway without Itanium.

And it wasn't like the Alpha was some embodiment of perfection either. E.g. that mindbogglingly crazy memory consistency model.

hajile · on Feb 4, 2023

Intel bought Alpha from HP/Compaq.

The biggest reason Alpha was a "workstation" chip was margin and R&D issues. It was fast, but they couldn't manufacture in high volume which drove per-chip costs much higher than they could have been if paired with a company like Intel. Meanwhile, complete dependence on manual layout for everything pushed development cost and time to market too far out. Once again, Intel's design tools could have helped reduce this overhead.

runroader · on Feb 4, 2023

I don't disagree with the main points, but Alpha wasn't just focused on the small workstation market. Alpha for lots of us in the IT departments of SMBs was the go-to when your Exchange server couldn't handle the load anymore. DEC and by extension Alpha died as soon as Ken Olson was pushed out.

_a_a_a_ · on Feb 4, 2023

> that focused on the lucrative but in the end small workstation market

I don't at all believe this was true.

> that mindbogglingly crazy memory consistency model

I guess the utterly competent designers were actually stupid eejits then? The memory consistency was AFAI could determine to reduce to the utmost the hardware guarantees and therefore hardware complexity. It was done for speed.

I have the greatest respect for the Alpha design team, they designed a thing of elegance and even beauty. You could learn a lot from it - I did.

jabl · on Feb 5, 2023

> I guess the utterly competent designers were actually stupid eejits then?

No, I don't think that. But I don't think they had some superhuman foresight either, and they made some decisions that in retrospect were not correct. And with the memory consistency model, they made the classic RISC mistake of encoding an idiosyncrasy of their early implementation into the ISA (similar to delay slots on many early RISC's).

> You could learn a lot from it - I did.

I used Alpha workstations, servers and supercomputers for my work for several years back in the day. They were good, but not magical, and even back then it was quite clear there was no long term future for Alpha.

_a_a_a_ · on Feb 5, 2023

I'm not a chip designer, far from it, but my understanding is that they really knew what they were doing and this was quite deliberate. See this comment https://news.ycombinator.com/item?id=17672467

I never suggested alphas were magical, but they did seem extremely good and they were designed for future expansion, it seemed to me they were killed off very much not by competitor supremacy.

Good answr, thanks

Tuna-Fish · on Feb 5, 2023

It was a deliberate choice, but it was a poor one. It made implementing a fast cpu easier, but it also made the consistency model very hard for programmers to reason about, and required much more explicit barriers than any other consistency model.

These barriers also meant that correct multi-threaded alpha code was no longer particularly fast, because you have to insert expensive memory barriers basically everywhere.

Had Alpha not died early, they would absolutely have eventually moved towards a more strict memory model. As it was, it was essentially an irrelevant architecture by the time people started really hitting all the pitfalls.

_a_a_a_ · on Feb 6, 2023

IIRC alpha-model memory barriers are still used in the linux kernel. That said, I can't find a clear statement of that so I don't know if it is true or was, or just my own memory.

> These barriers also meant that correct multi-threaded alpha code was no longer particularly fast, because you have to insert expensive memory barriers basically everywhere.

I don't buy it. MBs are for multi-core code, and in such code you typically do much work on a single core then have a quick chat with another core. So the MBs are there for the inter-core chatter only. In that case having fast monocore code is a big win.

jabl · on Feb 6, 2023

> IIRC alpha-model memory barriers are still used in the linux kernel. That said, I can't find a clear statement of that so I don't know if it is true or was, or just my own memory.

The various memory barriers and locking primitives are arch-specific code, and at least smp_read_barrier_depends() is a no-op on all architectures except Alpha. Apparently around the 4.15-4.16 kernels there was a bit of de-Alphafication going on which entailed removing much Alpha-specific code from core kernel code. Further in 5.9 {smp_,}read_barrier_depends() were removed from the core barriers, at the cost of making some of the remaining memory barriers on Alpha needlessly strong.

For more info, search for Alpha e.g. in

https://www.kernel.org/doc/Documentation/memory-barriers.txt

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

https://open-std.org/JTC1/SC22/WG21/docs/papers/2020/p0124r7...

https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul...

_a_a_a_ · on Feb 6, 2023

Another cracking answer, I'll have a dive in this evening, thanks.

ghaff · on Feb 4, 2023

What would become Itanium was publicly announced about eight years before HP acquired Compaq.

cesaref · on Feb 4, 2023

I seem to remember Alpha had some very weird memory coherency behaviour on multi-processor machines, but apart from that it was excellent.

inkyoto · on Feb 4, 2023

Itanium was a second Intel VLIW design. The first one was the i860, which was a mixed bag of either being eye popping fast, if instruction bundles were handcrafted by a human, or being as slow as a dog if it was a compiler that emitted the code.

Perhaps, there was a belief back then that compilers could be easily optimised or uplifted to generate fast and efficient code, and that did not turn out to be the case. Project management and mismanagement certainly did not help either.

I wonder how a VLIW architecture would pan out today given advances in compilers in last three decades, and whether a ML assisted VLIW backend could deliver on the old dream.

pclmulqdq · on Feb 4, 2023

GPUs today have a more VLIW-like architecture than CPUs and almost every neural network accelerator is a VLIW chip of some kind. It's worked out really well.

The big problem is SMT, since it's hard to share a VLIW core between processes, while a superscalar core shares really well.

cwzwarich · on Feb 4, 2023

> GPUs today have a more VLIW-like architecture than CPUs

Older GPUs seemed more VLIW-like because they were descended from fixed-function rendering pipelines and essentially just exposed the control signals via instruction encodings. Over time, shader cores have become less VLIW-like, e.g. look at any reverse engineering of recent Nvidia architectures.

This makes sense for the same reason you give for SMT: if you're trying to execute from multiple instruction streams on the same execution units, it makes more sense to use small individual instructions rather than large puzzle pieces.

raphlinus · on Feb 4, 2023

I wouldn't describe a GPU architecture as VLIW-like, though there is some overlap. To me, the essence of VLIW is extracting instruction level parallelism by specifying lots of different operations in the same instruction word. To this end, the instruction word has to be quite big (128 bits on Itanium) to specify all those operations.

A modern CPU has the same goal of extracting ILP but accomplishes it in a very different way. The instruction stream is short instructions, each of which specifies a simple operation, and these are reassembled using sophisticated dynamic logic into micro-ops, which then get executed in a fairly similar fashion as a VLIW machine - there are a large number of ports (8 is fairly typical these days), each of which performs a separate operation such as arithmetic, load/store, branch, etc.

A GPU has a similar goal of extracting lots of parallelism but does it in a very different way to both VLIW and modern superscalar CPUs. Each instruction operates over a large SIMD vector - 32 is typical, but this varies from 8 (Intel SIMD-8) to 128 (Imagination & optionally Adreno). The instruction specifies many copies of the same operation, so doesn't have to be that big. On RDNA3 for example[1], the basic instruction size is 32 bits, but 64 bits is also common (see section 6.1 for a summary of scalar and 7.1 for a summary of vector encodings).

These instruction sizes are a bit bigger than typical for CPU, for two main reasons. First, there are a lot of registers (256 vector registers), so that needs a lot of bits to encode. Second, it's common to add extra operations such as negation or absolute value in the same operation. But these operations are generally fairly inexpensive modifications on existing data, not completely separate as in VLIW.

In general, execution on a GPU is in-order, so all the reorder buffers and other techniques of superscalar CPUs are not used. Instead of trying to extract as much parallelism as possible from a single thread, a GPU will use that transistor budget to splat more execution units (and thus more threads) on the chip.

[1]: https://developer.amd.com/wp-content/resources/RDNA3_Shader_...

egorpv · on Feb 4, 2023

ATI/AMD used VLIW-based TeraScale sometime in 2007-2011 [1], although some stories claim it started even earlier [2], but in any case they later dropped it in favor of RISC-based GCN. And I'm not sure if Nvidia ever had anything VLIW-based at all.

[1] https://en.wikipedia.org/wiki/TeraScale_(microarchitecture)

[2] https://www.anandtech.com/show/4455/amds-graphics-core-next-...

inkyoto · on Feb 4, 2023

Indeed. RISC-V instruction fusion is a little bit VLIW like. I wonder how it handles the fused instruction transition across CPU cores tho (I have not looked into it).

pclmulqdq · on Feb 4, 2023

Most superscalar cores do macro-op fusion, including ARM and x86. They don't transition the fused op - it usually executes in one cycle (which is the point of the fusion) so you transition either before or after it.

throwawaymaths · on Feb 4, 2023

> fundamentally flawed assumption (that you don't need dynamic scheduling, AKA OoO processing)

I'm still unconvinced this is fundamental. It certainly was flawed back then, but compiler theory has improved a LOT since then, we have polyhedral optimization, e.g. that we didn't have access to... You could probably optimize delay line technology that way.

JonChesterfield · on Feb 4, 2023

If you know how long data will take to go to/from memory then you can schedule pretty well.

If you don't know whether some value will hit in the L1 or the L3 cache there's wild variance on how long it'll take so you have to do something else in the meantime. On x64, that's the pipeline and speculation. On a GPU, you swap to another fibre/warp until the memory op finished.

Fundamentally the hardware knows how long the memory access took, the software can only guess how long it will take. That kills effective ahead of time scheduling on most architectures.

nullc · on Feb 4, 2023

I don't think building a threaded cpu that switches threads to hide memory latency would be philosophically incompatible with VLIW. I think the later generation itaniums were threaded too. VLIW might make it hard to share functional units, but don't-- the relatively simplicity of the core should let you have more copies of the whole shabang. The SMT would just be for hiding memory accesses not for higher functional unit utilization.

On x86 SMT gets a bad rap because its never been tremendously effective. It's much more effective in IBM's implementation on Power. It wouldn't be hard to imagine some Itanium SMT monster.

yosefk · on Feb 4, 2023

How many threads would it take to hide 100ns of DRAM access latency and what serial performance would we get?

nullc · on Feb 5, 2023

It doesn't have to be hidden completely, there are still caches.

With 2 unstalled threads you effectively halve it (in terms of throughput), 4 unstalled threads you effectively quarter it, etc.

Kon-Peki · on Feb 4, 2023

> It wouldn't be hard to imagine some Itanium SMT monster.

Itanium did have SMT

throwawaymaths · on Feb 4, 2023

Iirc caches mostly exist to help keep pipelines full. You could maybe imagine an architecture with deterministic memory access times from specified regions.

cwzwarich · on Feb 4, 2023

Such architectures exist (e.g. for signal processing). They just aren't good for running general-purpose software.

Kon-Peki · on Feb 4, 2023

Itanium sold, barely, and still has people using it. But, as you can imagine, not with general-purpose software.

By the 2nd or 3rd generation, Intel had made some changes that really improved performance a lot. Was the final generation of Itanium the 4th generation?

anonymoushn · on Feb 5, 2023

> Iirc caches mostly exist to help keep pipelines full

You can find out by disabling the stride prefetcher and observing the 90+% performance regression in all your software!

nordsieck · on Feb 4, 2023

> You could maybe imagine an architecture with deterministic memory access times from specified regions.

Sure.

But it's very easy to level down. And almost impossible to level up.

moonchild · on Feb 4, 2023

Applications tend to exhibit a lot of dynamic behaviour. Consider any graph munger. I do expect there is an interesting subset of applications, particularly those that have fairly narrow scope and that have been highly optimised, which could be effectively statically scheduled. But for general-purpose computation, I don't buy it.

Perhaps as part of the trend towards increasingly heterogeneous architectures, we'll see big VLIW coprocessors for power efficiency in certain serial workloads (GPUs are massively parallel but little VLIW coprocessors; however, they do have dynamic scheduling).

killerstorm · on Feb 4, 2023

Itanium also does dynamic scheduling. It's not really VLIW - it is just optimized to decode and schedule a lot of instructions at once.

FullyFunctional · on Feb 4, 2023

That’s not what dynamic scheduling means. What you need is to dynamically extract ILP so you can get good performance on single thread applications in light of branch mispredictions and cache misses. We are talking SPECint, not SPECfp nor throughput (threading). x86 chips and Apple Silicon show what you can do here and to do that with IA-64 would have been harder, not easier.

throw0101c · on Feb 4, 2023

> It certainly was flawed back then, but compiler theory has improved a LOT since then

Which would be fine if there was a time machine available to send back what we know now to the past. But at the time having sucky compilers for what you want to (try to) accomplish was a bad idea.

The same decision now may be good, but then it was a mistake. It'd be like trying to have a Moon landing project in the 1920s: we got there eventually, but certain things are only possible when the time 'is right'.

kjs3 · on Feb 5, 2023

Spot on. I think it's also important to point out that "the compiler will fix it" line wasn't new with Itanic. The Multiflow guys (for one) said the same thing for their Trace line of VLIW machines 15 years or so before HP & Intel, and that didn't work so well. So now, 40 years down the line, can the compiler really fix it, or do we acknowledge that maybe some things are theoretically nifty but don't work out in practice and move on? I don't know the answer to that, but I have my suspicions.

FullyFunctional · on Feb 4, 2023

compiler technology doesn’t help when cache misses and branches aren’t predictable. That’s the fatal flaw.

therealcamino · on Feb 6, 2023

This is the same argument that was being made in the 90s, ironically, that compilers were now good enough that it would work.

The thing is, on regular, array-based codes, it's great. And it was then, too -- the polyhedral approach was all being developed at exactly that time, but maybe it's not clear in hindsight because the terminology hadn't settled yet. Ancourt and Irigoin's "Scanning Polyhedra With Do Loops" was published in 1991. Lots of the unification of loop optimization was labeled affine optimization or described as based on Presburger arithmetic. But that is the technology that they were depending on to make it work.

But most code we run is not Fortran in another guise. The dependences aren't known at compile time. That issue hasn't changed much.

The one change now is that workloads that were called "scientific computing" then are now being run for machine learning. But now it doesn't make sense to run regular, FP-intensive codes on a CPU at all, because of GPUs and ML accelerators. So what's left for CPUs that excel on that workload? I'm not sure there is a niche there.

hajile · on Feb 4, 2023

The halting problem means that for typical programs, you can't prove control flow. Proving optimization means not only proving control flow, but then knowing which branches get taken most often so you can optimize. There may be a small subset of programs for which this is true, but the rest leave the compiler completely blinded.

OoO execution does an end run around this by examining the program as it runs and adapting to the current reality rather than a simulated reality. This is the same reason a JIT can do optimizations that a compiler cannot do. The ability to look 500-700 instruction into the future to bundle them together into a kind of VLIW dynamically is a very powerful feature.

As to compiler theory, it really isn't that advanced. Our "cutting edge" compilers are doing glorified find-and-replace for their final optimizations (peephole optimization).

Look at SIMD and auto-vectorization. There are so many potential questions the compiler can't prove the answer to that even trivial vectorization that any programmer would identify can't be used by the compiler to the point where the entire area of research has resulted in basically zero improvements in real-world code.

ilyt · on Feb 4, 2023

Because even if you get compilers to optimize for your current VLIW, that makes it harder to make improvements down the line.

Current way, while arguably pretty wasteful on all the micro-optimizing CPU does on incoming bytecode, allows designers to nearly freely expand hardware to meet the needs without having compilers to produce different code.

Findecanor · on Feb 4, 2023

I've heard a rumour that supposedly one of the lead designers of the IA-64 architecture had died prematurely mid-project. And that would have left the project without the man with the vision. Hence, design by committee.

_a_a_a_ · on Feb 4, 2023

Frank Dobberpuhl died in 2019.

Also the Alpha was the utter opposite of design by committee. A quick skim of the alpha ISA would show you that.

mepian · on Feb 4, 2023

IA-64 is the Itanium, not the Alpha.

_a_a_a_ · on Feb 4, 2023

You are right, I misread, my bad.

amrb · on Feb 4, 2023

Talk about bad omens!

Findecanor · on Feb 9, 2023

Found where I heard it: <https://youtu.be/JS5hCjueqQ0?t=4054>

hinoki · on Feb 4, 2023

I thought the biggest problem with Itanium was the fact that it was optimising the wrong thing; it maximised single thread performance by going all in on speculative execution, but it turns out that optimising joules per instruction is much more important.

ghaff · on Feb 4, 2023

Certainly one of the issues with Itanium was it was fighting the last war.

It was trying to optimize for instruction-level parallelism when power efficiency and thread level parallelism were coming into vogue. Arguably companies like Sun overoptimized for the latter too soon but it was the direction things were going.

A senior exec at Intel told me at the time the focus on frequency in the case of Netburst was driven by Microsoft being uncomfortable with highly multi-core designs--and I have no reason to doubt that was one of the drivers. There was a lot of discussion around the challenges of parallelism, especially on the desktop, at the time. It generally wasn't the problem the hand wringing suggested it would be.

zozbot234 · on Feb 4, 2023

Parallelism is still very much underused on desktop. Most desktop CPU's are used much of their time for running single-threaded JavaScript from some clunky website - no parallelism whatsoever. It's only with the latest-gen CPU's that have things like big.LITTLE that it's becoming a real game changer.

nomel · on Feb 4, 2023

How does big.LITTLE change anything related to your scenario? From my understanding, they’re just high efficiency cores that influence parallelism no differently than any other multi core CPU.

hakfoo · on Feb 4, 2023

I think if you say to OS and software vendors "here are a bunch of toy cores, do something useful with them", it finally justifies a nudge towards smarter scheduling. With ordinary SMP, there's little reason not to just assign "first free core".

I always envisioned a day where we'd have devote the small cores to "parasitic load" tasks-- your media player, Slack/Discord/etc., and a thousand OS maintenance threads. They might run at 95% load to do not very much, but it's no big deal-- the actual software you cared about now has the big cores to itself, and context switching (along with cache and branch-prediction losses) are reduced. I could even imagine getting to the point where tasks could requisition cores on a "no disruptions until actually yielded by the main process" basis for real-time or maximum performance tasks.

hedgehog · on Feb 4, 2023

In the last few years this is pretty much where we've gotten to. Sometimes I leave Activity Monitor open on my M1 Mac and the four little cores keep the big cores mostly idle until there's a lot of work to do (a build or sigh scrolling a web page).

somat · on Feb 4, 2023

I suspect that long term big.little is not going to be all that important. It is more of a stepping stone along the way.

My logic goes something like "why have 4 small cores and 4 big cores when you could have 8 big cores". Then the argument goes "yes but the small cores use less power", to which my replay is "true, but the big cores finish faster and can spend more time sleeping, I think the power argument evens out ether way". The real gain is to have better power management for the cores, not by having weird small cores.

hedgehog · on Feb 4, 2023

There are some wrinkles to this: First, big cores are less efficient per unit work because there's a lot of complexity in things like reordering, speculation, and pipelining and that complexity costs energy. A design optimized for peak throughput per time looks different than a design optimized for peak throughput per energy. Second, reducing area and energy for the cores allows spending those resources somewhere else like larger caches which, to a point, can reduce energy further. Third, realtime workloads can't really be delayed without compromising user experience so you're going to be waking some cores up constantly anyway. Might as well have a few efficiency-optimized cores and pile as much on them as possible to keep latency and energy consumption low.

Narishma · on Feb 4, 2023

Small cores are much smaller, so the question is more "why have 4 small cores and 4 big cores when you could have 5 big cores".

giantrobot · on Feb 4, 2023

Not all tasks can finish fast. They're just constant low level jobs waiting on IO most of the time. Operating systems are full of suck tasks. It's better to let them run on the LITTLE cores since integrated over time they'll use less power than the big cores.

andromeduck · on Feb 4, 2023

doesn't work as well if you're optimizing for latency

ghaff · on Feb 4, 2023

I think it's generally fair to say though that the applications that really need a lot of performance (e.g. multimedia) multi-thread pretty well and there are typically a lot of background tasks running that consume core cycles as well. What is probably more generally true is that a modern laptop or desktop is ridiculously overpowered for most of what we throw at it. I'm typing this on my downstairs 2015 MacBook and it's perfectly fine for the almost entirely browser-based tasks I throw at it.

zdragnar · on Feb 4, 2023

My wife has only ever owned cheap Chromebooks, and has never complained that they were slow. I've used them with her streaming videos and such, and I agree- simple web browsing isn't slowing anything down on modern hardware.

Even on a modern ultralight laptop, I can run two chrome profiles, three instances of vscode running different projects, docker and a few other things and the CPU never gets pegged. There's a ton of memory pressure from a memory leak somewhere that I haven't bothered tracking down yet- I suspect the SWC compiler (thanks, rust) but haven't proven it yet. All that and I'm still getting 8+ hours of battery life.

pas · on Feb 5, 2023

Which notebook model is this? Which OS are you running on it? Would you buy it again given the same budget today? (Or said in another way, is there something better now?)

zdragnar · on Feb 7, 2023

I'm running an lg gram, with the swaywm flavor of Manjaro. So far, everything hardware-wise has worked pretty flawlessly, though given the option, I'd rather have 32 gigs of ram than 16.

I'm really tempted by the idea of a Framework style laptop with user serviceable parts, but my work style has me moving a fair amount, so not having battery or thermal issues is such a boon I don't know that I could make the switch. In the last 6 hours I've written and compiled code, run tests, attended video calls, streamed video from websites, browsed the internet for recipes for dinner, chatted on slack and am still on 46% battery remaining. I have yet to hear the fans turn on.

The two areas this thing will fall down on is music and gaming. The speakers are pretty bad, and though it can run light games off of steam, I doubt it would do well with anything super graphically intense (though I haven't actually tried much, to be honest). Also, the built-in webcam sucks, but decent webcams and headphones are cheap, so it's really only games that you'd want something else for.

pas · on Feb 8, 2023

Thanks!

I've looked at the LG gram before (the 2021 model [0]), but wasn't convinced. And now after seeing a friend's new Lenovo legion 5 I'm even more uncertain of which one should I pick. (That Lenovo has a handy button to set the power envelope, which seems to actually work.)

I also usually move a lot, but I don't want to optimize for that. It's easier to find a power outlet than to cool a throttled laptop.

[0] https://www.rtings.com/laptop/reviews/lg/gram-17-2021

zdragnar · on Feb 8, 2023

I can't speak for how Windows behaves on it, but I've never noticed any throttling. The only time the fans have kicked on is when the battery is plugged in and charging. As I said before, though, I also don't really game on it.

I don't know if / how it runs linux, but I've got a friend with a legion, and was also happy with it.

If you'd rather have better graphical performance than battery life, definitely go with the legion. If you can't stand the thought of being tethered to a power cord every time you have to do something serious, then you might want to consider the gram.

I've had gaming laptops before, and after putting the battery through the ringer after awhile it was a struggle to get 4 hours unplugged, which I simply didn't want to deal with again.

pjmlp · on Feb 4, 2023

It is heavily used by all those processes screaming for CPU attention, though.

One can really notice it on dual cores.

hajile · on Feb 4, 2023

Fun fact: Intel had a couple CPUs codenamed Tejas that were going for 50-ish pipeline stages and 7-10GHz before being abandoned as basically impossible.

https://www.youtube.com/watch?v=qzZfkbHuB3U

ptx · on Feb 4, 2023

> the fundamentally flawed assumption (that you don't need dynamic scheduling, AKA OoO processing)

Could this have worked better with JIT-compiled applications, e.g. Java given a sufficiently clever JVM, where assumptions can be dynamically adjusted at runtime?

(Edit: As opposed to an AOT compiler.)

acdha · on Feb 4, 2023

This was the grand hope but it never panned out. It’s possible now that a sufficiently brilliant compiler could make a difference since there was nothing like LLVM at the time and GCC was far less sophisticated. One of the many acts of self-sabotage Intel committed was insisting on their hefty license fees for icc, which meant that almost all real-world comparisons were made using code compiled using GCC or MSVC, which were not as effective optimizing for Itanium. There’s no way they made enough in revenue to balance out all of those lost sales.

The other point in favor of this approach now is that far more code is using high-level libraries. Back then there was still the assumption that distributing packages was hard and open source was distrusted in many organizations so you had many codebases with hand rolled or obsolete snapshots of things we’d get from a package manager now. It’s hard to imagine that wouldn’t make a difference now if Intel was smart enough to contribute optimizations upstream.

ghaff · on Feb 4, 2023

Yes. Open source, high-level libraries, SaaS/Cloud, good dynamic translation (e.g. Rosetta), etc. make the sort of backward-compatibility that Intel/HP failed so miserably in providing much less of a big deal today. One of the driving forces behind Itanium was that, not only was developing custom microprocessors and OSs for a single company expensive, but even once you'd made that investment, ISVs were reluctant to support your low volumes for any amount of love and money.

acdha · on Feb 4, 2023

It’s definitely interesting looking at ARM now. It’s helped by having consistently had much better price/performance but also the fact that things like phones meant a ton of the primitives people would need to switch server applications were already taken care of. Intel really would have been better off cutting their marketing department and hiring 50 more developers to work on open source like GCC, OpenSSL, Linux, etc.

ghaff · on Feb 4, 2023

Intel actually has a lot more software development than they're generally credited with. It's mostly "just" hardware enablement but when the Linux Foundation was still providing external numbers on Linux kernel contributions by company, Intel was one of the very top contributors.

With respect to ARM, Intel pretty much blew it, especially on mobile. They were so determined to exploit their x86 beachhead. I remember at an IDF, they were even trying to make a case for how it was important to run x86 everywhere so that Flash would run consistently.

acdha · on Feb 4, 2023

I’ve often felt that Intel’s embrace of open source is someone wanting not to repeat the Itanium loses. They seem to have a much better relationship with key open source projects now.

inkyoto · on Feb 5, 2023

> […] and GCC was far less sophisticated

With all due respect, this is simply not true. Especially for the time, GCC was the most sophisticated compiler out there, and the only one that could be easily retargeted to a new or another platform due to the use of the intermediate representation language (IRL). A new code generation backend could be boostrapped within days using the IRL. Cross-compilation for any supported target platform whilst running on the same host was also only possible with GCC. There were no other known precedents at the time (I am not counting pcc, the portable C compiler, due not being comparable to gcc).

In terms of the code generation, GCC was quite up there as well albeit performance and quality of the generated code varied across platforms, sometimes wildly. E.g. the native Sun C compiler generated consistently faster code for SPARC (although their C++ compiler that they had acquired from a third party was buggy as hell).

For Itanium, the GCC Itanium backend was not efficient, and it was a well known problem. On 32-bit x86, GCC generated faster and better code than most commercially available C/C++ compilers with a few exceptions being Intel and Metaware High C compilers (both of which not being widely available and being exorbitantly expensive for a average developer) and being comparable or faster to the Watcom C compiler (Watcom did have an edge over GCC on producing much faster floating point code and the default C struct alignment rules, and GCC had an edge due to allowing control over how many CPU registers could to pass input parameters into a function). Id Software, with the release of Quake 1, had ditched the Watcom C compiler and replaced it with GCC (DJGPP) due to GCC generating better code (I think John Carmack wrote about it at some point). GCC did not support Windows well, though.

GCC and LLVM have both been very sophisticated compilers albeit pursuing different objectives. LLVM appeared due to disagreements over the GCC licensing that RMS was insisting on – to preclude GCC from becoming extensible and allow 3rd parties to produce closed source plugins. So, LLVM was conceived as a modularised and extensible design being more conducive to research, experimentation and extensibility at the expense of supporting fewer platforms. GCC and LLVM have both eventually caught up and have now largely reached the feature parity with each other (with GCC still supporting a larger number of platforms and being a go to choice for embedded development).

acdha · on Feb 5, 2023

My point in that sentence was that GCC in the late 90s was less sophisticated than it is now. As you noted it was also not the best for Itanium (also Power and, if memory serves, Alpha) which meant that the large amount of software which was compiled with it for compatibility & ease of support look disproportionately worse. For Itanium that was a huge problem since it relied so heavily on the compiler.

josefx · on Feb 4, 2023

Wasn't icc rather popular at some point? AMD later suffered in benchmarks where it came out that icc generated code ignored CPU feature flags when the CPU vendor id did not match Intel.

acdha · on Feb 4, 2023

That depends on how you define popular. It was never used for a majority of compiler runs since e.g. no open source project could use it but performance-sensitive users definitely licensed it.

One concern I remember was correctness: one company I worked at didn’t find a benefit worth dealing with a second compiler’s quirks and IIRC some scientists I supported evaluated it but never used it because some of their model output varied (classic floating point drift).

jonhohle · on Feb 4, 2023

Wasn’t that always the issue with Itanium - that it could have been fast with a sufficiently clever compiler? The problem seemed to be no one was clever enough to write that compiler.

ptx · on Feb 4, 2023

That's what I've heard as well. But a JIT compiler (like the JVM) might not have to be as clever as an AOT compiler, as it can change its optimization decisions later, so perhaps that might have been more feasible?

formerly_proven · on Feb 4, 2023

PA-RISC seemed fairly neat from the somewhat limited information you can find online (there’s a 1.1 and 2.0 ISA manual on kernel.org). Where there major issues with the ISA? Killing your entire product line and starting over of course rarely worked.

rsaxvc · on Feb 4, 2023

The PA-RISC 1.1 ISA encoded specific implementation details that didn't age well, like the branch delay slot and instruction address queues. And required in-order memory accesses, because there wasn't support for cache coherent IO.

ghaff · on Feb 4, 2023

And, at a higher level, Unix vendor-specific processor designs were on the way out. Designing and manufacturing a processor just for your relatively low volumes in the scheme of things Unix systems was just way too expensive.

One of the other problems with Itanium was that it was supposed to be an "industry standard" 64-bit processor. But Intel and HP were never quite able to square that with a situation where HP at least saw themselves as more equal than others given their role in the design.

pjmlp · on Feb 4, 2023

HP-UX 11 was one of the UNIXes I worked on (1999 - 2002, 2005), and the only issue I had, at least during the first employer was their ongoing transition to 64 bits, and the C compiler we had available (aC) was a mix of K&R C with some ISO/ANSI C compliance.

I don't recall any issues with the ISA, and we really liked using its early container capabilities (HP Vault).

kjs3 · on Feb 5, 2023

but AMD wouldn't have mattered if Itanic had been faster, cheap, and on time

I dunno about that.

My own personal opinion is that Intel has never been able to re-architect their way out of the fact that the cornerstone of their success is that they are selling x86, and their customers mostly don't care about the theoretical advantages of the bright shiny new thing. They just want to run their software just like they always have. There's a reason why IA-64 has joined iAPX432, i860, StrongARM and i960 as Intel footnotes (outside of the embedded market).

When they were philosophizing about what Itanic should look like, the only thing that x86 obviously needed from a market perspective was a bigger address space. And AMD was smart enough to deliver on that, and here we are.

noNothing · on Feb 5, 2023

Itanium didn't kill off Alpha. Intel x86 pricing did. But the most important unheeded lesson in those days was software compatibilty. We went from the days of each computer having its own word length, instruction set, heck, even data format (remember the endian wars?) to source compatibility to binary compatibilty. We learned that for most useage, software stability that allowed taking advantage of Moores law, was seriously more valuable in most cases than gaining a bit more performance or price/performance by changing architectures.

Intel kept the X86 price at a point where no bean counter would favor investing in new architectures. Fortunately AMD broke the headlock on x86.

sliken · on Feb 6, 2023

Well Intel's original plan was to keep x86 32 bit, forcing anyone that needed more into IA64. Fortunately AMD came out with x86-64, and when it was clear that IA64 wasn't going to be competitive, Intel brought x86-64 to their chips.

killerstorm · on Feb 4, 2023

If you look at SPEC CPU benchmark, Itanium was not bad at all in terms of instructions-per-cycle. IIRC in fp performance it could beat Netburst Pentium4 running twice its clock speed, and even compared favorably to Core. I.e. if Intel produced Itaniums which ran at the same clock speed as Core CPUs, it would be world's fastest single-threaded number cruncher.

So I don't really buy the "Itanium is bad architecture" story.

It's fate was probably decided around 1999-2000. At that point Itanium was still pretty good against Pentium 3 and Pentium 4. And name "IA-64" indicates Intel didn't plan to make 64-bit Pentiums. So eventually Pentiums would fill low-end segment while the rest would be occupied by 64-bit Itaniums.

AMD killed that plan by releasing AMD64 architecture. It was an obvious upgrade to x86, so it would clearly do better in the market than IA-64. So Intel decided to go for x86-64 too, and Itanium was doomed at that point. They didn't even bother making Itaniums with same clock speed as Xeons.

So it's definitely possible that if AMD decided to stick to 32 bits at that time, Intel would have pushed optimized IA-64. Also AMD64 could be worse than it is. E.g. if they decided to increase only register size but keep the number of registers the same, IA-64 could still come on top.

giantrobot · on Feb 4, 2023

> AMD killed that plan by releasing AMD64 architecture. It was an obvious upgrade to x86, so it would clearly do better in the market than IA-64.

One of the best aspects of the Opteron was it also happened to be a fantastic 32-bit CPU in addition to AMD64. This was a period where a lot of software, even FOSS wasn't 64-bit clean. There was a lot of pointer arithmetic hiding deep in libraries that were assuming pointers would always be 32-bits.

The Opteron running a 32-bit OS at least as well as a 32-bit Athlon was a huge point in its favor. So your existing system running on new Opteron hardware ran fine and you could mix and match Xeon and Opterons in a fleet. Then switch over to 64-bit on the Opterons for (hopefully) better performance.

acdha · on Feb 4, 2023

One thing to remember is that FP performance with well-scheduled code was by far its strongest performing area, and Intel put a lot of work into tuning their compiler for those specific tests. The problem was that it fell off heavily the less your code is like that, especially for the branchy code most business apps depend on.

The other big problem was that the x86 compatibility story was worse than the earlier hopes. That meant that it not only wasn’t competitive with the current generation competition but often even the previous or worse - note losing to the original Pentium or even a 486 here:

https://tweakers.net/reviews/204/8/intel-itanium-sneak-previ...

Now, they could have improved that but statistically nobody was going to pay considerably more for lower performance in the hopes that a future update would improve matters.

The Athlon and Opteron weren’t just fast, they also had flawless 32-bit support so even if your 64-bit software update never happened you could justify the purchase based on their price/performance.

rocket_surgeron · on Feb 4, 2023

>I have an Itanium in the garage, a monument to hubris.

I love monuments to hubris so I too have one in my garage-- a four-node SGI Altix 350.

xorcist · on Feb 4, 2023

Itanium can be regarded as a huge success. Maybe some don't remember, but the minicomputers and unices were where all the money was. Sure, Intel had the process edge, and had completely cornered the microcomputer market. But PCs had slim margins and didn't really matter in the bigger scope important data processing.

Many of the computer nerds watched in awe as vendor after vendor dropped their hugely expensive and engineering heavy custom CPU architectures and lined up behind Intel. IBM was the only big player who didn't swallow the bait. "Even if they fail, that's a huge success" was a common observation at the time.

And sure enough. I don't think they failed on purpose, but business wise it was a win-win situation. The x86 architecture would have won anyway, because of the sheer scale, but the Itanium wreckage hastened it. Everyone needed to move, so why not move to x86/Linux directly?

andromeduck · on Feb 4, 2023

Comment I saw on YouTube a while back:

> In some ways Itanium was the most successful bluff every played in the tech industry. In much the same way that Reagan's Star Wars bankrupted the Soviet Union got almost every single competitor to fold. Back at the beginning of the project, Intel was nowhere in high-end & 64-bit computing. There was HP (PA- RISC), Sun (Sparc), Dec (Alpha), IBM (Power), MIPS (SGI). Intel wisely picked the partner with the stupidest management (Carly) to give up their competitive edge and announce to analysts that Intel's vision/roadmap is so awesome that RISC is dead and that they're going to follow the bidding of their master Intel for their 64-bit plan. Wall Street bought in to the story so much that almost everyone else with competitive chips folded their strong hands to Itanium's bluff - SGI spun off MIPS and MIPS decided to leave the high-end space. Compaq undervalued Alpha and let it die. Sun tried to become a software company and if it weren't for Fujitsu making modern sparcs, sparc would be dead.

> Basically, with nothing but PR and Carly's stupidity, Intel wiped out over half of the high-end computing processor market.

> Thankfully AMD had the vision to see through the bluff, and saw the opportunity for 64-bit computing that worked; and thankfully IBM didn't have someone like Carly around so they saw the value in retaining competitive advantages; or the computing world would be pretty bleak place right now..

neilv · on Feb 4, 2023

For anyone who doesn't know but is curious, "Carly" refers to Hewlett Packard (HP) CEO Carly Fiorina. Fiorina oversaw HP's acquisition of Compaq, which had previously acquired DEC, IIRC. Reportedly, the HP-Compaq acquisition was opposed by many, including board members and family. (This was before my time, but I read a lot of trade rags as a kid, and then later occasionally heard insider stories.)

HP was legendary for culture, like "management by walking around, and talking to the people on the ground", which was different from Fiorina's style.

Compaq was the most noteworthy IBM-compatible PC company, before Dell's dorm room dirt-cheap generic PC clones business skyrocketed into an empire.

DEC was the maker of the PDPs and VAX-based minicomputers on which much of the field of Computer Science was arguably developed, and later MIPS- and then Alpha-based workstations and servers, while also still developing VAXen (the plural form of the word).

All those proprietary CPU ISAs listed (PA-RISC, SPARC, Alpha, POWER), when they were introduced on engineering/graphics workstation computers, were especially exciting, because -- separate from the technical architecture itself -- they would briefly probably be the fastest workstation in your shop. All of these made MS-DOS/Windows PCs and Macs look like toys by comparison (though, eventually, Windows NT 3.51 started to be semi-credible if you just needed to run a single big-ticket application program). And you didn't know what exciting new development would be next.

Maybe it was like if, today, several makers of top-end gaming GPUs resulted in a leapfrogging on a cadence of every few/several months. And if they had different strengths, and, incidentally, curious exclusive game software to explore. Or like the very recent succession of Stable Diffusion, ChatGPT, etc., and wondering what the next big wow will be, what they've done with it, and what you can do with it.

When I knew some Linux developers working on Itanium, some were already calling it "Itanic". (I didn't read much into the name at the time, because were a lot of joke derogatory names for brands and technologies.) Later, I thought "Itanic" was because it was a huge expensive thing that was doomed to sink. The theory in the TFA sounds like most competing ship companies gave up on their own engine designs when they heard how great the Titanic would be.

mrweasel · on Feb 4, 2023

The whole thing is weird. The Itanium is as much an HP project as it is an Intel one. While HP and Intel are busy creating the IA64, Compaq is licensing the EV6 bus to AMD, for use in the Athlon. If the Athlon hadn't been a success, I don't believe that AMD would have had the funds to develop the first AMD64 processors.

Then a few years later, Compaq is bought by HP, which does nothing with the remaining DEC/Alpha IP, the same tech that helped AMD build the Athlons.

giantrobot · on Feb 4, 2023

> Compaq is licensing the EV6 bus to AMD, for use in the Athlon.

AMD was also hiring away many Alpha developers IIRC.

matwood · on Feb 4, 2023

I used to lust after the Alpha a friend of mine had. The raw compute performance increase from year to year was amazing during those times.

neilv · on Feb 4, 2023

A grad school officemate had previously worked at Thinking Machines, so was familiar with exotic supercomputers, but I think all the compute for his dissertation ran on a blue Alphastation or Alphaserver "footrest" under his desk.

mepian · on Feb 4, 2023

The HP-Intel joint effort to develop what became the Itanium was announced 5 years Fiorina became the CEO of HP. During that time she was working at AT&T/Lucent and had zero input in HP's strategy.

mepian · on Feb 4, 2023

*5 years before

Typed it on a phone and noticed the error only after the edit timer expired.

xorcist · on Feb 4, 2023

The comparison to star wars is certainly apt. It doesn't need to work, in the engineering sense, to be useful. Sometimes economics can trump engineering.

An important part of the PR machinery was that by picking up a hot topic from academia, they really got absolutely everyone to talk about VLIW as the next genreration RISC. And everyone already knew that RISC was superior and x86 was a toy, but which also was mostly true at time.

In the end, what won was huge caches and huge OoO pipelines. Linus Torvalds had some strong opinions and well known opinions on this, which turned out to be mostly right.

pavlov · on Feb 4, 2023

It’s pretty weird to frame it as “Carly’s stupidity” when all of HP’s competitors mentioned ended up worse. Smacks of misogyny to be honest.

slacka · on Feb 4, 2023

Do you know anything about HP's failure under her rule? I don't care about her gender, but I do know for a fact that they lost market share and had a massive brain drain under her tenure. Her massive layoffs included axing their R&D. Why don't you read this before jumping to conclusions?[1] It identifies many of the issues I saw first hand during that period, while still trying its best to find reasons to praise her.

[1] https://www.mercurynews.com/2010/04/20/analysts-carly-fiorin...

pavlov · on Feb 4, 2023

The comment to which I was replying claimed that “Carly’s stupidity wiped out the high-end processor market.”

Whatever her other faults as a CEO, that’s just not what happened with the Itanium. The writing was already on the wall for high-end Unix in the mid-1990s.

HP teamed up with Intel and had them take over the bulk of R&D expense with HP continuing to extract profits from the shrinking market for over a decade. Meanwhile the competitors DEC and SGI and Sun basically went out of business. (IBM of course retained its niche as the only choice for those who only buy IBM.)

nomel · on Feb 4, 2023

This would have been a more appropriate comment, than calling someone misogynistic.

pavlov · on Feb 4, 2023

A misogynistic tone is recurrent in online comments about Carly Fiorina’s time at HP, and in my opinion the comment blaming her stupidity for Itanium was in that vein.

Nobody talks about Sun’s contemporary leadership using phrases like “that dumb hockey jock Scott ruined Sparc.” Somehow it’s ok when the CEO was a woman.

KerrAvon · on Feb 4, 2023

These two things are not the same.

You don’t hang around with enough ex-Sun people if you haven’t heard derogatory comments about McNealy. But his ultimate failure at Sun wasn’t the same scale, and Sun was never as well managed or universally revered as HP.

ghaff · on Feb 4, 2023

I guess I hang around with different ex-Sun people because, however Sun ended up eventually, they're all pretty praising of McNealy and Sun's culture.

One thing McNealy did get right is that Sun was pretty much the only one of the large Unix vendors that wasn't at least preparing for the possibility of an all-Microsoft future with NT. (IBM was arguably placing more of a small side bet that execs like Mills didn't really believe in but almost no one besides Sun dismissed NT out of hand.

pjmlp · on Feb 4, 2023

Yet Solaris is gone for all practical purposes, while Aix is still seeing new sales beyond plain maintenance, with increase in 2022.

acdha · on Feb 4, 2023

I know that she was blamed for that considerably more than her male predecessors who set them on that trajectory. She definitely isn’t blameless but I would pause to question why so many men are so quick to shift the blame to the only woman available as a scapegoat.

I have an acquaintance who has been at HP for ages and his characterization is more that she was left holding the bag.

ghaff · on Feb 4, 2023

I remember an internal email thread about HP and Fiorina at the analyst firm I used to work at and, at one point, one of my colleagues wrote with exasperation "What would you have them do? Bring back Lou Platt?"

Hurd did seem to right the ship when he took over. But, to the degree many of us didn't really recognize at the time, a lot of that was financial engineering and eating of seed corn.

KerrAvon · on Feb 4, 2023

Talk to the HP engineering staff who were there. It was irreparable, and it was Fiorina who did it. Or so they generally claim.

TheOtherHobbes · on Feb 4, 2023

HP could have killed the PC market with Alpha, spending all that Itanic development money on transitioning PCs from x86 to Alpha via an emulator, and then promoting native software. Apple have pulled this trick three times now, with great success.

Instead Intel/HP nuked the entire mid/high-end of the industry including their own project and set computing back by a decade or so.

She was also a notoriously terrible CEO for other reasons. And then tried to jump-start a political career with one of the worst campaign videos ever made.

https://www.youtube.com/watch?v=rKWlOxhSIKk

dan-robertson · on Feb 4, 2023

I would be curious to read what people at DEC/Compaq/HP thought at the time about this because presumably people on the Alpha team would have thought of this idea. IIRC the Alpha could run x86 (with automatic translation) faster than the latest Intel chips[1] but then Intel got sufficiently good at the whole out-of-order thing (and I guess at the process of making chips in general) that they took the lead. Maybe there are good reasons that the people working on the Alpha thought they couldn’t win?

I’m particularly interested in the Alpha because it seems like the thing was designed with many of today’s CPU performance challenges in mind. E.g. simple stuff like 64-bit but also things like caches and multiprocessing (cf the very weak concurrent memory model). See also [2]

[1] eg see this from ’97 claiming that 0.5GHz Alpha could run x86 at comparable speed to a 2GHz PII https://www.usenix.org/legacy/publications/library/proceedin... but I’ve also seen claims that the Pentium Pro of ’95 would have been faster than this

[2] https://dl.acm.org/doi/10.1145/151220.151226 and I think the same here: https://www.hpl.hp.com/hpjournal/dtj/vol4num4/vol4num4art1.p...

scrlk · on Feb 4, 2023

Re: Alpha team - as a side note, two ex-Alpha engineers played a pretty pivotal role in the processor landscape we have today:

Dan Dobberpuhl - founded PA Semi, acquired by Apple in 2008 and kicked off Apple's custom processor strategy.

Jim Keller - arguably the most famous processor architect alive today (Athlon 64, AMD Zen, Apple A4/5).

ghaff · on Feb 4, 2023

Alpha was six feet under by the time HP acquired Compaq which had acquired DEC.

Computing was in no way set back by a decade. The alternative to Itanium was, as Gelsinger has said publicly, an enhanced x86 Xeon--which is what Intel ended up doing (and which HP subsequently adopted to run HP-UX and it's other enterprise OSs).

Obviously ARM has won out over x86 on mobile and--in a limited way--on the desktop. ARM's footprint will probably increase. We'll see. Then there's RISC-V. But that's all basically RISC.

mepian · on Feb 4, 2023

Indeed it is thinly veiled misogyny, Fiorina joined HP just around Itanium's missed release date. I don't consider her a great leader, but there is no need to blame her for every failure of HP, the Itanium was conceived and almost completed by her predecessors.

avhception · on Feb 5, 2023

Wrongly assigning blame happens every day regardless of gender. While I'm sure there are many cases where it's done out of a misogynistic mindset, accusing someone of misogyny based on nothing more than the circumstance she is a woman and was wrongly assigned blame just rubs me the wrong way.

acdha · on Feb 4, 2023

It was a humiliating failure, not a success. The trend against those other architectures was already clear: as processor complexity went up, the costs of building them skyrocketed and most of those companies had no plan for the kind of volume you’d need to support them. Don’t forget that applied on both the hardware and software sides: a competitive compiler and optimized libraries were important.

This is why Itanium got traction: everyone knew that you needed volume to stay in the game. IBM had a strategy to get that with Apple & Motorola (PowerPC started in 1991), but HP did not have anything like that for PA-RISC. DEC might have gotten there if they’d had a more aggressive partner for the lower-end Alpha strategy but the merger killed any chance of that.

Since x86 was rising so fast, it might not be clear why Intel got involved. That goes back to the licensing rights: they couldn’t prevent companies like AMD from competing directly with them. Itanium was the attempt to close off that line of competition legally and they were willing to attack their own product margins to do it.

ghaff · on Feb 4, 2023

I never got anyone from Intel to actually admit to the licensing angle but it was pretty widely speculated about at the time.

acdha · on Feb 4, 2023

Good point. I don’t have any inside information but it was certainly common to characterize it as a long reaction to things like losing that 386 lawsuit.

ghaff · on Feb 4, 2023

Yeah. Whether or not everyone was smart enough to never actually write down such a thing in an email or memo, you know there were execs who were keenly aware of this and, even if not the deciding factor, presumably helped influence the decision.

jojobas · on Feb 4, 2023

Doesn't quite add up. SGI folded their CPUs (thanks Mr Belluzzo) before Itanium was even released, Sun offered their x86 server about the same time (and kept their SPARCs).

HP was the only casualty to Itanium, but that was self-inflicted.

sliken · on Feb 6, 2023

Yes, but the CPU pipeline is years long. So years before Alpha, Sparc, MIPS, PA-Risc and related CPUs had to decide if the R&D for a next generation CPU made sense in the face of the announced Itanium which most believes was going to dominate the server industry.

When the Itanium shipped years late and slower than expected it was too late for any of the competition (except Power) to recover. Granted the x86-64s were ramping up and they would have all had tough competition, even without Itanium.

inkyoto · on Feb 4, 2023

To be completely honest, MIPS CPU's had never been known for their speed – not until the release of the 64-bit MIPS architecture anyway when they finally became competitive with other RISC CPU's, but it was a complete ISA redesign.

Sun was in a somewhat similar boat with the SPARC v8 architecture, and they were rather late with UltraSPARC (SPARC v9 ISA). Yet, they managed to hold out longer due to having a switched memory controller and a very wide memory bus, which allowed them to become the best hardware appliance to run the Oracle database (despite being less performant), and divert the cash flow into the UltraSPARC development. UltraSPARC I was underwhelming, and with UltraSPARC II they finally caught up with other RISC vendors and gradually started outperforming some (e.g. MIPS) in some areas.

Amusingly, the 512-bit wide memory bus has made a comeback in Apple M1 Max laptops (laptops!), and M1 Ultra has a 1024-bit wide memory bus.

Why HP went all in with ditching their own perfectly fine PA-RISC 2.0 architecture is an enigma to me tho.

hulitu · on Feb 4, 2023

> To be completely honest, MIPS CPU's had never been known for their speed

SPEC would like to dissagree with you.

inkyoto · on Feb 5, 2023

To be crystal clear: my statement pertained to 32-bit MIPS CPU's and MIPS CPU's predating the R8000. They were very slow. 32-bit MIPS CPU's did not have multiply and division instructions, they had branch delay slots that are unwieldy for a compiler to generate the efficient code for, plus other stuff. Early 32-bit MIPS implementations did not even have a hardware TLB (it was implemented in the kernel), which made the context switching between kernel and user spaces slow.

R8000 (MIPS IV) was fast and later MIPS64 CPU's were very fast, especially on floating point operations, and consistently outperformed competing 64-bit RISC and x86 CPU's because the MIPS64 was an ISA redesign that addressed and fixed many of the problems of the 32-bit version of the ISA.

jojobas · on Feb 5, 2023

He has MIPS as not known for their speed and SPARC getting ahead as a major achievement in the same post.

inkyoto · on Feb 5, 2023

That is a very frivolous interpretation of what I said.

I was comparing: a) 32-bit MIPS CPU's with 64-bit MIPS CPU's, b) 32-bit SPARC v8 and 64-bit SPARC v9 (UltraSPARC) CPU's, and c) performance of 32-bit RISC CPU's comparatively to each other. 32-bit MIPS and SPARC v8 CPU's were slow, with MIPS32 being one of the slowest across the entire board.

I was not comparing MIPS64 to UltraSPARC II or III because MIPS64 implementations (especially R10k and R12k) were exceptionally highly performant, especially in numeric computations that UltraSPARC CPU's were not known for at the time. UltraSPARC II/III systems were renowned for very high, sustained overall system throughput, and nor for high CPU computational performance.

At the time, if one wanted a number crunching beast, they had a choice of either MIPS64, or PA-RISC 2.0, or POWER CPU's. Mostly either MIPS64 or PA-RISC 2.0 (I am not including DEC Alpha – another early performance contender – because it perished too prematurely in the acrid belly of Compaq/HP acquisition shenanigans and did not get a chance to advance past 21264).

jojobas · on Feb 4, 2023

> Why HP went all in with ditching their own perfectly fine PA-RISC 2.0 architecture

Again, thanks Rick Belluzzo.

bell-cot · on Feb 4, 2023

> ...can be regarded as a huge success...

So long as one puts big, fat "giga-money-losing" and "humiliating" disclaimers on "success", then yes.

Vs. - what if, instead of Itanium, Intel had more-quietly designed and delivered good, high-performance x86-64 CPU's? I'm thinking that, by bottom-line metrics, would have been a vastly more successful business strategy.

dsr_ · on Feb 4, 2023

Well, that's what AMD was doing. Intel's pride couldn't let them legitimize that strategy.

Meanwhile, ARM was designing little low-power RISC toys that were obviously no danger to Intel at all.

detaro · on Feb 4, 2023

Itanium was announced (and supposed to have shipped!) well before AMD announced x86-64

bell-cot · on Feb 4, 2023

THIS. Per Wikipedia:

"AMD originally announced AMD64 in 1999[14] with a full specification available in August 2000"

vs.

"In June 1994 Intel and HP announced their joint effort to make a new ISA..."

_ph_ · on Feb 4, 2023

Of course the whole story is way more complex, but I have been saying for years that the Itanium might have succeeded, if AMD had not extended x86 to 64 bit. The 64 bit extensions did not only fix some of the x86 problems (it increased the register count, pushed 64 bit double floats) but it made x86 a choice for the more serious compute platforms and servers.

Back then, the whole professional world had switched to 64 bit, both from a performance and memory size perspective. That is why the dotcom time basically was based on Sparc Suns. The Itanium was way late, it still was Intels only offering in the domain. Until x86-64 came and very quickly entered the professional compute centers. The performance race in the consumer space then sealed the deal by providing faster CPUs than the classic RISC processors of the time, including Itanium.

It is a bit sad to see it go, I wonder how well the architecture would have performed in modern processes. After all, an iPhone has a much larger transistor count than those "large" and "hot" Itaniums.

rythie · on Feb 4, 2023

The industry was already moving away from the big 64 bit SMP machines made Sun, SGI & IBM. In many cases a cluster of 32bit x86 machines made more sense than one expensive big machine with high priced support contracts and parts. 32 bit x86 machines already supported more than 4GB total memory with PAE, it was just that one process couldn’t use more than 4GB. Other 64bit chips were already well established (SPARC, POWER, MIPS), probably for most of the users they couldn’t easily move to a new CPU architecture. For other users by the time they needed the bigger machines x86 64bit was already available, including from Intel themselves. AMD was limited 8 sockets from what I remember, so their was still a small market for big Itanium systems (like SGI’s Altix).

rwmj · on Feb 4, 2023

At the time there were computers containing Alpha chips with quite a PC-ish design. I nearly bought one, so they were semi-affordable. They ran Linux well. So it seems a bit more likely that these might have succeeded if AMD hadn't extended x86.

What you also have to remember is that Itanic was a very weird architecture. It's hard to write compilers for it, and it made the cardinal error of baking microarchitectural decisions into the ISA.

inkyoto · on Feb 4, 2023

Yeah, by mid-2000s most RISC workstations (Alpha, Sun, SGI, PowerPC tho not IBM POWER and certainly not the top of the range datacentre class servers) had converged on the mainstream PC architecture, e.g. a PCI bus, EIDE disk controllers, standard PC memory (168-pin DIMM's and such). I had a Sun Ultra-10 as the sole «PC» at home for some years running Solaris and later Linux. After quickly getting fed up with the standard Sun keyboard, I bought a generic, no-name USB PCI card, plugged it into my Ultra-10 and connected a Microsoft Natural keyboard. It just worked, with Solaris not even requiring extra drivers or kernel modules by virtue of being a USB keyboard.

pjmlp · on Feb 4, 2023

That is also my point of view, Itanium only failed because Intel's competition was allowed access to x86 design and improve upon it.

Had it not been the case and Intel alongside its OS partners would have managed to push it no matter what.

flohofwoe · on Feb 4, 2023

In that case, another more proven RISC architecture like Alpha would have replaced x86. At that time, the only reason why x86 was still "competitive" was the enormous amount of x86 software (specifically Windows software). If Microsoft would have to switch to another ISA anyway, there wasn't really a reason to bet on something as risky as Itanium.

moomin · on Feb 4, 2023

I mean, that happened and is still happening. ARM dominates markets that didn’t exist at the time, and is constantly chipping away and x86s strongholds.

pjmlp · on Feb 4, 2023

For what exactly, it isn't as if Solaris was winning any desktop usage.

Microsoft and HP were already on Intel side, also Microsoft already had experience with JIT compiling x86 thanks to their collaboration for Windows NT on Alpha.

flohofwoe · on Feb 4, 2023

Yeah, but once Microsoft would have realized that Itanium was a performance dead end, and AMD wouldn't have jumped in as the saviour of x86, I bet Microsoft wouldn't give a shit about their good Intel relationship any longer and instead move to Alpha.

xattt · on Feb 4, 2023

I would argue that it would have gone to PowerPC given their decision with the Xbox 360.

pjmlp · on Feb 4, 2023

Alpha was one of the first architectures that they left behind.

whatisyour · on Feb 4, 2023

Itanium failed because of VLIW instruction set, not because of any other reason.

pjmlp · on Feb 4, 2023

Without alternatives, Windows and UNIX partners would have pushed it no matter what.

flohofwoe · on Feb 4, 2023

If x86 compatibility would have been left behind anyway it would have opened up Windows to Intel competitors (I remember that PA-RISC was pretty hot at the time, and Alpha was also still quite revelant, although that was in the years right before the Itanium). AMD essentially saved Intel from becoming the new IBM.

pjmlp · on Feb 4, 2023

HP was also moving into Itanium.

flohofwoe · on Feb 4, 2023

To the massive disappointment of the last Amiga die hards ;)

acdha · on Feb 4, 2023

In addition to Alpha, PowerPC was shipping in volume, reasonably competitive with a high-end option, and Windows NT had already shipped support for it along with MIPS.

The selling point for Itanium was compatibility but when they failed so badly at that it leveled out the field since you were going to have to recompile anyway.

pjmlp · on Feb 4, 2023

Outside Apple and IBM, no else cared for PowerPC in any significant size for desktop users.

Windows NT originally shipped with support for all major CPUs targeted by UNIX workstations, yet all of them faded away until Itanium.

trelane · on Feb 4, 2023

> Windows NT originally shipped with support for all major CPUs targeted by UNIX workstations, yet all of them faded away until Itanium.

Sure, and as any Linux advocate will tell you, most folks on Windows are stuck there due to the proprietary applications that only run there. They don't care much about the OS, but they need the apps that they know and which have their data locked away.

These apps didn't run on the other processors, so Windows on other arches was mainly a curiosity.

pjmlp · on Feb 4, 2023

Actually they did on Alpha, and Microsoft has just recently did the same for ARM. They would have done the same for Itanium if AMD64 had not happened.

https://en.m.wikipedia.org/wiki/FX!32

https://learn.microsoft.com/en-us/windows/arm/apps-on-arm-x8...

acdha · on Feb 4, 2023

Yes, but that was largely due to Itanium’s initial optimism. My argument would be that had Intel’s x86 line also faltered, you’d have seen a lot more interest in those alternatives which had much better price/performance and also simply things like a suitable range of parts (not many people want to listen to a huge power supply on their desk). I don’t think there’s any path where people would have plunged ahead with VLIW without a radically better compiler scene.

raverbashing · on Feb 4, 2023

Maybe. Who knows. Maybe PAE could have been stretched further

Or maybe, just maybe one of those other vendors would have gotten their head out of the sand and made a 64bit processor that ran windows. But I think Wintel was set too deep to anyone challenge that

whatisyour · on Feb 4, 2023

I disagree. There was ARM which would have taken over the mantle. And given the recent wins of M1 just due to abilty to decode instructions better, x86 would have died much earlier.

fanf2 · on Feb 4, 2023

At that time (early 2000s) there wasn’t a high-performance ARM design. There had been StrongARM in the late 1990s, but it was sold by DEC to Intel, who killed it off. The designers moved to AMD to work on amd64.

hydroreadsstuff · on Feb 4, 2023

And because Itanium's performance was worse or at least not much better than their x86 parts.

pkphilip · on Feb 4, 2023

And price

indymike · on Feb 4, 2023

> x86-64

That came after AMD64. It was Intel trying to prevent brand damage by saying their AMD compatible cpus were not knock-offs of a competitor that had plagued them with knock-offs.

hakfoo · on Feb 4, 2023

Actually, it was originally called x86-64. The AMD64 branding came around about the time the chips were actually released.

Back around 2002 or 2003, I sent away to AMD for a set of reference manuals, got back a nice five-volume set which said "x86-64" everywhere and a little note in the box saying "where it says x86-64, read AMD64"

gmokki · on Feb 4, 2023

Does anyone remember when first Intel processors with AMD64 support came out Intel called it "IA32e" to downplay the importance. To differentiate it from the IA32=x86 (32bit) IA64=Itanium

tadfisher · on Feb 4, 2023

I remember it as "EM64T", but Wikipedia tells me that is just the marketing name for IA32e. TIL!

mpol · on Feb 4, 2023

AMD was more like the last straw that broke the camel's back.

Itanium was already in trouble. It was hot (really hot) and underperforming. It wasn't selling really well, since it was too expensive. Of the UNIX vendors, only HP was left standing behind Itanium. IBM had already long pulled out of Itanium and also out of Monterey, the UNIX that would unify all unixes.

AMD64 was the light that suddenly came shining and everybody knew that was where everybody was going.

threatripper · on Feb 4, 2023

Itanium is something that sounds good in theory but didn't quite deliver in practice. It is my impression that this is how things usually turn out when you do top-down design of a new product. Maybe my impression is wrong and it has nothing to do with top-down design and it's just that the majority of products fail. Maybe 99 out of 100 fail and you have to launch 100 to have a statistical chance to get 1 winner. Intel didn't launch 100 different processor designs. They surely have a dozen designs that got killed early but then they focused most of their energy on one design and that one failed. Not enough good luck probably.

adrianmsmith · on Feb 4, 2023

I'm sure you're right in general - but I would say that a counterexample, or the 1 in 100 top-down design of a new product that did succeed, is the iPhone.

dsr_ · on Feb 4, 2023

The first iPhone appeared after five generations of iPod, and (I believe) was co-developed with the iPod Touch, the sixth generation.

ghaff · on Feb 4, 2023

Furthermore, there was no compatibility expectation with phones. In fact, there really weren't third-party apps for phones to any meaningful degree. Other than nerds, who knew what processor was in their Blackberry or Treo?

Apple did reinvent the smartphone. But, like the iPod, it didn't fully hit its stride for a few years.

fishtacos · on Feb 4, 2023

Is it really co-development if it's the same device lacking a GSM radio?

lp4vn · on Feb 4, 2023

If I remember well what I learned in college, the organization of x86 processors is superescalar, which means that the processor uses an internal statistics mechanism for predicting the subsequent code that's going to be executed. Itanium on the other side used the VLIW architecture so that the instructions are optimized for a long vector of execution of instructions already in compilation time.

I always found the idea behind the VLIW processor architecture to be a quite good one to be honest, but I read many engineers in many places saying that it's a bad one and it was doomed from the beginning.

The article says that the death of Itanium is mostly due to the disinvestment in the IA-64 caused by the threat of AMD overtaking the x86 market, even though the competition for the x86 market probably benefited all of us, I still find it a bit sad that there was this loss of architectural diversity and sometimes I wonder how well the Itanium would perform today if it weren't killed.

wtallis · on Feb 4, 2023

Superscalar just means the processor can execute more than one instruction at a time (in parallel, not just pipelined). The main distinction between VLIW and mainstream x86 is in how the instruction execution is scheduled: x86 cores track dependencies between instructions, and most support reordering instructions to deal with stalls from things like cache misses or the next instruction being of the wrong type for any of the free execution ports. VLIW relies on the compiler to schedule instructions, so that the hardware does not need to do dependency tracking.

devit · on Feb 4, 2023

And the problem is that such a thing doesn't really work on mainstream CPUs, because memory access instructions take a highly variable amount of cycles depending on which cache level the data is in, which the compiler cannot know, and generating code for all possibilities leads to exponential blowup of the source code size.

It's not clear how Intel thought it could possibly work.

dw-im-here · on Feb 4, 2023

predication and queues can go a long way

inkyoto · on Feb 4, 2023

Out of three VLIW architectures I have looked into, 2x (Elbrus and Itanium) rely on the predication heavily. i860 does not have instruction predicates.

Predication places the burden of creating optimal instruction bundles AND the correct hinting via the use of predicates on the compiler. If stars aligned, the code could perform blazingly fast. It turned out that aligning the stars in an optimal space time sequence was an arduous task due to the actual hints only being available at the runtime.

Which is where JIT has delivered well (and cheaper!) without requiring a radically different VLIW design.

geertj · on Feb 4, 2023

Fundamentally it seems though that more information is available at run time? You may get partway there in the compiler, but assuming you have sufficient transistor budget, it seems more optimal to do reordering in the CPU.

zozbot234 · on Feb 4, 2023

The runtime doesn't know all that much, though. All it has is a single instruction flow, that it can extract fine-grained parallelism from and try to speed up further via speculation. Nothing whatsoever about other work that may be scheduled in when the processor is stalled by memory, other than via SMT. Nothing about priorities or coarse-grained dependencies among work units. So there's a lot of parallelism that's left on the table, and a lot of speculated work that might just be wasted.

inkyoto · on Feb 4, 2023

> The runtime doesn't know all that much […]

If we are talking about JIT, yes, it does, for it instruments the runtime, gathers the information about hot code paths and performs the in-place optimisation. Think of the profile guide compile time optimisation having been carried over into the runtime.

wrp · on Feb 4, 2023

> ...I read many engineers in many places saying that it's a bad [idea]...

I read an article several years ago from an engineer at...ooh, I think it was DEC or IBM. He said that during development of the Itanium, the Intel guys had talked to them and they advised most strongly that Intel drop the project, because they had been down that road and thought it was a dead end.

dontlaugh · on Feb 4, 2023

VLIW still lives today in various guises. GPUs are the most obvious.

Even the M1 could be argued to be close, it’s a very wide machine.

wespiser_2018 · on Feb 4, 2023

No, not really.

VLIW is considered: Multiple Instruction Multiple Data, in each line of assembly you can send out something like 4 (or 8) instruction each with a different target, and it will work as long as there aren't dependency issues.

GPUs are still Single Instruction Multiple Data (SIMD), for every vector operation you are doing operation: adding vectors, taking a dot production you are only executing a single op at a time.

SIMDs are really close to the RISC/CISC paradigm, and there's various extensions for other types of SIMD processing in different ISAs used today. VLIW is a much different set of assumptions, requiring the compiler to program in the same instruction level parallelism that a superscalar chip will parallelize via it's architectural features (pipelines/branch prediction/et cetera).

fulafel · on Feb 4, 2023

Qualcomm basebands apparently use VLIW in baseband in an architecture they called Hexagon: https://www.llvm.org/devmtg/2017-02-04/Halide-for-Hexagon-DS...

VLIW seems quite suited to DSP / SDR applications.

fulafel · on Feb 5, 2023

Better PDF with architecture picture about where it's used: https://developer.qualcomm.com/download/hexagon/hexagon-dsp-...

More links: https://en.wikichip.org/wiki/qualcomm/microarchitectures/hex...

https://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_micro2014...

https://blog.tensorflow.org/2019/12/accelerating-tensorflow-...

hajile · on Feb 5, 2023

AMD's latest RDNA3 is explicitly VLIW with the VLIW instructions working on SIMD units.

Ironically, they are also running head first into the compiler issues with almost nothing around taking advantage of their dual issue potential.

trelane · on Feb 4, 2023

> GPUs are still Single Instruction Multiple Data (SIMD), for every vector operation you are doing operation: adding vectors, taking a dot production you are only executing a single op at a time.

Sort of. It's both, really. On nvidia at least, the threads in the warp are simd, but between warps it's mimd. And that's before we get into SMs.

flohofwoe · on Feb 4, 2023

Can GPUs really be considered VLIW just because they might share an instruction pointer across a couple of 'micro cores'? AFAIK they went even back from SIMD to scalar a long time ago.

wespiser_2018 · on Feb 4, 2023

No, I don't think so. It's one instruction on multiple data (SIMD). That's the essence of a vector op!

dontlaugh · on Feb 4, 2023

I suppose. They’re still very wide machines, which to me seems like the most interesting aspect of VLIW.

Of course actual VLIW still are around as DSPs.

skavi · on Feb 4, 2023

Recent GPUs have been RISC for some time now. Since AMD replaced their TeraScale architecture.

hajile · on Feb 5, 2023

VLIW has nothing to do with width and everything to do with explicit ILP (instruction level parallelism).

m_mueller · on Feb 4, 2023

.. and then Nvidia brought their GPUs to the clusters and ate both Intel’s and AMDs lunch with a better version of VLIW. The CUDA programming model and hardware IMO is quite successful at abstracting vectors the right way[TM], separating the problem of the grid setup from that of the kernel code (which you can still largely program in a scalar way if you’re not after the last ~30% of performance). IMO a shame that OpenCL never got good.

nullc · on Feb 4, 2023

One of the finest illustrations on all of Wikipedia, perhaps on the entire Internet:

https://en.wikipedia.org/wiki/Itanium#/media/File:Itanium_Sa...

I have itanium on a shelf somewhere-- while I was using it I got to do some assembly level debugging to track down a report a GCC bug I hit. Well worth in entertainment the $50 or whatever I paid for it.

ck45 · on Feb 4, 2023

Nice story, but it doesn't fully add up: "When Intel started Itanium development work in the mid 1990s"

That's more than a decade after AMD releasing their first x86 compatible CPU. Intel was very aware of the threat before they started the work on Itanium. They even tried to hold back AMD with a lawsuit which they lost and ultimately allowed AMD to release their 80386 compatible CPU.

I find it more likely that it failed for similar reasons why iAPX 432 and i860 failed. There just wasn't a market.

beebeepka · on Feb 4, 2023

https://jolt.law.harvard.edu/digest/intel-and-the-x86-archit...

snvzz · on Feb 4, 2023

AMD could, as easily if not more so, be the one to show up with a native RISC-V processor that has a legacy x86 mode.

In the modern landscape, it'd be closer to a v86 mode, where the hypervisor is RISC-V, but user x86 applications can run full speed.

All the legacy PC platform crap can be done away, replaced by RISC-V standard OS-A profile.

als0 · on Feb 4, 2023

Why would AMD do that today though? They are in a privileged position to have an x86 license and zillions of customers asking for an x86. AMD once made some ARM chips and realised they didn’t need to.

jeroenhd · on Feb 4, 2023

Agreed. The only reason I can think of for AMD to set up an alternative instruction set would be a sudden rise in competition from other players. If big advances are made in the RISC-V space that make the architecture a cost effective alternative to amd64 (including the cost of porting software to RISC-V) then I can see them setting themselves up to allow running software on both platforms at native speeds.

I don't think AMD is currently limited by their instruction set. Even if they are, there may be an argument to move to ARM instead of RISC-V to take advantage of the software already ported because of Apple's transition and the Graviton chips. Windows already runs on ARM but hasn't been announced to run on RISC-V, after all.

snvzz · on Feb 4, 2023

>would be a sudden rise in competition from other players.

This is exactly what I see happening. AMD will have to move to RISC-V to stay competitive, and x86 acceleration is a compelling feature they can offer.

>Windows already runs on ARM but hasn't been announced to run on RISC-V, after all.

I doubt this one will be an issue for long. During last summit, in talks by the RISC-V foundation itself (specifically, the technical ones about ongoing ISA work), Windows was mentioned a few times as the reasoning for some new specifications.

This strongly implies Microsoft is working on Windows for RISC-V, even if Microsoft themselves haven't said a word about it.

als0 · on Feb 5, 2023

> This is exactly what I see happening. AMD will have to move to RISC-V to stay competitive

The year is 2259. The name of the place is Babylon 5.

pedrocr · on Feb 4, 2023

Does the core area math check out on that? Is the x86 instruction decoding area a small enough part of today's cores that having a chip that includes both a RISC-V decoder and a x86 one not a big penalty?

It's also not clear that would be a gain for AMD. x86 has a lot of lock-in and AMD is one of two viable suppliers of it. Helping along a RISC-V transition would create opportunities for attackers that don't need to license x86. AMD doesn't seem to have a reason for that right now. But maybe that's an Innovator's Dilemma kind of situation and they should be cannibalizing the present to setup their future.

flohofwoe · on Feb 4, 2023

I find it incredibly fascinating how the x86 prevailed again and again even though it was a compromise and stopgap solution right from the start (because the "proper" modern Intel CPU of the 80's was supposed to be the iAPX 432).

Narishma · on Feb 4, 2023

'Worse is better' seems to also apply to hardware.