The Mill faces the same compiler problems that Itanium and other VLIWs have faced. There's only so much available ILP to be extracted, and even that is hard to do. It turns out that the techniques Fisher developed (trace scheduling [1] and its successors) work equally well for superscalars, bringing no net advantage to VLIWs.
The great wide hope lives on, somewhere in the future.
IDK, their register model, "the belt", uses a temporal addressing scheme that seems to lend itself well to software pipelining in a way that's a pain to achieve with a standard register set.
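For anyone unfamiliar with the idea, here's a toy Python sketch (my own illustration — not actual Mill semantics, names, or belt length) of temporal addressing: results drop onto the front of a fixed-length belt and are referenced by how long ago they were produced, so a value's "name" ages as newer results arrive.

```python
from collections import deque

class Belt:
    """Toy model of a Mill-style belt: a fixed-length queue of results,
    addressed by age rather than by register name (illustrative only)."""

    def __init__(self, length=8):
        # newest result sits at index 0; the oldest falls off the end
        self.slots = deque(maxlen=length)

    def drop(self, value):
        # every operation "drops" its result at the front of the belt
        self.slots.appendleft(value)

    def __getitem__(self, pos):
        # b0 is the newest result, b1 the one before it, and so on
        return self.slots[pos]

belt = Belt()
belt.drop(10)   # some op produces 10: it is now b0
belt.drop(20)   # next op produces 20: it becomes b0, and 10 ages to b1
assert belt[0] == 20 and belt[1] == 10
```

The relevance to software pipelining is that an operation in iteration i can name a result from iteration i-1 purely by its age, with no register allocator fighting over names across overlapped iterations.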
Itanium did as well, to no avail: it had direct support for modulo loop scheduling via rotating registers. Also, register renaming (which is itself temporal) is useful for software pipelining.
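To make "modulo loop scheduling" concrete, here's a minimal Python sketch (purely illustrative, not tied to any ISA): a load/compute/store loop is rewritten with a prologue to fill the pipeline, a steady state where each trip runs one stage from each of several in-flight iterations, and an epilogue to drain them.

```python
def naive(xs):
    out = []
    for x in xs:
        v = x                # stage 1: load
        v = v * 2            # stage 2: compute
        out.append(v)        # stage 3: store
    return out

def pipelined(xs):
    """Software-pipelined version: in the steady state, the store,
    compute, and load each belong to a different iteration."""
    out = []
    n = len(xs)
    if n == 0:
        return out
    # prologue: start iteration 0's load
    loaded = xs[0]
    computed = None
    for i in range(1, n):
        computed_next = loaded * 2   # compute for iteration i-1
        loaded_next = xs[i]          # load for iteration i
        if computed is not None:
            out.append(computed)     # store for iteration i-2
        loaded, computed = loaded_next, computed_next
    # epilogue: drain the in-flight iterations
    if computed is not None:
        out.append(computed)
    out.append(loaded * 2)
    return out

assert pipelined([1, 2, 3, 4]) == naive([1, 2, 3, 4]) == [2, 4, 6, 8]
```

On real hardware the payoff is that the steady-state stages come from different iterations and can issue in the same cycle; rotating registers (or a belt) keep the overlapped iterations' values from clobbering one another.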
I think the Mill people should concentrate on what VLIW has had some success with in the past: embedded. There will be tears if they go after general purpose.
Compiler problems are not the sole reason Itanium failed, perhaps not even the primary one. The compilers were not good initially, that's true. But Itanium was killed more by a combination of factors, and especially by AMD64 existing.
The first Itanium sucked and had a very bad memory subsystem. Itanium 2 was pretty competitive in performance, with one major exception: the x86 emulation. Why buy Itanium, even if it's very fast, when you could buy cheaper AMD64 that runs your existing software? Add in that Itanium compilers were few in number, all proprietary (and expensive), and didn't generate very good code for the first few years. Itanium also violated some programmer assumptions: it had a very weak memory model, and merely accessing an undefined value could raise an exception.
Now, Mill has a better way of extracting ILP than Itanium does, compiler technology is much better, and JIT compilers are very common. VLIW processors can be very good JIT targets. Mill, if it ever materializes, has enormous potential.
Obviously I agree about Itanium vs. AMD64. That was a one-punch knockout. It was freaking brilliant of AMD.
However, the Mill doesn't extract ILP; compilers do that, and given a code base there's only so much available. Yes, compiler technology is much better now, and still there's only so much ILP to be had.
Lastly, VLIW has been tried with JITs at least twice: Transmeta's Crusoe and IBM's Latte. VLIW code generation is hard, and it's harder still when you have very little time, which is the nature of a JIT.
Denver is not a VLIW; it's 7-way superscalar [1]. Haswell is 8-way. Wide superscalar is really common and has a lot of the advantages of VLIW without the impossible compiler headaches.
Haswell is generally considered 4-wide [1]. As far as I've heard, Denver really is VLIW. I don't think there are many in-order [2] CPUs that are that wide. I think the article is using "superscalar" loosely (as in "can execute more than one operation per cycle", which is true for VLIW, although the operations are technically all part of the same instruction).
[1] Apparently it can sustain 5 instructions per clock given a very specific instruction mix.
[2] or out of order, really, Power8 being the exception.
It isn't OoO, at least not a 224-entry-window OoO. "7-wide" can simply mean that it has the decode and execution resources for 7 operations per cycle. It's a tablet core; they don't have a 7-wide OoO core in there.
I was at the EE380 talk :). It's amazing how few people show up to EE380 these days.
Yeah, Denver is in-order superscalar, and it's a JIT, but it isn't VLIW. Sad to say, they've tried JITing to in-order superscalar as well. They had a design win with the Nexus, but even now NVidia is switching over to RISC-V for Falcon.
It's not for lack of effort that the JIT approach hasn't really worked; it's competitive but not outstanding. Denver, per the EE380 talk, thrives on bad, bloated code. It's not so good on good code. That's not a winning combination.
Well, Falcon is intended to be a controller, not a (relatively) high-performance core, so that's not an apples-to-apples comparison. If it's not VLIW, then what is it? In-order superscalar? Do you mean superpipelined, or scoreboarded like an ancient Cray?
Bad bloated code is 90%+ of the code in the world ;)
Also, shoot me an email at sixsamuraisoldier@gmail.com you seem to have some inside info on Denver, I'd love to chat (don't worry, I won't steal any secrets ;) )
This. People fail to understand that Itanium wasn't the only representative of VLIW.
Better compiler tech in the past few years (you mentioned JIT, for example, which Denver has adopted to quite good effect) has made VLIW a strong technical contender in several markets. Alas, the overhead cost of OoO is no longer the issue in modern computer architecture.
Denver is a JIT, but the microarchitecture is 7-way superscalar [1]. A lot of the Transmeta people ended up on Denver, and my guess is they didn't want to repeat VLIW.
"The Denver 2.0 CPU is a seven-way superscalar processor supporting the ARM v8 instruction set and implements an improved dynamic code optimization algorithm and additional low-power retention states for better energy efficiency."
The great wide hope lives on, somewhere in the future.
[1] https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/pa...