> Remember when RISC didn't have division instructions because that was a complex microcode? I remember.
> Over in CISC land, x86 'DIV' was...
Don't forget that the same thing existed / exists in the CISC world. In the "old days" floating point was an optional unit, not just on some mainframes but on early x86s as well, with the 8087 unit, whose quirks informed the design of IEEE 754 (which took care to avoid them!). Vector units and such were also external add-ons.
The guiding theme of CISC architectures, if there was one, was the ability for people to write assembly code by hand. That's why there are all those string manipulation instructions and the like: think of those old instruction sets as basically ALU manipulation plus some convenience subroutines implemented in hardware/microcode.
The breakthrough of the 801 was realizing that with the rise of compilers, those convenience features were no longer needed and all the work required to support them was wasted and could be jettisoned.
I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.
BTW we are still in this functional unit environment, in spades: look at how much die area on the Apple M1 is used for non-CPU computation units.
> The guiding theme of CISC architectures, if there was one, was the ability for people to write assembly code by hand. That's why there are all those string manipulation instructions and the like: think of those old instruction sets as basically ALU manipulation plus some convenience subroutines implemented in hardware/microcode.
> The breakthrough of the 801 was realizing that with the rise of compilers, those convenience features were no longer needed and all the work required to support them was wasted and could be jettisoned.
There's a deeper reason behind the change: the designers could assume ubiquitous instruction caches.
The majority of the benefit of microcode during the age of CISC supremacy was that these systems were true von Neumann machines, where every instruction fetch competed with data accesses for the bus. CISCy microcode gave you a pseudo-Harvard architecture where, say, your memset could spend all of its bus cycles actually moving data around.
Once ISA designers could assume I-caches, you could give everyone that pseudo-Harvard benefit for any code they wanted to write, not just the routines the ISA designer thought of ahead of time and put into ROM.
As an aside, this is probably what killed off the microcoded virtual machine architectures like the Lisp and Smalltalk machines. Almost all of their benefit was that by putting the interpreter loop into microcode, its ucode ROM fetches didn't compete with the bytecode program and data fetches. Once an I-cache was present, anyone could write their own interpreter with the same properties without having to buy custom hardware. So it wasn't the ubiquity of C that killed them off, but the ubiquitous I-cache.
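To make that concrete, here is a minimal sketch in C (the opcodes and the run function are made up for illustration, not taken from any real machine). The whole dispatch loop is a few dozen bytes of hot code, so on a machine with an I-cache it stays resident and instruction fetch stops competing with bytecode and data fetches for the bus: the same property the microcoded interpreters got from their ucode ROM.

```c
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static void run(const unsigned char *code) {
    long stack[64];
    int sp = 0;
    for (;;) {
        switch (*code++) {            /* the hot dispatch loop */
        case OP_PUSH:  stack[sp++] = *code++;            break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%ld\n", stack[sp - 1]);   break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    /* A tiny program: push 2, push 40, add, print (prints 42). */
    const unsigned char prog[] = { OP_PUSH, 2, OP_PUSH, 40, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);
    return 0;
}
```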
Never thought of the micromachine as a kind of Harvard architecture (when I wrote microcode back in the 80s I just thought of the macroinstructions as data), but it's an interesting idea.
> As an aside, this is probably what killed off the microcoded virtual machine archs like the lisp and smalltalk machines.
The hardware provided other benefits (pointer and literal tagging, for example; also GC hardware). It wasn't that C overwhelmed Lisp, it was that other functional units on the workstation could provide similar functionality, with the advantage of hardware economies of scale from a much bigger customer base.
For example, using the pager as a barrier for a transporting GC: essentially you are storing the tag bits in the TLB.
Also most people found interpreters weird (and still do), so speeding that up a little with custom hardware didn’t help many people.
Still, your point about the impact of the I cache is interesting.
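For what it's worth, the pager-as-barrier trick above can be sketched on stock hardware with nothing more than mprotect and a SIGSEGV handler. This is only an illustrative skeleton: scan_and_forward_page is a hypothetical stand-in for a real collector's pointer fixups, and calling mprotect from a signal handler isn't formally async-signal-safe, even though this is the classic way the trick is done.

```c
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

/* Hypothetical stand-in: a real transporting GC would scan this page and
   forward/fix up the pointers on it before letting the mutator proceed. */
static void scan_and_forward_page(void *page) { (void)page; }

static void fault_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1));
    scan_and_forward_page(page);                        /* barrier work */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* then let it through */
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);

    /* Pretend this page holds objects the collector hasn't processed yet. */
    char *heap = mmap(NULL, page_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* First touch traps, the handler "scans" and unprotects the page, and
       the faulting store is retried transparently. */
    heap[0] = 42;
    printf("mutator sees %d after the barrier fired\n", heap[0]);
    return 0;
}
```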
It was heavily influenced by the original Lisp machine, the PDP-10, which had a simple, regular, and orthogonal instruction set and was itself practically the first RISC machine.
If you look at the old Radin or Hennessy papers, indeed the statement might seem far-fetched. The PDP-6/10 did not start from the perspective of a powerful compiler, for example.
But the big machines of the 60s and 70s had lots of features for developers (as I alluded to in my root comment), like BCD support (which survived into x86), string manipulation, variable-length instructions, etc. Just look at the Sperry, IBM, and other big machines of the time.
The '10's instruction set, as I noted above, was quite regular and could be implemented directly in hardware (as it was in the KA at least): simple, regular, and easily predicted. Utterly the opposite of where the CISC guys were going.
Of course, the whole CPU architecture of a machine like the KA was trivial by today's standards, with no microarchitecture to speak of, so to some degree the simplicity of design was a bottom-up constraint as well. In that regard, to loop back to the top of this comment, it was the opposite of the motivations that drove the idea of "reduce" in RISC.
> I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.
There are actually very few such instructions: the BCD arithmetic instructions, the BOUND instruction, MPX (already removed in current architectures), and arguably the entire x87/MMX instruction sets. Removing x87 is hard because it's required for i386 ABI reasons (floats/doubles are returned on the x87 stack in i386, in SSE registers in x86-64). MPX is already axed, and the others stick around only for backwards compatibility (they aren't available in x86-64) and are likely microcoded already.
Note that the compiler already emits REP MOVSB/STOSB instructions, as that's the fastest way to do memcpy/memset these days.
You could probably eliminate the entire 32-bit support, real mode and all of the 16-bit stuff, segmentation and such, and just run it in emulation. Can modern x86 even run 8080 code?
I wonder how much that would save though. Surely the register file would be easier to implement? Benefits would come from smaller microcode (less code, fewer bugs) and from dropping any hardware needed to support it.
> You could probably eliminate the entire 32-bit support, real mode and all of the 16-bit stuff, segmentation and such, and just run it in emulation.
That would literally obsolete every single motherboard on the market, and force those motherboard makers to come up with a new boot process.
And there are still a lot of programs that run as 32-bit on Windows, by the way. Like the near entirety of Good Old Games. I still like playing SimCity 2000, Heroes of Might and Magic, and Panzer General.
> Can modern x86 run 8080 code?
8080? Of course not. There was a clean break to 8086.
And then we never dropped compatibility with 8086.
Backward compatibility. They could microcode all the lesser-used instructions, but the surface area of existing code is very large, and Intel and AMD care more about running existing code faster than new code.
There is a reason that even the obsolete x87 floating point stack still runs at near-optimal speed.
Also I don't think it is very expensive to maintain most rare instructions. The cost is primarily in encoding space, but until they support a different ISA (possibly as an alternate mode), they don't have an option.
There is also the "small" advantage that a very complex architecture is hard to implement, validate, and/or emulate, giving an advantage against the competition.
> There is a reason that even the obsolete x87 floating point stack still runs at near-optimal speed.
That's because SSE / AVX are faster than x87 floating point instructions. So modern CPUs just microcode-translate the x87 instructions into SSE / AVX micro-ops under the hood.
They do not translate x87 to SSE/AVX under the hood. It's goofy enough (not just the extra precision, but the status word needs to be renamed too) that it has dedicated hardware. Therefore there's a separate register file that stores x87/MMX state (and the AVX-512 k mask registers).
I was going to say that there are no SSE/AVX micro-ops and that x87, SSE, AVX, and AVX-512 all get translated to the same internal format implementing the superset of the specific instruction behaviours, but looking at the instruction tables, for example for Ice Lake, you can see that the legacy FADD is converted to exactly one uop that runs on port 5, while ADDSS is also one uop but can execute on either port 0 or 1. So it seems that at least Ice Lake still has x87-specific uops.
You can see that something like the legacy FCOS is instead definitely microcoded, as it expands to hundreds of uops. This has been the case for at least two decades.
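If you want to poke at this yourself, a crude dependent-chain timing loop is enough to see that plain x87 addition runs natively rather than being trapped and emulated. This is only a rough sketch, assuming GCC or Clang on x86-64 (where long double lowers to x87 FADD and double to SSE2 ADDSD); rdtsc counts reference cycles, so treat the numbers as relative, not as the exact figures from the instruction tables.

```c
/* Build with: cc -O2 chain_bench.c */
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define N 100000000L

/* Time a serial dependency chain of additions; without -ffast-math the
   compiler cannot reassociate it, so each add waits for the previous one. */
static double bench_sse(void) {
    volatile double seed = 1.0;                 /* defeat constant folding */
    double x = seed;
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < N; i++) x += 1.0;      /* ADDSD chain */
    unsigned long long t1 = __rdtsc();
    printf("(sse result %g)\n", x);             /* keep x live */
    return (double)(t1 - t0) / N;
}

static double bench_x87(void) {
    volatile long double seed = 1.0L;
    long double x = seed;
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < N; i++) x += 1.0L;     /* FADD chain (80-bit) */
    unsigned long long t1 = __rdtsc();
    printf("(x87 result %Lg)\n", x);
    return (double)(t1 - t0) / N;
}

int main(void) {
    printf("ref cycles per ADDSD: %.2f\n", bench_sse());
    printf("ref cycles per FADD:  %.2f\n", bench_x87());
    return 0;
}
```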
> I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.
They mostly have: those esoteric instructions are slower than doing the equivalent yourself with more common instructions. It's clearly the bare minimum to support backwards compatibility with the least die area possible.
> I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.
You mean how x87 instructions are microcode-emulated on SSE, which is microcode-emulated on AVX hardware? (EDIT: I had a tidbit on MMX here but I think I got my history wrong.)
None of those x87 instructions "exist" anymore. The CPUs support them, but it's just microcode emulation. There's no x87 stack or 80-bit registers on modern computers anymore. It's all careful emulation.
> The guiding theme of CISC architectures, if there was one, was the ability for people to write assembly code by hand. That's why there are all those string manipulation instructions and the like
REP MOVSB is actually the fastest way to memcpy on recent Intel machines, thanks to "enhanced REP MOVSB" (ERMSB).
It turns out that a single instruction to do memcpy is a really, really good idea.
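For illustration, here's roughly what that looks like spelled out by hand with GCC/Clang inline assembly. movsb_memcpy is a made-up helper, not a library function; on CPUs that advertise the ERMSB feature bit, the microcode internally uses wide copies, which is why this tiny loop-free routine is competitive with hand-tuned SIMD memcpy for many sizes.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not a standard function): memcpy as one REP MOVSB.
   RDI = destination, RSI = source, RCX = byte count; the ABI guarantees the
   direction flag is clear, so the copy runs forward. */
static void *movsb_memcpy(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     : /* no pure inputs */
                     : "memory");
    return ret;
}

int main(void) {
    char src[64] = "a single instruction that does the whole copy";
    char dst[64] = {0};
    movsb_memcpy(dst, src, sizeof src);
    puts(dst);
    return 0;
}
```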
x87 is 80-bit floating point. Those values literally don't fit inside the 64-bit doubles of SSE.
The extra bits need to be emulated.
EDIT: And I'm sure there's some program out there that actually relies on those extra 16 bits, and its authors would be pissed if their least-significant bit had a fraction-of-a-bit more error per operation.
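A quick way to see those extra bits, assuming a toolchain where long double is the 80-bit x87 format (GCC/Clang on x86; MSVC maps long double to plain double, so this particular demo shows no difference there):

```c
#include <float.h>
#include <stdio.h>

int main(void) {
    /* Significand widths: typically 53 bits for double, 64 for x87 extended. */
    printf("DBL_MANT_DIG  = %d\n", DBL_MANT_DIG);
    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);

    /* 1 + 2^-54 rounds back to 1.0 in a 64-bit double, but is exactly
       representable in the 80-bit extended format. */
    double d       = 1.0  + DBL_EPSILON / 4;
    long double ld = 1.0L + (long double)DBL_EPSILON / 4;

    printf("double:      1 + 2^-54 == 1 ? %s\n", d  == 1.0  ? "yes" : "no");
    printf("long double: 1 + 2^-54 == 1 ? %s\n", ld == 1.0L ? "yes" : "no");
    return 0;
}
```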
They are not emulated; they run at optimal latency (in fact, on Ice Lake FADD has better latency than ADDSD!), although at a lower throughput, as there are fewer dedicated execution units.
That's a strong point. I guess they really aren't emulated then.
That really makes me wonder how the 80 bits are stored then. I guess the "stack" is just part of the register-renaming mechanism? Huh... AVX registers are 256 bits, so I guess 80 bits fits in each one.
Yes, the x87 stack per se doesn't exist anymore and it is mapped onto the general register file. I have no idea how the 80 bits are handled. I thought the AVX registers mapped to multiple entries in the file, but maybe I'm wrong.