Successful architectures seem to need a certain degree of pragmatism. ARM isn't exactly the RISCiest RISC, nor is AMD64 as baroque as the outer limits of CISC like iAPX 432.
FJCVTZS is an example of pragmatism: the JavaScript spec says float-to-int conversion should behave the way x86 does it, the original ARM FCVTZS (no J) didn't do it the same way, and JavaScript is important enough that you add a special case.
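To make that concrete, here's a minimal C sketch of the semantics FJCVTZS implements in one instruction (ECMA-262's ToInt32: truncate toward zero, wrap modulo 2^32, NaN/±Inf to 0), assuming I'm reading the spec's conversion rules correctly:

```c
#include <math.h>
#include <stdint.h>

/* Sketch of ECMA-262 ToInt32, i.e. what FJCVTZS does in one
 * instruction. Plain FCVTZS saturates out-of-range values instead. */
int32_t js_to_int32(double d) {
    if (!isfinite(d)) return 0;                /* NaN, +/-Inf -> 0 */
    double t = trunc(d);                       /* round toward zero */
    double m = fmod(t, 4294967296.0);          /* wrap modulo 2^32 (exact) */
    if (m < 0) m += 4294967296.0;              /* now in [0, 2^32) */
    if (m >= 2147483648.0) m -= 4294967296.0;  /* map to [-2^31, 2^31) */
    return (int32_t)m;                         /* exact: m fits in int32 */
}
```

A JIT can emit a handful of instructions to do this; FJCVTZS collapses the whole dance into one.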
I hope I'm not mischaracterising the RISC-V side, but I seem to recall their argument against things like FJCVTZS was that there should be some standard sequence of instructions that compilers emit for that special case, and that the instruction decoder on high end CPUs should be magic enough to detect the sequence and do optimal things (fused instructions?). Which kinda felt like "we must keep the instruction set as simple as possible, even if it makes the implementation of high performance CPUs complex". See also the "compressed instructions" stuff, which again feels like passing the buck for complexity onto the CPU implementation side (unless it's just a Thumb-like 16-bit-wide instruction set given a misleading name).
So, with RISC-V the design pretty deliberately enables a combination of compressed instructions and macro op fusion.
The compressed instructions are quite lightweight. It's generally an assembler-level thing, and the decoder on the CPU side is apparently ~400 gates.
The compressed instructions are indeed a 16-bit-wide thing, but they fix some of the flaws in Thumb. Generally they have more implicit operands, or operands that range over a subset of the registers, to fit in 16 bits.
But the neat trick is that these two dovetail into each other, such that a sequence of compressed instructions can decompress into a fusible pair/tuple, which then decodes into a single internal micro-op. This creates a way to handle common idioms and special cases without introducing an ever-growing number of instructions. Or at least that's the basic claim by the RISC-V folks. I think they've done enough homework on this not to be trivially wrong, so it'll be interesting to see how things go.
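As a hypothetical illustration of the kind of idiom they have in mind, here's a plain indexed load in C; the scale/add/load sequence a compiler typically emits for it is one of the patterns the fusion proposals treat as fusible. The assembly in the comment is a sketch, not authoritative compiler output:

```c
#include <stdint.h>

/* An indexed array load. A RISC-V compiler will typically emit
 * something like (exact output varies by compiler and flags):
 *
 *     c.slli a1, 3       # scale index by 8   (compressed, 16-bit)
 *     c.add  a1, a0      # add base pointer   (compressed, 16-bit)
 *     ld     a0, 0(a1)   # load the element
 *
 * A fusing front end can merge the sequence into one internal
 * "indexed load" micro-op; a simple core just runs the three ops. */
int64_t get(const int64_t *base, long i) {
    return base[i];
}
```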
To be honest, I kind of understand this “passing the buck”. In computing in general you never trust the guy up the stack to give you good input. Query engines do filter reordering because they don’t trust the optimizer to get it right. Compilers do optimizations because they (rightfully) don’t trust the programmer to get the order of operations right. CPUs do OOO because they don’t trust compilers to get the order of instructions right.
The way I see it, there are two variants: 1) add a specific instruction (which clutters the instruction set and forces processors that don’t care to implement it anyway), or 2) rely on processors that care to implement instruction fusion, while those that don’t do it the slow way.
Either way, it gets implemented in hardware, and processors that care need to make a change in the front end.
> CPUs do OOO because they don’t trust compilers to get the order of instructions right.
Not really. CPUs do out-of-order execution because cache hits are unpredictable, and it is crucial for single-threaded performance to make progress on dependent operations as soon as a loaded value is available.
There may be other, lower-order factors, but variable memory latency is the real reason.
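A toy example of what that means in practice: in the loop below, whether a[i] hits in cache can't be known at compile time, so no static instruction schedule can hide a miss, while an out-of-order core keeps making progress on the independent sum2 chain:

```c
/* Toy illustration of variable memory latency. The compiler can't
 * know whether a[i] will hit in L1 (a few cycles) or go all the way
 * to DRAM (hundreds of cycles), so it can't schedule around a miss.
 * An OoO core dynamically executes the independent b[i] work while
 * the a[i] load is still in flight. */
long sum_two(const long *a, const long *b, int n) {
    long sum1 = 0, sum2 = 0;
    for (int i = 0; i < n; i++) {
        sum1 += a[i];   /* latency unknown until runtime */
        sum2 += b[i];   /* independent: can proceed under a miss */
    }
    return sum1 + sum2;
}
```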
To defend ARM (what? A RISC-V guy defending ARM?) there is absolutely nothing un-RISC about FJCVTZS. Every instruction set with floating point has some way to convert an FP value to an integer. FJCVTZS is no more complex than the existing FCVTZS -- both truncate toward zero; it simply handles out-of-range values differently (wrapping modulo 2^32 instead of saturating).
I don't know what you think RISC-V "compressed instructions" means. It's precisely equivalent to ARM Thumb2 -- there are 16 bit opcodes and 32 bit opcodes, and you can tell which you have by looking at 2 bits (RISC-V) or 3 bits (Thumb2) in the first 16 bits of the instruction.
I don't believe there is any practical "magical" sequence of instructions that could be easily recognised to implement Javascript conversion from float to int. If that is in fact as important as ARM apparently think it is (I have my doubts) then an equivalent of FJCVTZS should be added to RISC-V as an extension.
As for "making the implementation of high performance CPUs complex" … high end CPUs are unavoidably complex. A little bit more is not a big deal. On the other hand, adding complexity to low end CPUs can easily be a complete deal-killer. Splitting an instruction into µops might be a little simpler than combining instructions into macro-ops, but it's not as simple as not having to do it.
Ironically, the people who criticise RISC-V for talking about macro-op fusion seem to be ignorant of the fact that no currently shipping RISC-V SoC does macro-op fusion [1], while every current higher end ARM and X86 does do macro-op fusion of compare (and maybe other ALU) instructions with a following conditional branch instruction.
[1] The SiFive U74 can tie together a forward conditional branch over a single integer ALU instruction with that following instruction. The two pass down the two execution pipes in parallel (occupying both, i.e. they are still two instructions, not a macro-op). The ALU instruction executes regardless, but the conditional branch controls whether the result is written back, i.e. it effectively converts a branch into predication.
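To make the U74 pattern in [1] concrete, this is the shape of source code it applies to (the assembly in the comment is plausible compiler output, not anything authoritative):

```c
/* The pattern from [1]: a short forward branch over exactly one
 * integer ALU instruction. Plausible RISC-V codegen (illustrative):
 *
 *     beqz a1, 1f          # skip if flag == 0
 *     addi a0, a0, 1       # the single ALU instruction
 * 1:
 *
 * On the U74 both go down the two pipes together and the branch
 * gates the add's writeback -- predication in effect. */
long bump_if(long x, long flag) {
    if (flag) x += 1;
    return x;
}
```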
> I don't believe there is any practical "magical" sequence of instructions that could be easily recognised to implement Javascript conversion from float to int. If that is in fact as important as ARM apparently think it is (I have my doubts) then an equivalent of FJCVTZS should be added to RISC-V as an extension.
They claim 2%, but only in JS code. I'd guess static analysis of the JIT code emitted by v8/JSC/SM for the top 100 websites would give a very accurate estimate of the savings. One of the most fundamental performance boosters is using 31-bit ints instead of doubles, but every single time the user needs to access a number for output, it must be converted to a double to keep the JS spec contract.
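For anyone unfamiliar with the 31-bit int trick: engines tag small integers inside a machine word so ordinary arithmetic never touches the FPU. A minimal sketch of the idea, with a layout that is illustrative rather than V8's actual encoding (V8, for one, tags Smis with a 0 bit, not 1):

```c
#include <stdint.h>

/* Tagged small-integer ("Smi"-style) values: low bit 1 marks an
 * inline integer with 31 bits of payload; low bit 0 would mark a
 * heap pointer. Illustrative layout only -- real engines differ. */
typedef intptr_t jsval;

/* caller guarantees v fits in 31 bits -- hence 31, not 32 */
static inline jsval   box_int(int32_t v) { return (jsval)v * 2 + 1; }
static inline int     is_int(jsval v)    { return (int)(v & 1); }
static inline int32_t unbox_int(jsval v) { return (int32_t)(v >> 1); } /* arithmetic shift assumed */

/* The cost described above: whenever a value escapes to the
 * spec-visible Number world it has to be widened to a double. */
static inline double  as_number(jsval v) { return (double)unbox_int(v); }
```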
All that said, I think only Apple's last 4-6 chips and ARM's most recent generation of chips actually implement the instruction, and people have been fine without it. I'd guess we'll not be seeing this in RISC-V until much lower-hanging fruit has been picked.
> that there should be some standard set of instructions that compilers should emit for that special case, and the instruction decoder on high end CPUs should be magic enough to detect the sequence and do optimal things (fused instructions?)
Detecting a long fixed sequence of instructions and "compressing" them into one internal operation seems like it would require a lot of fetch bandwidth and/or a really wide decoder. x86 has had macro-fusion since Core Solo/Duo.
Those downsides would be real, depending on how awkward the set of instructions is, but on the plus side RISC-V should be able to handle a lot more instructions per cycle in a given power/area budget.