Debunking CISC vs. RISC code density

dragontamer · on Dec 2, 2022

The definition of RISC has changed.

Remember when RISC didn't have division instructions because that was a complex microcode? I remember.

Over in CISC land, x86 'DIV' was all you needed to divide. In RISC land (ARMv7 or earlier), you needed to write a division loop.

------

With RISC machines getting divide instructions, the delineation of RISC vs CISC is meaningless to me. All RISC machines have taken the best parts of CISC (Divide instruction. SIMD-instructions. AES instructions. Complex addressing schemes), and all CISC machines have taken the best parts of RISC (high register counts, pipelines, load/store architectures)

gumby · on Dec 2, 2022

> Remember when RISC didn't have division instructions because that was a complex microcode? I remember.

> Over in CISC land, x86 'DIV' was...

Don't forget that the same thing existed / exists in the CISC world. In the "old days" floating point was an optional unit, not just on some Mainframes but on early x86s as well, with the 8087 unit, whose quirks were in the design of IEEE 754 (avoiding the quirks!). Vector units and such were also external add ons.

The guiding theme, if there is/was one of CISC architectures was the ability for people to write assembly code. That's why there are all those string manipulation instructions and the like: think of those old instruction sets as basically ALU manipulation plus some convenience subroutines implemented in hardware/microcode.

The breakthrough of the 801 was realizing that with the rise of compilers, those convenience features were no longer needed and all the work required to support them was wasted and could be jettisoned.

I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.

BTW we are still in this functional unit environment, in spades: look at how much die area on the Apple M1 is used for non-CPU computation units.

monocasa · on Dec 2, 2022

> The guiding theme, if there is/was one of CISC architectures was the ability for people to write assembly code. That's why there are all those string manipulation instructions and the like: think of those old instruction sets as basically ALU manipulation plus some convenience subroutines implemented in hardware/microcode.

> The breakthrough of the 801 was realizing that with the rise of compilers, those convenience features were no longer needed and all the work required to support them was wasted and could be jettisoned.

There's a deeper reason behind the change: the designers could assume ubiquitous instruction caches.

The majority of the benefit of microcode during the age of CISC supremacy was that these systems were true von Neumann systems, and every instruction fetch competed with the bus for a data access. The CISCy microcode gave you a pseudo harvard arch where say, your memset could use all of it's cycles actually moving data around.

Once ISA designers could assume I-caches, then now you can give everyone that pseudo harvard architecture benefit for any code they want to write, not just the routines you as the ISA designer think of ahead of time and put into ROM.

As an aside, this is probably what killed off the microcoded virtual machine archs like the lisp and smalltalk machines. Almost all of their benefit was that by putting the interpreter loop into microcode, it's ucode rom fetches didn't compete with the bytecode program and data fetches. Once an I-cache was present, anyone could write their own interpreter with the same properties without having to buy custom hardware. So it wasn't the C language ubiquity that killed them off, but the ubiquitous I-cache.

gumby · on Dec 2, 2022

Never thought of the micromachine as a kind of harvard architecture (when I wrote microcode I just thought of the macroinstructions as data — this was back in the 80s) but it’s an interesting idea.

> As an aside, this is probably what killed off the microcoded virtual machine archs like the lisp and smalltalk machines.

The hardware provided other benefits (pointer and literal tagging, for example; also GC hardware). It wasn’t that C overwhelmed lisp, it was that other functional units on the workstation could provide similar functionality but with the hardware advantages of scale on a bigger customer base.

For example using the pager as a barrier for a transporting GC — essentially you are storing the tag bits in the TLB.

Also most people found interpreters weird (and still do), so speeding that up a little with custom hardware didn’t help many people.

Still, your point about the impact of the I cache is interesting.

rjsw · on Dec 2, 2022

The MIT Lisp Machine microarchitecture looks pretty much like RISC. Pipelined load/store with a single instruction size and lots of registers.

EDIT: the instructions are three-address too.

gumby · on Dec 2, 2022

It was heavily influenced by the original Lisp Machine (PDP-10) which had a simple, regular, and orthogonal instruction set, itself practically the first RISC machine.

rjsw · on Dec 3, 2022

The PDP-10 was not RISC by any definition.

gumby · on Dec 3, 2022

If you look at the old Radin or Henessy papers, indeed the statement might seem far fetched. The PDP-6/10 did not start from the perspective of a powerful compiler, for example.

But the big machines of the 60s and 70s had lots of features (as I alluded to in my root comment) for developers, like BCD support (survived into x86) string manipulation, variable length instructions etc. Just look at the Sperry, IBM and other big machines of the time.

The ‘10s instruction set, as I noted above was quite regular and could be implemented in hardware (as it was in the KA at least): simple, regular, and easily predicted. Utterly the opposite of where the CISC guys were going.

Of course the whole CPU architecture of a machine like the KA was trivial by today’s standards, with no microarchitecture, so to some degree the simplicity of design was a bottom up constraint as well, and in that regard, to loop back to the top of this comment, was the opposite of the motivations that drove the idea of “reduce” in RISC

jcranmer · on Dec 2, 2022

> I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.

There are actually very few such instructions. The BCD arithmetic instructions, the BOUND instruction, MPX (which is already removed in current architectures), arguably the entire x87/MMX instruction sets. Removing x87 is hard because it's required for i386 ABI reasons (floats/doubles are returned on the x87 stack in i386, SSE registers in x86-64). MPX is already axed, and the others stick around only for backwards compatibility (not available in x86-64) and are likely already microcoded already.

Note that the compiler already emits REP STOSB instructions as it's the fastest way to do memcpys these days.

gumby · on Dec 2, 2022

You could probably eliminate the entire 32 bit support, real mode and all of 16 bits, segmentation and such and just run it in emulation. Can modern x86 even run 8080 code?

I wonder how much that would save though. Surely the register file would be easier to implement? Benefits would come from smaller microcode (less code, fewer bugs) and any hardware needed to support it too.

dragontamer · on Dec 2, 2022

> You could probably eliminate the entire 32 bit support, real mode no such and just run it in emulation.

That would literally obsolete every single motherboard on the market, and force those motherboard makers to make a new bootup cycle.

And there's still a lot of programs that run in 32-bit Windows by the way. Like, the near entirety of Good ol Games. I still like playing SimCity 2000, Heroes of Might and Magic, and Panzer General.

> Can modern x86 run 8080 code?

8080 ? Of course not. There was a clean break to 8086.

And then we never dropped compatibility with 8086.

gumby · on Dec 3, 2022

I meant to type 8086 so thanks for answering the question I was trying to ask.

I assume most servers don’t need 32 bit mode.

Then again people are slowly moving to arm on the server anyway.

gpderetta · on Dec 2, 2022

Great point on x87 being set in stone in 32bit ABIs. Specifically Windows were 32bit software is still very common!

gpderetta · on Dec 2, 2022

Backward compatibility. They could microcode all the lesser used instructions, but the surface area of existing code is very large, and intel and AMD care more about running existing code faster than new code.

There is a reason that even the obsolete x87 floating point stack still runs a near optimal speed.

Also I don't think it is very expensive to maintain most rare instructions. The cost is primarily in encoding space, but until they support a different ISA (possibly as an alternate mode), they don't have an option.

There is also the "small" advantage that a very complex architecture is hard to implement, validate, and/or emulate, giving an advantage against the competition.

dragontamer · on Dec 2, 2022

> There is a reason that even the obsolete x87 floating point stack still runs a near optimal speed.

That's because SSE / AVX are faster than x87 floating point instructions. So modern CPUs just microcode-translate the x87 instructions into SSE / AVX micro-ops under the hood.

monocasa · on Dec 2, 2022

They do not translate x87 to SSE/AVX under the hood. It's goofy enough, (not just with the extra precision, but the status word needs to be renamed too) that it has dedicated hardware. Therefore there's a seperate register file that stores x87/mmx state (and avx-512 k mask registers).

https://travisdowns.github.io/blog/2020/05/26/kreg2.html

gpderetta · on Dec 2, 2022

I was going to say that there are no SSE/AVX micro ops and x87, SSE, AVX, AVX512 just get translated to the same internal format that implement the superset of all specific instruction behaviours, but looking at the instruction tables, for example for Ice Lake, you can see that the legacy FADD is converted to exactly one uop that is run on port 5, while ADDSS is also one uop but it can be executed on either port 0 or 1. So it seems that at least Ice Lake still has x87 specific uops.

You can see that something like the legacy FCOS is instead definitely microcoded as it expands to hundreds of uops. This has been the case for at least two decades.

jdsully · on Dec 2, 2022

> I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.

They mostly have - those esoteric instructions are slower than executing the equivalent with more common instructions yourself. Its clearly the bare minimum to support back compat with the least die area possible.

dragontamer · on Dec 2, 2022

> I really don't understand why Intel and AMD haven't fully implemented this point: just implement the instructions that compilers use, plus the ones needed for bootstrapping and kernels. Put all the "legacy" instructions into user mode library code. It would simplify the silicon, likely reducing bugs.

You mean how x87 instructions is microcode emulated on SSE, which is microcode emulated on AVX hardware? (EDIT: I had a tidbit on MMX but I think I got my history wrong there)

None of those x87 instructions "exist" anymore. The CPUs support them, but its just microcode emulation. There's no x87 stack, or 80-bit registers on modern computers anymore. Its all careful emulation.

> The guiding theme, if there is/was one of CISC architectures was the ability for people to write assembly code. That's why there are all those string manipulation instructions and the like

REP MOVSB is actually the fastest way to memcpy on Intel machines, actually, thanks to "enhanced movsb".

It turns out that a single instruction to do memcpy is a really, really good idea.

gpderetta · on Dec 2, 2022

I'm pretty sure that x87 is not microcode emulated (at least the basic arithmetic ops). I'm sure they use the same FPUs as SSE of course.

dragontamer · on Dec 2, 2022

x87 is 80-bit floating point. They literally don't fit inside of 64-bit doubles of SSE.

The extra bits need to be emulated.

EDIT: And I'm sure there's some program out there that actually relies on those extra 16-bits of precision, and they'd be pissed if their least-significant bit had a fraction-of-a-bit more error per operation.

gpderetta · on Dec 2, 2022

They are not emulated, they run at optimal latency (in fact on Ice Lake FADD has better latency than ADDSD!), although at a lower throughput as there are less dedicated execution units.

dragontamer · on Dec 2, 2022

That's a strong point. I guess they really aren't emulated then.

That really makes me wonder how the 80-bits are stored then. I guess the "stack" is just part of the register-renaming mechanism? Huh... AVX registers are 256-bits, so I guess 80-bits fits in each one.

monocasa · on Dec 2, 2022

Its own register file (that is shared with the AVX-512 K mask registers).

https://travisdowns.github.io/blog/2020/05/26/kreg2.html

gpderetta · on Dec 2, 2022

Nice find!

In fact I think I must have read this article when it made the rounds. Then I must have promptly forgotten it :)

gpderetta · on Dec 2, 2022

Yes, x86 stack per se doesn't exist anymore and it is mapped to the general register file. I have no idea how the 80 bits are handled. I thought that the AVX registers mapped to multiple entries in the file, but maybe I'm wrong.

kevin_thibedeau · on Dec 2, 2022

> just implement the instructions that compilers use

They've been doing that since Pentium Pro switched to a RISC core.

e12e · on Dec 2, 2022

Isn't this pretty much what Transmeta tried?

Joker_vD · on Dec 2, 2022

Ha, AMD Am29000 had single-step MUL and DIV instructions, that is, they did a single addition/subtraction and shift; to actually divide two numbers you literally wrote a sequence of 32 identical (except of the very first/last one) MUL or DIV instructions: look at [0], sections "7.1.6. Integer multiplication" and "7.1.7. Integer division" on pp. 203–207.

[0] http://www.bitsavers.org/components/amd/Am29000/1987_Am29000...

aap_ · on Dec 2, 2022

The PDP-1 did this too, unless it was equipped with the automatic multiply-divide extension.

monocasa · on Dec 2, 2022

The SH processors did this too.

repiret · on Dec 2, 2022

Your point is taken, but to be pedantically correct, ARMv7-R has an optional DIV instruction.

brandmeyer · on Dec 2, 2022

Its even optional in ARMv7-A and Cortex-A9 doesn't have it.

msla · on Dec 2, 2022

Here's John Mashey, a designer of MIPS, on what RISC is:

https://danluu.com/risc-definition/

START QUOTE

A: there is a very specific set of characteristics shared by most machines labeled RISCs, most of which are not shared by most CISCs.

The RISC characteristics:

a) Are aimed at more performance from current compiler technology (e.g., enough registers). OR b) Are aimed at fast pipelining in a virtual-memory environment with the ability to still survive exceptions without inextricably increasing the number of gate delays (notice that I say gate delays, NOT just how many gates).

Even though various RISCs have made various decisions, most of them have been very careful to omit those things that CPU designers have found difficult and/or expensive to implement, and especially, things that are painful, for relatively little gain.

I would claim, that even as RISCs evolve, they may have certain baggage that they'd wish weren't there ... but not very much. In particular, there are a bunch of objective characteristics shared by RISC ARCHITECTURES that clearly distinguish them from CISC architectures.

I'll give a few examples, followed by the detailed analysis:

MOST RISCs:

3a) Have 1 size of instruction in an instruction stream

3b) And that size is 4 bytes

3c) Have a handful (1-4) addressing modes) (it is VERY hard to count these things; will discuss later).

3d) Have NO indirect addressing in any form (i.e., where you need one memory access to get the address of another operand in memory)

4a) Have NO operations that combine load/store with arithmetic, i.e., like add from memory, or add to memory. (note: this means especially avoiding operations that use the value of a load as input to an ALU operation, especially when that operation can cause an exception. Loads/stores with address modification can often be OK as they don't have some of the bad effects)

4b) Have no more than 1 memory-addressed operand per instruction

5a) Do NOT support arbitrary alignment of data for loads/stores

5b) Use an MMU for a data address no more than once per instruction

6a) Have >=5 bits per integer register specifier

6b) Have >= 4 bits per FP register specifier

END QUOTE

Not having a hardware division opcode isn't on the list; in fact, the MIPS chips had hardware division, but it was odd in that it used the hi and lo registers and had an architectually-visible latency such that the compiler or human was encouraged to schedule opcodes such that they wouldn't stall the pipeline by trying to read the results of a division right after the division opcode had issued.

https://devblogs.microsoft.com/oldnewthing/20180404-00/?p=98...

The divide opcode also didn't have a divide-by-zero exception. The point is that the MIPS, like a lot of RISC designs, prioritized pipelineability over convenient assembly language behavior, and expected compilers and humans to pick up the slack and write code to implement what, in a CISC design, would have been implemented in microcode.

froh · on Dec 2, 2022

the most common optimization for release builds and the one used by Linux distrubutions is -O2, not -O3. this is justified by real world measurements, btw, alas I don't have the article link at hand on the smartphone. the quintessential learning from that article was: measure and ideally profile before going beyond -O2 .

and to see the size difference, I'd love to see -Os optimize for size used in comparison to -O2/-O3 which is unrolling loops and inlining static functions as it deems fit, beyond the inline keyword (which is a mere hint).

another paradoxical effect of increasing generated code size with aggressive optimizations is that you may outgrow caches: if you're unlucky paging into slow DDR ram becomes necessary in inner loops and the execution speed decreases.

I'd suggest to read the article with an extra grain of salt.

colejohnson66 · on Dec 2, 2022

> [...] Linux distrubutions is -O2, not -O3. this is justified by real world measurements [...]

No. It's because Linus fears -O3 for no reason. He even ordered the removal[1] of the -O3 Kconfig flag[0] (CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3) because "-O3 has a *loong* history of generating worse code than -O2"[1], whatever that means.

From [2]:

> Other upstream kernel developers also criticized that higher optimization level over the default -O2 level due to the risks, particularly with older compilers and memories from times when -O3 tended to be more buggy.

In other words, because of bugs in previous versions that have been fixed, we won't use the feature.

Linus and Co. are sticking their heads in the ground regarding -O3. He says he needs evidence -O3 is good, but doesn't actually provide evidence beyond hearsay that it's bad.

Just because Linus says -O3 is bad doesn't mean he's right.

[0]: https://www.phoronix.com/news/O3-Optimize-Kernel-2022-Patche...

[1]: https://www.phoronix.com/news/Linus-Against-O3-Kernel

[2]: https://www.phoronix.com/news/Linux-6.0-Drops-O3-Kconfig

gpderetta · on Dec 2, 2022

There is a lot of cargo culting in the -O2 decision.

But it also true that O3 enables a lot of loop optimizations that are not particularly relevant for the kernel. Also the kernel is less reliant of more aggressive inlining and interprocedural optimizations than, say, highly abstracted C++ code.

colejohnson66 · on Dec 2, 2022

For sure, -O3 isn't really necessary in the kernel, but saying, "it's so bad that we shouldn't even have a Kconfig flag for it" is a bit extreme

froh · on Dec 2, 2022

debugging ring0 code obfuscated by -O3 is another level of fun. ymmv, however the kernel guys are finding plenty of obscure bugs.

and they have been bitten by aggressive smart optimizations based off undefined behavior.

for example testing a variable for != null after dereferencing it makes no sense. if it was null, it was a segfault and the check is never reached, right?

  foo *x = f();
  x->y();
  if (x == null) {
    /* unreachable in user space! */
    do_sth_about_it();
    ...
  }

so the check can go. except when it can't:

https://news.ycombinator.com/item?id=33770589

here is another recent discussion on -O2 vs -O3 here on HN.

https://news.ycombinator.com/item?id=28895896

seriously, by default, -O2 is for production code. use -O3 only after measuring it makes _your_ code faster and introduces no obscure bugs.

mhh__ · on Dec 3, 2022

I view the null check stuff as being more of an example of the compilers/even kernel Devs not bothering to try and properly express their desires i.e. in this case the check is has wording in the standard covering it, so it should be very explicitly desired to remain in the binary (volatility or similar, although there are limits to what you can ergonomically express in C)

Similarly strict aliasing can change the behaviour of code but if you're genuinely relying on that your code is probably bad - either in the standards view, or in my view in that you can write the same code in a manner that won't cause any mischief (i.e. there are standard-friendly ways to do ugly pointer crap even if they mean memcpying pointers - which will then be eliminated by the optimizer)

froh · on Dec 3, 2022

> there are limits to what you can _ergonomically_ express in C...

indeed.

and in the context of this thread all I'm saying is that "-Oxxx is best, -Oyyy is outdated" is an oversimplification.

spookie · on Dec 2, 2022

If it led to bugs in the past, then the ones who need to defend their opinion are -O3 defendants.

Look, the guy usually comes around when shown facts. He's making sure it's a good decision.

fweimer · on Dec 2, 2022

Inlining heuristics used for -O3 are architecture-specific, so this just shows that things have been tuned in a particular way, probably based on the most prevalent icache sizes. -O2 numbers might be different because of reduced inlining. I would expect the -O2 numbers to reflect better the actual ISA capabilities.

The choice of sub-architecture matters as well. It's probably not fair to compare against outline atomics. In my experiments at least, VEX was more compact even for scalar code, etc.

inkyoto · on Dec 3, 2022

> Inlining heuristics used for -O3 are architecture-specific […]

Yes, the inlining strategy is non-deterministic and does not yield same results across different ISA. Moreover, it does not even guarantee the same result for a single ISA if the code was compiled for a specific submodel of a CPU with a different instruction cache line size.

Especially for the x86 ISA where instructions are variable length, it is common for instructions to inadvertantly spill over into the next I-cache line thus yielding a substantial performance penalty for a performance critical/sensitive code path. Therefore, a common technique for the optimiser was (I have not checked recently tho) to take into account the I-cache line size, group instructions in such a way that if a fat code sequence were to cross the cache line, fill in the rest of the cache line with NOP's and place the fat instruction into the next cache line. Such a problem is nearly non-existant for RISC ISA's although I would surmise that one has to watch out for tight loops anyway.

> I would expect the -O2 numbers to reflect better the actual ISA capabilities.

I would go on to add that today «-O3 -fno-inline-functions» would give a more accurate and faithful reflection of generic ISA capabilities today. For a long time, «-O3» was no more than «-O2 -finline-functions»; however, since then further optimisations have been added to «-O3» that are rather useful generic optimisations for modern CPU's (i.e. loop vectorisation and more). The article is especially lacklustre in this particular regard as the author does not go beyond generic bloviations and does not make an attempt to understand what hides beyond «-O3».

mbitsnbites · on Dec 2, 2022

> I would expect the -O2 numbers to reflect better the actual ISA capabilities.

I will have a more thorough look at -O2 too, good point.

However w.r.t. "ISA capabilities" the simple fact is that compilers don't use CISC style instructions. They prefer RISC style load/store + many GPRs code, and new x86_64 instructions are modelled after this paradigm. Hence I wouldn't expect there to be a huge difference at -O2 (and in my experience x86_64 often loses at -O2 too).

fweimer · on Dec 2, 2022

I think for x86-64, the 32-bit immediate operands totally qualify as CISC-style because they contribute to the variable length nature of the instruction encoding. And GCC uses them all the time, of course. GCC has a surprising tendency to use the string instructions for memset and memcpy, too. In the other direction, without -march=x86-64-v3, GCC cannot even use most of the more RISC-style three-operand instructions.

mbitsnbites · on Dec 2, 2022

The 32-bit operands are actually hurting code density most of the time. An instruction that would take 4 bytes in a RISC ISA takes at least 5 bytes in a CISC ISA.

Same thing with the 32-bit offsets for x86_64 conditional branches.

Edit: With "CISC style" I meant those instructions that use implicit registers instead of GPRs and/or do fancy multi-operation stuff, like LOOP or POPA/PUSHA etc.

ajross · on Dec 2, 2022

Came here to say exactly this. And in fact x86_64 has historically gotten a ton more tuning and (at least in my experience) will more aggressively inline at pretty much all optimization levels.

I really don't know why an article about code density chose that level instead of -Os or -O2.

mbitsnbites · on Dec 2, 2022

Simple: Optimizing for small code size is uninteresting for 99% of the software that a CPU runs. What matters is performance (especially when you argue that dense code is good for performance), and most real world software is optimized for performance.

I get the argument of -O2 vs -O3 and I'll be sure to measure that. However my experience is that x86_64 does pretty poorly in -O2 too compared to RISC.

gpderetta · on Dec 2, 2022

Sure, but a static code size analysis has very very little to do with performance as you will be weighting equally code that is executed millions of times with code that is never executed.

mbitsnbites · on Dec 2, 2022

I agree, but in absence of an accurate method this felt like the second best method.

Then there's also the elephant in the room that I'd like to write an article about some time: The decoder/translator + uop-cache in the front end has a devastating effect on instruction fetch & decode performance. That silicon eats power, could be used for better things (larger L1I cache etc), adds latency, limits how wide you can decode, and so on.

Rationale: CISC is not just about density. With good RISC you can get much better fetch & decode bandwidth (all other things being equal). E.g. see Apple silicon.

gpderetta · on Dec 2, 2022

I don't think that x86 implementations are transistor limited. In fact Intel had to slap giant vector ALUs to find an use for them. And x86 L1 size is unfortunately limited by the page size, so you can't really trade one for the other.

Complex decoders do consume power of course, but I don't think they have a huge effect on the thermal budget. I also don't think they have a huge effect on latency and the uop L0 cache actually improves latency.

They make it harder to scale to higher width of course, but it seems that it hasn't been a huge obstacle so far.

mbitsnbites · on Dec 2, 2022

One of the problems with the decoder is that it's "always on" so it always draws power (unlike the SIMD unit, for instance).

I also think that the uop L0 cache is closer to where the L1I cache is in a fixed width RISC implementation. The L1I cache of an x86 machine is quite far away from dispatch (in terms of pipeline stages). I think that z/Arch has something like 5-10 stages between L1I and dispatch, for instance.

And if you start comparing the uop cache with an L1I$ of a fixed width RISC machine, things dont't look good for CISC (the uop cache is extremely inefficient in terms of capacity/silicon, only holding a handful kuops). It's probably not an entirely fair comparison, but neither is comparing the L1I$ of a CISC machine with that of a RISC machine.

I don't think it has so much to do with being transistor limited as it has to do with keeping the latency sensitive parts tight and avoiding unnecessary pipeline stages etc. It's "easy" to throw transistors on L2 & L3 cache, but minimising branch misprediction penalties and keeping a wide pipeline 100% fed with instructions all the time is trickier.

muricula · on Dec 2, 2022

Why would x86 L1 be limited by the page size? The cache works at 128 byte granularity. If it's too big it might increase tlb pressure I suppose, but it makes sense that increases in L1 would be best accompanied by increases in tlb size.

gpderetta · on Dec 2, 2022

It is a quirk of VIPT caches. After a certain point you can't increase index size, only ways and quickly reach diminishing returns.

There might be of course workarounds, but intel L1 cache sizes have been more or less stable for about a decade

mbitsnbites · on Dec 3, 2022

I believe that ARM has a patent on a workaround. Apple M1 has 192KB L1I.

gpderetta · on Dec 3, 2022

Apple has a larger page size though.

mbitsnbites · on Dec 3, 2022

The article has been updated with -O2 numbers.

Spoiler alert: ARM & RISC-V still win over x86 & z/Arch.

chasil · on Dec 2, 2022

It is far more impactful upon code density to compile 32-bit binaries, and eschew 64-bit.

Solaris was known to do precisely this for everything in /bin. I recently saw this in action on a copy of SmartOS, and I imagine that OpenIndiana does the same.

It's a shame that no major Linux distribution has openly performed this analysis for x86-64 and AArch64. ARM Thumb is also the smallest Busybox available for ARM 32-bit.

On the other hand, two copies of libc are essentially thrashing the instruction cache.

pclmulqdq · on Dec 2, 2022

Most common x86-64 instructions are 4 bytes already. The next most common instruction size used in many compiled 64-bit programs is 3 bytes, followed by 5 bytes.

I believe that AArch64 still uses 4-byte encodings for many instructions too, with a few that need 8 bytes.

chasil · on Dec 2, 2022

Looking closer, I see that the difference between the latest busybox i686 and x86_64 versions is not large:

  busybox (i686)   2022-01-17 17:40  1.0M
  busybox (x86_64) 2022-01-17 18:53  1.1M

However, the savings on ARM Thumb appears to be larger, even for the various 32-bit binaries (v7m is Thumb, as I understand it):

  busybox-armv7m 2019-06-10 14:02  867K
  busybox-armv5l 2019-06-10 14:02  1.1M
  busybox-armv7l 2019-06-10 14:02  1.1M
  busybox-armv7r 2019-06-10 14:02  1.1M
  busybox-armv8l 2019-06-10 14:02  1.1M

Strange that Super-H is this large, as it inspired Thumb:

  busybox-sh4    2019-06-10 14:02  1.0M

https://busybox.net/downloads/binaries/

pclmulqdq · on Dec 2, 2022

Thumb gives a lot of 2-byte instructions (like the RISC-V C extension), which makes sense for it to significantly reduce code size. ~20-30% is about the savings you would expect.

x86 has the huge disadvantage that it can't use 2-byte encodings for common instructions, since a lot of those are used by 16-bit and 8-bit instructions (which are almost completely unused today).

retrac · on Dec 2, 2022

SH4 is relatively compact (16-bit instructions) but it isn't really compressed in the same way Thumb is. They shaved some corners to make things fit; e.g. there's only room for a 4 bit displacement in a load [R + disp] -> R type instruction. If you want to use an 8 bit displacement you have to use R0. This naturally leads to a lot of two-instruction sequences to calculate an offset into the stack, etc. which ends up practically more like classic MIPS in density.

mbitsnbites · on Dec 2, 2022

Code density is not all about instruction length. It's also about instruction count. With shorter instructions you usually need to use more instructions to do the same thing (e.g. extra MOVs or stack spill etc).

chasil · on Dec 2, 2022

The[/your?] article mentions two current concerns with code density, "Cache hit ratio," and "Instruction fetch bandwidth."

Does the footprint of Thumb (including total bytes, count of opcodes, speed, likely other things that I have not thought of) impact the conclusions in the paper?

Interesting that Thumb has been removed from AArch64, and that Intel never added anything like it.

mbitsnbites · on Dec 2, 2022

My experience from Thumb was that it makes code slower. Probably because the CPU had to execute more insteuctions than in "ARM" mode.

Thumb was a thing for embedded systems with very limited RAM etc. It was not designed for optimal speed.

My guess is that ARM64 targeted higher end, and that severe memory constraints were no longer considered a big issue. ARM has its Cortex M4 (Thumb only microcontrollers) and the likes for those markets.

adrian_b · on Dec 2, 2022

The comparison was not really unbiased, because for RISC-V the compressed instruction extension was used, which artificially makes RISC-V appear to have shorter programs.

RISC-V with the compressed instruction extension should be compared only with ARMv8-M (i.e. Thumb2), nanoMIPS or other such ISA's, which target similar levels of CPU performance, not with ARMv8-A, which is intended to be implemented at much higher levels of performance.

The graph with the number of instructions reveals that the RISC-V programs were not shorter than the AArch64 programs, but longer by more than 12% (all AArch64 and RISC-V instructions have the same length).

RISC-V has only one great feature that is the cause of a significant program length reduction, the combined compare-and-branch instructions, which save one instruction, i.e. one 32-bit word, at each conditional branch, i.e. at every 4 to 5 instructions, in comparison with AArch64.

However the other weaknesses of RISC-V are great enough that even with the compare-and-branch advantage the RISC-V programs end longer than the AArch64 programs.

monocasa · on Dec 2, 2022

RV-C is intended to be implemented on high perf systems; it's not some tiny, embedded core only extension like ARMv8-m.

And I don't think it's misleading, instruction count isn't the only metric that matters. I$ pressure is very real. So they show both metrics.

gpderetta · on Dec 2, 2022

It is meant for high performance implementations. But do they exist already?

monocasa · on Dec 2, 2022

The numbers have been run on simulations of large wide cores, and the benefit of RV-C is pretty clear. Although since the release of the M1, I agree that there's probably a need for a BOOMv4 to publicly explore the problem space.

Going into rumor town: my understanding is that all of the companies working on high perf core are implementing RV-C, including those made of ex-Apple employees who worked on their cores. The tiny bit of extra decode complexity more than pays dividends in I$ pressure (which from a design perspective can let you get away with less I$, and therefore lower latency I$).

gpderetta · on Dec 2, 2022

That's interesting! Is there a short reference on RV-C?

knorker · on Dec 2, 2022

Using RISC-V compressed instructions is not cheating. It's part of the design. It would not be a realistic comparison if you arbitrarily left it out.

imtringued · on Dec 3, 2022

In RISCV compressed instructions are just an extension that you can mix with uncompressed instructions. There is no reason to not use them. Thumb2 is a completely different instruction set that is mutually exclusive with the uncompressed instruction set.

If you want to be pendantic then RISC V compression is just adding more 32bit mini VLIW instructions.

AtlasBarfed · on Dec 2, 2022

I think we need to reexamine the "code size doesn't matter". Because as serial code execution has petered out, know what a great way to speed things up is? Keep your code in cache.

if serial code speed is done in improvement, you can (especially with all the excess silicon real estate since moore's law the transistor count is progressing still) hardware implement code.

Or you can fit it in the closest/fastest cache to CPUs there is.

Which is why I think while the last two decades of computing were owned by Java, Python, Ruby, and Javascript, we will swing back to far less bloated languages in the next thirty years and start streamlining.

CISC approaches may be needed in code to annotate what they do, with exploding core counts, you also need what the JVM does on steroids: you want something that can intelligently schedule compiled code against many cores. Right now, this is the domain of the programmer.

Wave the magic AI wand?

gpderetta · on Dec 2, 2022

The thing is that code prefetches extremely well due to good branch prediction. On the other hand a cache miss can worsen the misprediction penality.

gpderetta · on Dec 2, 2022

> Dynamic code density

[...]

> We will not go down that rabbit hole.

It seems to me that, unless you are on an embedded system, the dynamic code density is the only thing that matter. A proxy is to simply measure the I$ hit rate for similar pieces of code on similar machines, although this is very machine dependent (then again, code size in a vacuum is not terribly important).

Measuring the static space taken by instructions is also very misleading as some RISCs do not have a good support for inline constants and need to load them from the constant pool (which I think is not in the code segment) and has non trivial effects on cache efficiency.

Then again, with all prefixes, x86 is not particularly dense.

mbitsnbites · on Dec 2, 2022

Just using I$ hit ratio is problematic in many ways. E.g:

- You'll probably not find implementations of different ISAs with identical cache configurations (size, associativity etc).

- It says little about what work is actually done (different ISAs = insns do different amounts of work).

- On x86 all bets are off w.r.t. the effect of the uop cache on the L1I cache hit ratio, and the uop cache hit ratio can't be compared to any other machine.

- You need to reproduce the same program flow on different architectures to be able to compare the numbers.

...etc.

I think that the only reasonable way to do it is to have a multi-ISA simulator where you are in full control of all these aspects. And it would be really hard work.

gpderetta · on Dec 2, 2022

Re 2, the work per instruction doesn't matter if you compare the same program/program execution, in practice you will get an estimated of the resident set size over the amount of work.

All your other points do stand and that's what I mean with 'is very machine dependent'. And yes, if you want to isolate fully the effect of instruction density an emulator might be the only solution. Still I think that profiling counters can get you 90% there.

mbitsnbites · on Dec 2, 2022

I just don't see how I would run profiling counters on a z15 machine for instance ;-)

With my limited resources this got me much more architecture coverage.

And as usual, the answers are in the data, it's merely a matter of what questions you think that you are asking...

gpderetta · on Dec 5, 2022

Yes, z15 is an issue :)

Still an arm vs x86 should be doable and maybe even riscv.

Symmetry · on Dec 2, 2022

The "Network transfer speed / cost." issue was worse than this make it seem. In the first CISC designs you were reading in the instruction one byte at a time over the same bus that you would use to load or store data. An extra byte in the instruction meant that the instruction took an extra cycle to complete.

bell-cot · on Dec 2, 2022

> ...In the first CISC designs...

In the x86 world, this issue applied all the way through the 80386 CPU. Or through the 80486, if you're a bit harsher in judging it's unified (code + data) 8KB cache*.

(True, not one byte at a time once you got to the 8086. But the 32-bit 80386 and 80486 only had 32-bit busses, and those were still serious bottlenecks.)

*The cache was "write-through" on all but a very few variants, yielding extra congestion on the bus.

akuma73 · on Dec 2, 2022

This isn't totally surprising.

x86 vs. ARM/RISC-V has fewer registers which can mean more spill/fill to the stack which results in more instructions to do the same work.

The instruction decode length for x86 has been creeping up with 64-bit encodings and SIMD extenstion (prefix bytes) as you can see it's 3.96 for x86 vs. 4.00 for ARM.

elygre · on Dec 2, 2022

Whenever risc va cisc is discussed, I remember reading that it’s not a “reduced set of instructions”, but rather a “set of reduced instructions”. I always think of various addressing modes, and rather having the divide instruction know about these, risc would force you to handle those in separate steps.

babel_ · on Dec 2, 2022

Reaching for -O3 as the only option feels like a problem with the methodology. Is there the chance for -O2 or -Os as supplementary graphs? Or stacked / marked onto the existing graphs?

-O3 is optimising for speed at a space-cost, after all, so it hardly feels like the absolute-correct option for this (hence my suggestion for, at the very least, -O2 as well, and -Os would be nice for contrast of "how dense" they _can_ get).

mbitsnbites · on Dec 2, 2022

Will have a look at -O2, time permitting.

codedokode · on Dec 2, 2022

Totally unfair in my opinion:

1) They have excluded i386. It is well known that 64-bit archs waste lot of memory and 32-bit arch could probably save something

2) They should have disabled position-independent code generation because on i386 it takes more memory

3) Instead of optimizing for speed whole program it is better to optimize for speed only "hot" parts and optimize the rest for size. Or optimize everything for size.

bell-cot · on Dec 2, 2022

> 1) They have excluded i386...

From a few old articles, my impression was that x86-64 code is, in a fair number of cases, notably denser & faster than the i386 equivalent. Main reason why - only 8 architectural registers for the i386 code to use, vs. 16 for the x86-64 code. So the i386 CPUs can waste a lot of instructions & time shuffling data between registers and memory, because they've run out of registers.

knorker · on Dec 2, 2022

1) The article is checking what's the case in 2022. In my opinion it's quite reasonable to not let 32bit compete at all. (though there is a 32bit RISC there)

2) do you mean also on 64bit x86? Yeah, probably would be nice to see. But then again I would say that today's code is position independent. It's not useful with choice of ISA to compare a mode you would not run anyway.

3) That's not the argument this article aims (successfully) to debunk, though.

gpderetta · on Dec 2, 2022

Re point 2, AMD64 supports IP relative addressing, so PIC code is more efficient than x86 and it is not penalized compared to the RISCs.

mbitsnbites · on Dec 2, 2022

1) i386 is not representative of bleeding edge CPU architectures (as mentioned in the article). It would only be icluded for purely academic reasons - which was out of scope.

2) ...

3) In the real world most programs use a single optimization/tuning config for the entire program.

The article aimed to analyze real world programs running on contemporary modern architectures.

codedokode · on Dec 2, 2022

> i386 is not representative

Would it matter if it had better code density? "Modern" doesn't necessarily mean that it is better in every aspect. For example, there were articles claiming that same application compiled to 64 bits uses more memory than 32-bit version.

mbitsnbites · on Dec 2, 2022

It wouldn't matter. I still would not buy an i386, compile my programs in 32-bit mode instead of 64-bit, nor use the i386 ISA as a model when designing a new ISA.

Same thing with 6502, Z80, Vax, etc.

What matters is performance, and i386 code does not give as good performance as x86_64 code or modern RISC code (it doesn't have as many GPRs etc so it can't).

pizlonator · on Dec 2, 2022

I don’t buy it.

There’s nothing preventing GCC from having different inlining/outlining heuristics as well as different cost models for versioning based on target.

Unless you control for that, the results are meaningless.

My own tests say that x86 and arm thumb2 are about equally compact and everything else is fatter. But even those tests probably lacked all the controls you’d need to get it right.

Joker_vD · on Dec 2, 2022

> There’s nothing preventing GCC from having different inlining/outlining heuristics as well as different cost models for versioning based on target.

Does it actually have that different logic? Because otherwise this sounds like that tired "nothing prevents the compiler to be arbitrarily smart" argument — sure, nothing prevents that except for the fact that someone has to actually implement this smartness.

pizlonator · on Dec 4, 2022

I know that llvm has different logic.

I know that every compiler that I’ve ever written ends up having different logic for this based on target, for a myriad of complex reasons.

So I would assume GCC landed in the same place as anyone writing a complex compiler.

mbitsnbites · on Dec 2, 2022

The results mean what they mean: If you compile your software with GCC -O3 you will typically get smaller code and fewer instructions on average for modern fixed size RISC ISAs than for modern CISC ISAs.

You would have to do further and other investigations to draw any other conclusions.