*Of course, this simple function cannot be a replacement for proper analysis, bu...

rwmj · on May 15, 2023

They use a lot of the original 8087 FP instructions (very dense because it's basically a bytecode stack machine). Plus tricks like deriving constants from bytes in the code segment. And you can assume the contents of registers when you enter the code.

Pyrit, a ray tracing demo in 256 bytes, does all of that: https://www.pouet.net/prod.php?which=78045

You probably wouldn't want your general purpose compiler doing this sort of thing! The resulting code would be suboptimal and fragile.
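To make the "constants from bytes in the code segment" trick concrete, here is a minimal sketch in Python. The byte string is made up for illustration and is not taken from Pyrit; it assumes 32-bit x86, where `B8` followed by four bytes encodes `mov eax, imm32`. The helper name `float_at` is hypothetical.

```python
import struct

# Hypothetical code-segment bytes, invented for this example:
#   B8 00 00 80 3F   mov eax, 0x3F800000
#   C3               ret
CODE = bytes([0xB8, 0x00, 0x00, 0x80, 0x3F, 0xC3])

def float_at(code: bytes, offset: int) -> float:
    """Reinterpret 4 bytes of machine code as a little-endian float32."""
    return struct.unpack_from("<f", code, offset)[0]

# The immediate operand of the mov starts one byte into the instruction,
# so the same four bytes serve as both code and data.
constant = float_at(CODE, 1)
print(constant)
```

In a real size-coded intro the load would be a single FPU instruction pointing into the code itself, which saves the four bytes a separate constant would cost. The price is that any edit to the surrounding instructions silently changes the constant.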

brigade · on May 14, 2023

What percentage of code that a CPU will run over its lifetime is demoscene code? Heck, even of just simple hand-optimized assembly a CPU is likely to encounter, what percentage is not vector code? Because x86 vector code typically averages more than 4 bytes per instruction, and I have a suspicion that at least five nines of scalar instructions a CPU executes were generated by a compiler.

userbinator · on May 15, 2023

I mention that to point out the code density limits of x86 are much higher than what measurements using compiler output will show, while on the other hand I haven't seen the same for ARM and suspect that one can't really get much better than compiler output for it or other RISCs.

Having had to patch binaries on multiple occasions by inserting instructions, it is definitely not hard to do so for x86 as one can easily find "slack" that the compiler left behind[1], but I once had to do it for a MIPS binary, and it was definitely not easy to squeeze in the few extra instructions I needed inline; I ended up having to detour to another area with jumps instead.

Here's an old paper where the authors tried to optimise for code density manually, and you can consistently see x86 beating ARM and MIPS:

https://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_den...

[1] See https://news.ycombinator.com/item?id=15720923 for an example.

brigade · on May 15, 2023

Yeah, if code size is the only metric you care about. The second link is an excellent example of code you do not want a compiler to generate by default. Like, besides all the well-known performance pitfalls of microcoded instructions, jecxz is unfusable on I think all relevant CPUs, so it’s both an additional uop and an additional cycle of latency over a test/jz sequence.
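For a sense of the size side of that tradeoff, here is a rough sketch comparing the two sequences by encoded length. It assumes 32-bit mode (in 64-bit mode the `E3` opcode addresses RCX unless prefixed) and uses a placeholder displacement of zero; the function name `size_report` is just for illustration.

```python
# Encodings for 32-bit x86; rel8 is a placeholder displacement of 0.
JECXZ = bytes([0xE3, 0x00])         # jecxz rel8
TEST_ECX_ECX = bytes([0x85, 0xC9])  # test ecx, ecx
JZ = bytes([0x74, 0x00])            # jz rel8

def size_report() -> dict:
    """Compare the byte cost of the compact and the conventional sequence."""
    conventional = TEST_ECX_ECX + JZ
    return {
        "jecxz": len(JECXZ),
        "test+jz": len(conventional),
        "saved": len(conventional) - len(JECXZ),
    }

print(size_report())
```

The byte count captures only size. It says nothing about the uop and latency costs described above, which is why a compiler tuned for speed would still prefer the longer pair.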

saagarjha · on May 16, 2023

Five nines is really high. I don't think this is true, probably because language runtimes have hot paths that are typically implemented by hand. If we drop "scalar" then of course you're dropping below even two nines because of the implementation of str* and mem*.