I just watched a video on branchless programming, and it blew my mind that the guy got a 3.36s function to return in 0.48s by using more code. The idea of trying to outsmart a compiler or runtime is motivating me to pick coding back up after 8 years away.
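For anyone who hasn't seen the technique, here is a rough sketch of the general idea. This is not the code from the video: the function names are made up for illustration, and it assumes ASCII text. The trick is to turn the condition into a 0/1 value and fold it into the arithmetic, so there is no branch for the CPU to mispredict.

```c
#include <stddef.h>

/* Branchy version: the CPU has to predict the `if` for every character. */
void to_upper_branchy(char *s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 'a' && s[i] <= 'z') {
            s[i] = (char)(s[i] - ('a' - 'A'));
        }
    }
}

/* Branchless version: compute a 0/1 flag and multiply it into the offset,
   so every character goes through the same straight-line arithmetic. */
void to_upper_branchless(char *s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int is_lower = (s[i] >= 'a') & (s[i] <= 'z'); /* 0 or 1 */
        s[i] = (char)(s[i] - ('a' - 'A') * is_lower);
    }
}
```

Whether the second one is actually faster depends on the data and the compiler. An optimizer may already turn the first version into branch-free or vectorized code, so measure before trusting either.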
I think it’s a sign our chipsets and tooling have become too complicated. We’re at this weird point where both sides are trying to guess what the other one is doing.
The Mill (seems to be chugging along, if anyone is looking for an update) is an interesting take on the problem.
If you're not familiar, it's an in-development, exposed-pipeline (statically scheduled) VLIW ISA - not unheard of in DSP land, but genuinely alien compared to a conventional out-of-order superscalar processor a la x86 or ARM. I'm not particularly optimistic about their chances of making it into the shops, but they haven't failed yet.