I totally agree with the core of your argument (aarch64 decoding is inherently simpler and more power efficient than x86), but I'll throw out there that it's not quite as bad on x86 as you say, because there are some non-obvious efficiencies (I've been writing a parallel x86 decoder).
What nearly everyone uses is a 16-byte buffer, aligned to the program counter, that gets fed into the first decode stage. This first stage does, yes, have to look at each byte offset as if it could start a new instruction, but it doesn't have to do a full decode; it only finds instruction length information. From there, you feed that length information in and do full decode only on the byte offsets that represent actual instruction boundaries. That's how you end up with x86 cores with '4-wide decode' despite needing to initially look at each byte.
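To make that two-stage split concrete, here's a minimal software sketch of the shape of it. The length rule is completely made up just so the snippet runs (real x86 length decoding has to walk prefixes, opcode maps, ModRM/SIB, displacement and immediate sizes); the point is the structure: a cheap length guess at every offset, then full decode only at the offsets that turn out to be boundaries.

    # Toy model (not real x86 length rules) of two-stage decode over a
    # 16-byte window: stage 1 produces a length at every byte offset in
    # parallel; stage 2 chains lengths from offset 0 and hands only the
    # real instruction boundaries to the expensive full decoders.

    WINDOW = 16

    def toy_length(buf: bytes, off: int) -> int:
        # Made-up rule: low 2 bits of the first byte give a length of 1..4.
        return (buf[off] & 0b11) + 1

    def find_boundaries(buf: bytes, max_wide: int = 4) -> list[int]:
        # Stage 1: in hardware this is 16 small length decoders, one per
        # offset, all working in the same cycle.
        lengths = [toy_length(buf, i) for i in range(len(buf))]

        # Stage 2: walk from offset 0 to pick the offsets that are actual
        # instruction starts; only these get a full decoder slot.
        starts, pc = [], 0
        while pc < len(buf) and len(starts) < max_wide:
            starts.append(pc)
            pc += lengths[pc]
        return starts

    print(find_boundaries(bytes(range(WINDOW))))   # [0, 1, 3, 7]

The '4-wide' is the max_wide cap here: the length pass looks at all 16 offsets, but only up to four confirmed boundaries get full decoders per cycle.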
Now for the efficiencies. The length decoders at the different byte offsets aren't symmetric. Only the length decoder at offset 0 in the buffer has to handle everything; the others can simply flag "I can't handle this", the buffer won't be shifted down past where they were on the next cycle, and the byte-0 decoder can fix up any goofiness. Because of this, they can
* be stripped of instructions that aren't really used much anymore, if that helps them
* be stripped of weird cases like crazy usages of prefix bytes
* skip instructions bigger than their portion of the decode buffer. For instance, a length decoder starting at byte 12 can't handle more than a 4-byte instruction anyway, so that simplifies its logic considerably. That means the simpler length decoders end up feeding into the full-decoder selection higher up the stack, so some of the overhead cancels out in a nice way (there's a rough sketch of this asymmetry after the list).
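Continuing the same toy model, here's roughly what the asymmetry buys you: only the offset-0 length decoder has to cope with everything, the others are allowed to punt by returning None, and the window simply doesn't advance past a punt, so offset 0 cleans it up next cycle. The specific punt conditions (the window-overrun check and treating 0xF0 as "weird prefix stuff") are invented for the sketch.

    # Asymmetric length decoders: offset 0 handles every case, the rest
    # may give up, and giving up just means the window isn't shifted past
    # that byte this cycle. Length rules and the 0xF0 case are made up.

    WINDOW = 16

    def full_length(buf: bytes, off: int) -> int:
        # Offset 0's decoder: must handle everything, however ugly.
        return (buf[off] & 0b111) + 1                  # toy rule, 1..8 bytes

    def cheap_length(buf: bytes, off: int) -> int | None:
        if buf[off] == 0xF0:                           # pretend: prefix-heavy oddball
            return None                                # "I can't handle this"
        length = (buf[off] & 0b111) + 1
        if off + length > WINDOW:                      # would run past my slice
            return None                                # punt; offset 0 gets it next cycle
        return length

    def find_boundaries(buf: bytes, max_wide: int = 4) -> list[int]:
        starts, pc = [], 0
        while pc < WINDOW and len(starts) < max_wide:
            length = full_length(buf, pc) if pc == 0 else cheap_length(buf, pc)
            if length is None:
                break          # stop shifting; this byte sits at offset 0 next cycle
            starts.append(pc)
            pc += length
        return starts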
On top of that, I think that 5% figure includes pieces like the microcode ROMs. Modern ARM cores almost certainly have microcode ROMs as well (albeit much smaller ones) to handle the more complex state transitions.
Once again, totally agreed with your main point, but it's closer than the general public consensus would suggest.
I wonder whether a modern byte-sized instruction encoding would sort of look like Unicode, where every byte is self-synchronizing... I guess it could be even weaker than that; probably only every second or fourth byte needs to synchronize.
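For what it's worth, the UTF-8 property is simple enough to show in a few lines: continuation bytes are recognizable in isolation (they all match 0b10xxxxxx), so from any byte offset you're at most three bytes away from the next code point start. An instruction encoding with an analogous marking on its first byte (or first halfword, for the weaker variants) would let a decoder resynchronize from an arbitrary offset.

    # UTF-8 self-synchronization: continuation bytes look like 0b10xxxxxx,
    # so scanning forward from any offset finds the next start byte quickly.

    def next_start(buf: bytes, off: int) -> int:
        while off < len(buf) and (buf[off] & 0b1100_0000) == 0b1000_0000:
            off += 1
        return off

    s = "héllo✓".encode("utf-8")                       # 9 bytes, 6 code points
    print([next_start(s, i) for i in range(len(s))])   # [0, 1, 3, 3, 4, 5, 6, 9, 9]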
Honestly, I think modern (meaning wide, multiple instruction decoders, and designed today without back compat concerns) and byte-sized are sort of mutually exclusive. Most of those ISAs were designed around 8-bit data buses, and having simple ops only consume a single memory read cycle was pretty paramount to competitive performance. Without that constraint, there's probably better options.
IMO, you would either go towards bit-aligned instructions like the iAPX 432 or the Mill, or 16-bit-aligned variable-width instructions like the S/360 and m68k on the CISC side, and ARM Thumb and RV-C on the RISC side.
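For contrast with the x86 length walk above, the 16-bit-aligned schemes are nearly free to length-decode. RISC-V with the compressed extension, for example, puts it in the low two bits of the first 16-bit parcel (this sketch ignores the reserved longer-than-32-bit encodings):

    # Simplified RISC-V length rule: low two bits != 0b11 means a 16-bit
    # compressed instruction, 0b11 means 32-bit. (Longer encodings are
    # reserved and ignored here.)

    def rvc_length(parcel: int) -> int:
        return 2 if (parcel & 0b11) != 0b11 else 4

    print(rvc_length(0x4501))   # c.li a0, 0            -> 2
    print(rvc_length(0x0513))   # low half of an addi   -> 4

Thumb-2 plays the same kind of game with the top bits of the first halfword.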
That being said, you're definitely thinking about it the right way. Modern I-stream-bandwidth-conscious ISAs absolutely (and perhaps unsurprisingly) look at the problem from a constrained, poor man's Huffman-coding perspective, similar to how UTF-8 was conceived.
Interestingly, Thumb2 was dropped in the move from Arm32 to Arm64. Perhaps the encoding was getting really complicated, would have been even harder with 32 registers, and didn't end up saving much memory (if many instructions use 4 bytes anyway).
Maybe one could come up with an instruction encoding that encodes some number of instructions per cache line. Every time the CPU jumps to a new instruction (at cache line address + index), the whole cache line needs to be loaded into the icache anyway, and could get decoded then; internally instructions get translated into micro-ops anyway.
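Purely as a thought experiment (this is a made-up format, not something any shipping core does as far as I know): give each cache line a small header listing where its instructions start, so decode can fan out over the whole line at fill time, and a branch target becomes (line address, index) instead of a byte address.

    # Hypothetical "N instructions per cache line" format: a header in each
    # 64-byte line lists instruction start offsets, so every instruction in
    # the line is decodable in parallel as soon as the line arrives.

    LINE_SIZE = 64
    HEADER_BYTES = 8                   # made up: up to 8 start offsets, 0xFF = unused

    def decode_line(line: bytes) -> list:
        assert len(line) == LINE_SIZE
        header, body = line[:HEADER_BYTES], line[HEADER_BYTES:]
        starts = [off for off in header if off != 0xFF]
        # No serial length chaining: all starts are known up front, so in
        # hardware these decodes could all happen at line-fill time.
        return [decode_one(body, off) for off in starts]

    def decode_one(body: bytes, off: int):
        return ("uop for offset", off)  # stand-in for real per-instruction decode

The header eats part of every line and instructions can't straddle lines, so it's trading some I-footprint for decode parallelism.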