The various power consumption figures mentioned are 4-10% per core, or 0.5-6% of the package (with the caveat of running with the micro-op cache off) for Zen 2, and 3-10% for Haswell. That's not massive, but it's still far from what I'd consider insignificant; it could give leeway for an extra core or some improved ALUs, or could even, depending on the benchmark, be the difference between Zen 4 and Zen 5 (under the admittedly false assumption of a linear relation between power and performance), which'd essentially be a "free" generational improvement. Of course the reality is gonna be more modest than that, but it's not nothing.
ARM doesn't need variable-length instruction decoding though, which on x86 essentially means that, at the start of the pipeline, the decoder has to attempt a decode at every single byte offset, wasting computation.
Indeed pretty much any architecture can benefit from some form of op cache, but less of a need for it means its size can be reduced (and savings spent in more useful ways), and you'll still need actual decoding at some point anyway (and, depending on the code footprint, may need it a lot).
More generally, throwing silicon at a problem is, quite obviously, a more expensive solution than not having the problem in the first place.
x86 processors simply run an instruction-length predictor, the same way they do branch prediction. That turns the problem into something that can be tuned: instead of having to decode an instruction at every byte offset, you can optimize for the 99% case and keep a slow path for rare combinations.
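The analogy can be made concrete with a toy sketch (purely illustrative, not a model of any real microarchitecture; the table size, indexing scheme, and default guess are all made up):

```python
# Toy length predictor: guess each instruction's length from a small
# table indexed by fetch address, verify the guess against the actual
# decode, and fall back to a slow full re-decode only on a mispredict.
class LengthPredictor:
    def __init__(self, entries: int = 1024):
        self.table = [4] * entries  # hypothetical default guess

    def predict(self, pc: int) -> int:
        return self.table[pc % len(self.table)]

    def train(self, pc: int, actual_length: int) -> None:
        self.table[pc % len(self.table)] = actual_length

predictor = LengthPredictor()
hits = slow_paths = 0
# Pretend the real decoder found these lengths; loop twice to mimic
# re-executing the same code, as in a hot loop.
trace = [(0x1000, 3), (0x1003, 3), (0x1006, 5)]
for _ in range(2):
    for pc, actual in trace:
        if predictor.predict(pc) == actual:
            hits += 1        # fast path: prediction confirmed
        else:
            slow_paths += 1  # slow path: sequential re-decode, then train
            predictor.train(pc, actual)
```

First encounter of each address takes the slow path; every re-execution after that is predicted correctly, which is the "tune for the common case" trade-off.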
RISC doesn't imply wasted instruction space; RISC-V has a particularly interesting answer here - with the compressed ('C') extension you get 16-bit instructions (which you can identify by checking just two bits), and without it you can still save ~6% of icache silicon by storing only 30 bits per instruction, since the remaining two bits are always 1 for non-compressed instructions.
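For reference, here's what that two-bit check looks like (the example encodings in the comments are standard 'C'-extension and base RV encodings):

```python
# RISC-V: the lowest two bits of the first 16-bit parcel give the length.
# Anything other than 0b11 is a 16-bit compressed ('C') instruction;
# 0b11 means a standard 32-bit instruction.
def riscv_insn_length(first_parcel: int) -> int:
    """Instruction length in bytes, from the first 16-bit parcel."""
    return 4 if (first_parcel & 0b11) == 0b11 else 2

# c.addi x8, 16 encodes as 0x0441 -> low bits 0b01 -> compressed
assert riscv_insn_length(0x0441) == 2
# addi x8, x8, 16 encodes as 0x01040413 -> low parcel 0x0413 -> 0b11
assert riscv_insn_length(0x0413) == 4
```

So length determination needs no lookahead and no per-byte-offset guessing, and without 'C' those two bits are constant, which is where the 30-bits-per-instruction icache trick comes from.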
Also, x86 isn't even that efficient with its variable-length instructions - some half of them contain the byte 0x0F, which amounts to "oh no, we're low on single-byte opcodes, prefix new things with 0F". On top of that, general-purpose instructions on 64-bit registers need a REX prefix byte with 4 fixed bits. The VEX prefix (all AVX1/2 instructions) has 7 fixed bits. EVEX (all AVX-512 instructions) spends a full fixed byte.
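To put numbers on the REX case (bit patterns per the Intel manuals; the helper function itself is just for illustration):

```python
# A REX prefix is any byte 0x40-0x4F: the high nibble 0100 is fixed,
# so only the low four bits (the W, R, X, B flags) carry information.
def decode_rex(byte: int):
    """Return the (W, R, X, B) flags if `byte` is a REX prefix, else None."""
    if (byte & 0xF0) != 0x40:
        return None
    return tuple((byte >> shift) & 1 for shift in (3, 2, 1, 0))

# 0x48 is the REX.W prefix seen on most 64-bit GPR instructions:
# only W is set, and half the prefix byte carried no information at all.
assert decode_rex(0x48) == (1, 0, 0, 0)
assert decode_rex(0x90) is None  # plain NOP, not a REX prefix
```

That's a whole byte spent per 64-bit instruction to convey at most four bits of payload, before you even get to the opcode.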
ARM64 instructions are 4 bytes. x86 instructions in real-world code average 4.25 bytes. ARM64 gets closer to x86 code size as it adds new instructions to replace common instruction sequences.
RISC-V has 2-byte and 4-byte instructions and averages very close to 3 bytes. Despite this, the original compressed code was only around 15% denser than x86. The addition of the B (bit-manipulation) extension and Zcb has increased that advantage by quite a lot, and as further extensions get added, I'd expect the lead to keep growing.
x86-64 wastes enough of its encoding space that arm64 binaries are typically smaller in practice. The RISC-V folks pointed this out a decade ago - geomean across their SPEC suite, x86 binaries were 7.3% larger than arm64.
So there's another small factor leaning against x86 - inferior code density means x86 cores get less out of their icache than ARM64 designs, purely due to ISA legacy cruft. And ARM64 chips often have larger icaches anyway - M1's icache is 6x the size of Zen 4's iirc, and it gets more out of it thanks to better code density.
That stuff is WAY out of date and was flatly wrong even when it was published.
The A715 cut decoder size a whopping 75% by dropping the more CISC-like 32-bit stuff, and completely eliminated the uop cache too. Losing all that decode hardware, the cache, and the cache controllers means a big reduction in power consumption (decoders are basically always on). All of ARM's latest CPU designs have eliminated the uop cache for the same reason.
At the time of publication, we already knew that the M1 (out for nearly a year by then) was the highest-IPC chip ever made, and it did not use a uop cache.