Instruction decoding is more power-efficient on ARM, but x86 has solved it as a perf bottleneck with trace/uop caches and by doing some speculative work in the decoders. (Parallel decoding is also old hat and not an M1 or ARM-land invention; it's trivial with a RISC-style insn format.) What other tricks do you have in mind?
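To make the parallel decoding point concrete, here's a rough sketch of why fixed-width formats make it trivial. The `toy_length` encoding is made up purely for illustration (it is not x86 or any real ISA), and all function names are hypothetical. With fixed-width instructions every start offset is known up front, so N decoders can each grab their own slot; with variable-length instructions each start offset depends on the length of the previous instruction, which is the part x86 decoders have to speculate around.

```python
def fixed_width_boundaries(code: bytes, width: int = 4) -> list[int]:
    """Instruction start offsets for a fixed-width encoding.

    Each offset is computable independently of the others, so decoders
    can work on all of them at once.
    """
    return list(range(0, len(code), width))


def toy_length(code: bytes, pos: int) -> int:
    """Hypothetical toy encoding: low 2 bits of the first byte, plus 1,
    give the instruction length in bytes (1..4). Not a real ISA."""
    return (code[pos] & 0b11) + 1


def variable_length_boundaries(code: bytes, length_of=toy_length) -> list[int]:
    """Instruction start offsets for a variable-length encoding.

    Each offset depends on the length of the previous instruction, so a
    naive implementation is a serial chain.
    """
    offsets = []
    pos = 0
    while pos < len(code):
        offsets.append(pos)
        pos += length_of(code, pos)
    return offsets
```

Real x86 frontends break that serial chain by guessing lengths at many byte positions in parallel and discarding the wrong guesses, or by caching already-decoded uops so the decoders are bypassed entirely.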
More broadly, as to why the ISA doesn't make a big difference: the major differences are at the microarchitecture level, since OoO processors have such flexible dataflow machinery in them that you can kind of view the frontend as compiler technology. x86 and ARM are decades-old ISAs that have seen many, many rounds of iteration in the form of added instructions and even backwards-incompatible reboots at the 64-bit transition points, so most hindrances have been fixed.
In the olden days ISAs were important because processors were orders of magnitude simpler, and instructions were processed as-is, very statically (to the point that microarchitectural artifacts like branch delay slots were enshrined in some ISAs). This meant that, e.g., the complexity of individual instructions could be a bottleneck to how fast a chip could be clocked. Or in CISC land your ISA might have been so complex that the CPU was a microcoded implementation of the ISA and didn't have any hardwired fast instructions...