In terms of decoding bandwidth I'm not sure how many instructions it can actually sustain, but it's not like it's 10x - M1 is basically a very wide version of a tried and tested formula rather than a wholly new thing.
x86 decoders are massive. They are about the same size as the integer ALUs in current designs. I think it was an AnandTech interview a couple of years ago where someone from AMD said that wider decoders were a no-go because of the excessive power consumption relative to the performance increase. I’m sure they’ve looked into this exact idea many times from many different angles.
ARM's uniform 32-bit instructions make the decoder trivial in comparison. To parallelize 10kB worth of instructions across 8 decoders, you just read 32 bytes into the decoders, jump ahead 32 bytes, and do it again (yes, it’s slightly more complex than that, but not by much).
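To make the fixed-width case concrete, here's a rough Python sketch (purely illustrative, nothing like actual hardware; the 32-byte fetch window and decoder count are just the numbers from above): each decode slot can compute where its instruction starts independently of every other slot.

    # Hypothetical 32-byte fetch window; the byte values are just filler.
    fetch_window = bytes(range(32))

    def arm_style_decode(window):
        # Decoder i knows its instruction sits at offset 4*i, so all 8
        # decoders can grab their 4 bytes in parallel with no dependencies.
        return [window[4 * i : 4 * i + 4] for i in range(8)]

    print(arm_style_decode(fetch_window))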
x86 instructions are 1-15 bytes. How do you split the stream up to ensure minimal overlap and that one decoder isn’t bottlenecking the processor? How do you speed up parsing when you have to find instruction boundaries one byte at a time? The uop cache and some very interesting parsing strategies help (there are a couple of public papers on those topics from x86 designers). They can’t eliminate the waste or latency issues, though.
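Contrast that with a sketch of the variable-length case (again just illustrative Python; the length_of function is a made-up stand-in for the real prefix/opcode/ModRM length logic): you can't know where instruction i+1 starts until you've worked out how long instruction i is, so the naive boundary walk is inherently serial.

    def x86_style_boundaries(window, length_of):
        # Each step depends on the previous instruction's decoded length,
        # which is what makes wide parallel x86 decode hard.
        offsets, pos = [], 0
        while pos < len(window):
            offsets.append(pos)
            pos += length_of(pos)
        return offsets

    # Toy encoding for illustration only: pretend the first byte is the length.
    window = bytes([3, 0, 0, 5, 0, 0, 0, 0, 2, 0])
    print(x86_style_boundaries(window, lambda pos: window[pos]))  # [0, 3, 8]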
What is amazing to me is their efficiency despite the limitations. When you look at their massive 64-core chips and account for all the extra cache, IO, and interconnect necessary, it seems like scaling the M1 up to those levels would leave it less power efficient by 20% or so.