It’s mainly (I think) the distribution of instruction lengths. x86 instructions can be anywhere from 1 to 15 bytes long. A decoder wants to decode multiple instructions per cycle, and it generally does this in parallel, by simultaneously decoding at multiple starting points. With a fixed-length ISA, to decode n instructions, you just decode them. With x86, if you simultaneously decode at offsets 0, 1, …, 7, you have 8 decoders but are only likely to get a couple of correct instructions out; the rest start in the middle of an instruction and have to be discarded. So you either need many more parallel decoders for the same throughput, or a more complex system to avoid throwing away so much work.
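To make the wasted work concrete, here's a toy model in C. The length encoding is made up (the low bits of the first byte give the length), not real x86, and `toy_length` is just a stand-in: decode speculatively at every offset of an 8-byte fetch window, then keep only the decodes that landed on real instruction boundaries.

```c
/* Toy model of brute-force wide decode on a variable-length ISA, to show
 * the wasted work: speculatively decode at every byte offset of a fetch
 * window, then keep only the decodes on real instruction boundaries.
 * The length function is a stand-in for a made-up ISA, not real x86.     */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define WINDOW 8

/* Stand-in length decoder: low 4 bits of the first byte give the length
 * (1..15 bytes, mirroring x86's 1..15 byte range).                        */
static size_t toy_length(const uint8_t *p)
{
    size_t n = p[0] & 0x0F;
    return n ? n : 1;
}

int main(void)
{
    uint8_t window[WINDOW] = {0x03, 0xAA, 0xBB, 0x02, 0xCC, 0x01, 0x04, 0xDD};
    size_t  len[WINDOW];
    bool    valid[WINDOW] = {false};

    /* 1. Speculative decode at every offset (all in parallel in hardware). */
    for (size_t i = 0; i < WINDOW; i++)
        len[i] = toy_length(&window[i]);

    /* 2. Resolve the real boundaries serially from offset 0; only those
     *    decodes are kept, the rest started mid-instruction.               */
    for (size_t pos = 0; pos < WINDOW; pos += len[pos])
        valid[pos] = true;

    for (size_t i = 0; i < WINDOW; i++)
        printf("offset %zu: %s\n", i, valid[i] ? "kept" : "discarded");
    return 0;
}
```

In this example half of the eight speculative decodes get thrown away, which is the throughput penalty (or area penalty, if you add more decoders to compensate) being described above.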
I’m sure this is doable, but I would certainly count it as “complex”.
But fundamentally, a given chip, dedicating a given area to the task, can only begin decoding at so many positions per cycle. And the more intelligent it tries to be about where to start decoding, the more of that cycle it burns before decoding can even begin.
And one nastiness about x86 is that you have to decode pretty far into an instruction to even determine its likely length. You can’t do something simple like looking up the likely length in a table indexed by the first byte of an instruction, because prefixes, escape bytes, ModRM/SIB, displacements, and immediates can all affect it.
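To make that concrete, here's a heavily simplified sketch in C of the length-determination dependency chain. Only two opcodes are handled, 64-bit mode is assumed, and 0x0F escapes, VEX, and EVEX are ignored entirely, so treat it as illustrative rather than a real decoder.

```c
/* Heavily simplified sketch of x86 length determination, to show the
 * serial dependency chain: prefixes -> (REX) -> opcode -> ModRM -> SIB ->
 * displacement -> immediate. Two opcodes only, 64-bit mode assumed,
 * no 0x0F escapes / VEX / EVEX.                                           */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67:                       /* operand/address size    */
    case 0xF0: case 0xF2: case 0xF3:            /* LOCK, REPNE, REP        */
    case 0x26: case 0x2E: case 0x36: case 0x3E:
    case 0x64: case 0x65:                       /* segment overrides       */
        return true;
    default:
        return false;
    }
}

/* Returns the instruction length in bytes, or 0 for opcodes this sketch
 * doesn't know about (i.e. almost all of them).                           */
size_t x86_length_sketch(const uint8_t *p)
{
    size_t i = 0;
    bool opsize_66 = false;

    /* 1. Can't know anything until all prefixes have been skipped.        */
    while (is_legacy_prefix(p[i])) {
        if (p[i] == 0x66)
            opsize_66 = true;
        i++;
    }

    /* 2. REX (0x40..0x4F) is only known to be a prefix after step 1.      */
    bool rex_w = false;
    if ((p[i] & 0xF0) == 0x40) {
        rex_w = (p[i] & 0x08) != 0;
        i++;
    }

    uint8_t opcode = p[i++];

    if (opcode == 0x01) {               /* ADD r/m32, r32                  */
        /* 3. Only the ModRM byte tells you whether a SIB byte and a
         *    displacement follow, and how big the displacement is.        */
        uint8_t modrm = p[i++];
        uint8_t mod = modrm >> 6, rm = modrm & 7;

        if (mod != 3 && rm == 4) {              /* SIB byte present        */
            uint8_t sib = p[i++];
            if (mod == 0 && (sib & 7) == 5)     /* SIB with no base        */
                i += 4;
        }
        if (mod == 1)                           /* disp8                   */
            i += 1;
        else if (mod == 2)                      /* disp32                  */
            i += 4;
        else if (mod == 0 && rm == 5)           /* RIP-relative disp32     */
            i += 4;
        return i;
    }

    if (opcode >= 0xB8 && opcode <= 0xBF) {     /* MOV reg, imm            */
        /* 4. The immediate size depends on prefixes seen back in steps
         *    1 and 2 (imm16 with 0x66, imm64 with REX.W, else imm32).     */
        return i + (rex_w ? 8 : (opsize_66 ? 2 : 4));
    }

    return 0;
}
```

Even in this stripped-down form, the length falls out of a chain of dependent lookups rather than a single table indexed by the first byte, which is exactly what makes wide x86 decode expensive.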
I wonder whether modern chips have pipelined decoders.
All modern chips have pipelined decoders, including ARM ones. For example, the Cortex-A72 has three decode stages, and that's for a 3-wide decoder running at relatively low clock speeds.
So you could conceivably afford a wide frontend if you restricted x86 to a subset (64-bit only, drop a bunch of weird CISC-y instructions).