I’m sure this is doable, but I would certainly count it as “complex”.
But fundamentally, a given chip, dedicating a given area to the task, can only begin decoding at so many positions per cycle. And the more intelligent it tries to be about where to start decoding, the longer into the cycle it has to wait before it can begin.
And one nastiness about x86 is that you have to decode fairly far into an instruction just to determine its length: prefixes, the opcode, the ModRM byte, an optional SIB byte, and any displacement or immediate fields all affect it. You can't do something simple like looking up the length in a table indexed by the first byte of the instruction.
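To make that concrete, here's a toy sketch (not a real decoder) of computing the length of just one x86 opcode, 0x83 (ALU on r/m32 with an 8-bit immediate). Even for this single opcode, the length depends on the ModRM byte that follows, which may in turn pull in a SIB byte and a displacement, so the first byte alone tells you nothing about where the next instruction starts:

```python
def length_0x83(insn: bytes) -> int:
    """Length of an instruction starting with opcode 0x83
    (simplified: no prefixes, 64-bit addressing assumed)."""
    assert insn[0] == 0x83
    modrm = insn[1]
    mod = modrm >> 6
    rm = modrm & 7
    n = 2                       # opcode + ModRM
    if mod != 3 and rm == 4:    # SIB byte present
        n += 1
    if mod == 1:                # 8-bit displacement
        n += 1
    elif mod == 2:              # 32-bit displacement
        n += 4
    elif mod == 0 and rm == 5:  # RIP-relative: 32-bit displacement
        n += 4
    return n + 1                # plus the imm8

# Same first byte, different lengths:
print(length_0x83(bytes([0x83, 0xC0, 0x05])))              # add eax, 5            -> 3 bytes
print(length_0x83(bytes([0x83, 0x44, 0x24, 0x10, 0x05])))  # add dword [rsp+16], 5 -> 5 bytes
```

Multiply this by hundreds of opcodes, plus prefixes that can change operand size (and hence immediate width), and you can see why a wide x86 decoder either speculates on instruction boundaries or brute-forces decoding at many byte offsets in parallel.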
I wonder whether modern chips have pipelined decoders.
All modern chips have pipelined decoders, including ARM ones. For example, the Cortex-A72 has three decode stages, even though it's only a 3-wide decoder running at relatively low clock speeds.
IIRC Jim Keller said in some interview that modern x86 decoders use prediction and speculation to guess instruction boundaries (similar in spirit to branch prediction), and it works surprisingly well.