I'm pretty sure the pattern of allowing the branch predictor to run ahead is fairly common.
At least, it's common to have multi-level branch predictors that take a variable number of cycles to return a result, and it makes a lot of sense to queue up predictions so they are ready when the decoder gets to that point.
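To make that concrete, here's a toy sketch (Python, not modeled on any real core) of a decoupled front end: a predictor with variable lookup latency runs ahead and fills a small fetch-target queue, which the decoder drains one block per cycle. Every name here (RunAheadPredictor, the queue depth of 8, the 1-3 cycle latencies, the 64-byte fetch blocks) is made up for illustration.

```python
# Toy model of a decoupled front end (illustration only, not any real core):
# the branch predictor runs ahead of the decoder and queues up predicted
# fetch targets, so its variable lookup latency doesn't stall decode.
from collections import deque
import random

class RunAheadPredictor:
    """Pretend multi-level predictor whose lookups take 1-3 cycles."""
    def __init__(self, start_pc=0x1000):
        self.next_pc = start_pc
        self.pending = None  # (ready_cycle, predicted_pc)

    def tick(self, cycle, ftq, depth=8):
        # A finished lookup drops its prediction into the fetch-target queue.
        if self.pending and cycle >= self.pending[0]:
            ftq.append(self.pending[1])
            self.pending = None
        # Start the next lookup if idle and the queue still has room to run ahead.
        if self.pending is None and len(ftq) < depth:
            latency = random.choice([1, 1, 2, 3])  # e.g. fast vs. slow predictor level
            self.pending = (cycle + latency, self.next_pc)
            self.next_pc += 0x40                   # next predicted 64-byte fetch block

def simulate(cycles=20):
    ftq = deque()                 # predictions waiting for the decoder
    predictor = RunAheadPredictor()
    decoded = []
    for cycle in range(cycles):
        predictor.tick(cycle, ftq)
        if ftq:                   # decoder consumes one queued block per cycle
            decoded.append((cycle, hex(ftq.popleft())))
    return decoded

if __name__ == "__main__":
    for cycle, block in simulate():
        print(f"cycle {cycle:2d}: decode block at {block}")
```

The point is just that as long as the predictor stays ahead on average, its variable latency hides behind the queue instead of stalling decode.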
But I doubt the idea of parallel decoders makes much sense outside of x86's complex variable-length instructions.
It (probably) makes sense on x86 because x86 cores were already spending a bunch of power on instruction decoding and the uop cache.
> Plus i would assume it makes the cost of miss-predict even higher.
It shouldn't increase the mispredict cost by much.
The new fetch address will bypass the branch-prediction queue and feed directly into one of the three decoders. And previous implementations already have a uop queue between the decoder and rename/dispatch; it gets flushed on a mispredict, and the first three uops should be able to cross it in a single cycle.
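Roughly, and still under the same toy model (names like uop_queue and decode_clusters are hypothetical, and "three decoders" comes from the discussion above rather than any documented design), the recovery path would look like this: flush the queued predictions and the decode-to-rename uop queue, then steer the corrected fetch address straight at a free decoder instead of letting it wait behind stale entries.

```python
# Continuing the same toy model (names like uop_queue and decode_clusters are
# made up): on a mispredict, run-ahead predictions and wrong-path uops are
# thrown away, and the corrected fetch address goes straight to a decoder.
from collections import deque

def handle_mispredict(correct_pc, ftq, uop_queue, decode_clusters):
    ftq.clear()        # stale predicted targets are useless now
    uop_queue.clear()  # wrong-path uops must not reach rename/dispatch
    # Bypass the prediction queue: steer the corrected fetch address at
    # whichever decode cluster frees up first.
    cluster = min(decode_clusters, key=lambda c: c["busy_until"])
    cluster["next_fetch_pc"] = correct_pc
    return cluster

if __name__ == "__main__":
    ftq = deque([0x1040, 0x1080])          # stale run-ahead predictions
    uop_queue = deque(["uop_a", "uop_b"])  # wrong-path uops awaiting rename
    clusters = [{"id": i, "busy_until": i, "next_fetch_pc": None} for i in range(3)]
    chosen = handle_mispredict(0x2000, ftq, uop_queue, clusters)
    print(f"redirected decoder {chosen['id']} to {hex(chosen['next_fetch_pc'])}")
```

So the extra cost over a conventional front end is basically just emptying one more (already stale) queue, not a longer recovery path.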