I'm pretty sure the pattern of allowing the branch predictor to run ahead is fairly common.
At least, it's common to have multi-level branch predictors that take a variable number of cycles to return a result, and it makes a lot of sense to queue up predictions so they are ready when the decoder gets to that point.
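To make that concrete, here's a toy sketch (Python, not modeled on any real core) of a decoupled front end: a predictor with variable lookup latency runs ahead and fills a small fetch-target queue, which the decoder drains one block per cycle. Every name here (RunAheadPredictor, the queue depth of 8, the 1-3 cycle latencies, the 64-byte fetch blocks) is made up for illustration.

```python
# Toy model of a decoupled front end (illustration only, not any real core):
# the branch predictor runs ahead of the decoder and queues up predicted
# fetch targets, so its variable lookup latency doesn't stall decode.
from collections import deque
import random

class RunAheadPredictor:
    """Pretend multi-level predictor whose lookups take 1-3 cycles."""
    def __init__(self, start_pc=0x1000):
        self.next_pc = start_pc
        self.pending = None  # (ready_cycle, predicted_pc)

    def tick(self, cycle, ftq, depth=8):
        # A finished lookup drops its prediction into the fetch-target queue.
        if self.pending and cycle >= self.pending[0]:
            ftq.append(self.pending[1])
            self.pending = None
        # Start the next lookup if idle and the queue still has room to run ahead.
        if self.pending is None and len(ftq) < depth:
            latency = random.choice([1, 1, 2, 3])  # e.g. fast vs. slow predictor level
            self.pending = (cycle + latency, self.next_pc)
            self.next_pc += 0x40                   # next predicted 64-byte fetch block

def simulate(cycles=20):
    ftq = deque()                 # predictions waiting for the decoder
    predictor = RunAheadPredictor()
    decoded = []
    for cycle in range(cycles):
        predictor.tick(cycle, ftq)
        if ftq:                   # decoder consumes one queued block per cycle
            decoded.append((cycle, hex(ftq.popleft())))
    return decoded

if __name__ == "__main__":
    for cycle, block in simulate():
        print(f"cycle {cycle:2d}: decode block at {block}")
```

The point is just that as long as the predictor stays ahead on average, its variable latency hides behind the queue instead of stalling decode.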
But I doubt the idea of parallel decoders makes much sense outside of x86's complex variable-length instructions.
It (probably) makes sense on x86 because x86 cores were already spending a bunch of power on instruction decoding and the uop cache.
> Plus i would assume it makes the cost of miss-predict even higher.
It shouldn't increase the mispredict cost by much.
The new fetch address will bypass the branch-prediction queue and feed directly into one of the three decoders. And previous implementations already have a uop queue between the decoder and rename/dispatch; it gets flushed on a mispredict, and the first three uops should be able to cross it in a single cycle.
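Roughly, and still under the same toy model (names like uop_queue and decode_clusters are hypothetical, and "three decoders" comes from the discussion above rather than any documented design), the recovery path would look like this: flush the queued predictions and the decode-to-rename uop queue, then steer the corrected fetch address straight at a free decoder instead of letting it wait behind stale entries.

```python
# Continuing the same toy model (names like uop_queue and decode_clusters are
# made up): on a mispredict, run-ahead predictions and wrong-path uops are
# thrown away, and the corrected fetch address goes straight to a decoder.
from collections import deque

def handle_mispredict(correct_pc, ftq, uop_queue, decode_clusters):
    ftq.clear()        # stale predicted targets are useless now
    uop_queue.clear()  # wrong-path uops must not reach rename/dispatch
    # Bypass the prediction queue: steer the corrected fetch address at
    # whichever decode cluster frees up first.
    cluster = min(decode_clusters, key=lambda c: c["busy_until"])
    cluster["next_fetch_pc"] = correct_pc
    return cluster

if __name__ == "__main__":
    ftq = deque([0x1040, 0x1080])          # stale run-ahead predictions
    uop_queue = deque(["uop_a", "uop_b"])  # wrong-path uops awaiting rename
    clusters = [{"id": i, "busy_until": i, "next_fetch_pc": None} for i in range(3)]
    chosen = handle_mispredict(0x2000, ftq, uop_queue, clusters)
    print(f"redirected decoder {chosen['id']} to {hex(chosen['next_fetch_pc'])}")
```

So the extra cost over a conventional front end is basically just emptying one more (already stale) queue, not a longer recovery path.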