There seems to be a very common misconception about branch prediction, that its only job is to predict the direction of the branch.
In reality, the problem is so much deeper. The instruction fetch stage simply can't see the branch at all. Not just conditional branches, but unconditional jumps, calls and even returns too.
Even a simple 5-stage "classic RISC" pipeline takes a full two cycles to load the instruction from memory and decode it before it can see the branch, by which point your instruction fetch stage has already fetched two incorrect instructions (though many RISC implementations cheat with an instruction cache fetch that takes half a cycle, plus a delay slot).
In one of these massive out-of-order CPUs, the icache fetch might take multiple cycles (followed by length decoding on x86), so it might be 4 or 5 cycles before the instruction could possibly be decoded. If you are decoding 4 instructions per cycle, that's up to 20 incorrect instructions fetched from icache.
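The arithmetic behind those figures is just "cycles until decode can see the branch" times "instructions fetched per cycle". A tiny sketch, where the function name `wasted_slots` and the parameter values are illustrative assumptions rather than measurements of any real CPU:

```python
def wasted_slots(cycles_until_decode: int, width: int) -> int:
    """Instructions fetched down the wrong path before decode can spot the branch."""
    return cycles_until_decode * width


# Classic 5-stage RISC: one instruction per cycle, branch visible after 2 cycles.
print(wasted_slots(2, 1))  # 2

# Wide out-of-order core: 4 instructions per cycle, branch visible after 5 cycles.
print(wasted_slots(5, 4))  # 20
```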
To actually continue fetching without any gaps, the branch predictor needs to predict:
1. The location of the branch
2. The type of branch, and (for conditional branches) whether it's taken or not
3. The destination of the branch
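To make that concrete, here is a rough sketch of a fetch-time lookup that answers those questions from the fetch address alone, before any instruction bytes have been decoded. It is a toy, not how any particular CPU does it: the names (`BranchTargetBuffer`, `BTBEntry`, `next_fetch`), the 16-byte fetch block and the one-branch-per-block limit are all assumptions for illustration, and a real design would use a small set-associative table plus a return address stack for returns.

```python
from dataclasses import dataclass
from enum import Enum, auto

FETCH_BYTES = 16  # assumed fetch block size


class BranchType(Enum):
    CONDITIONAL = auto()
    JUMP = auto()
    CALL = auto()
    RETURN = auto()


@dataclass
class BTBEntry:
    branch_pc: int    # where in the fetch block the branch sits
    kind: BranchType  # what kind of branch it is
    target: int       # where it went last time


class BranchTargetBuffer:
    """Toy BTB keyed by fetch-block number (an unbounded dict, one branch per block)."""

    def __init__(self):
        self.entries = {}  # block number -> BTBEntry

    def train(self, branch_pc, kind, target):
        # Called once a branch has actually been decoded and resolved.
        self.entries[branch_pc // FETCH_BYTES] = BTBEntry(branch_pc, kind, target)

    def next_fetch(self, fetch_pc, predict_taken=lambda pc: True):
        """Guess the next fetch address using nothing but the current one."""
        block = fetch_pc // FETCH_BYTES
        sequential = (block + 1) * FETCH_BYTES
        entry = self.entries.get(block)
        if entry is None or entry.branch_pc < fetch_pc:
            return sequential  # no known branch at or after fetch_pc in this block
        if entry.kind is BranchType.CONDITIONAL and not predict_taken(entry.branch_pc):
            return sequential  # known branch, predicted not taken
        return entry.target


btb = BranchTargetBuffer()
btb.train(0x1008, BranchType.JUMP, 0x2000)
print(hex(btb.next_fetch(0x1000)))  # redirected to the remembered target
print(hex(btb.next_fetch(0x3000)))  # unknown block, so fall through
```

The `predict_taken` callback stands in for whatever direction predictor sits next to the table; the point is only that location, type and target all have to come out of a lookup keyed on the fetch address.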
Follow-up question: if the branch is predicted to not be taken, why does the predictor have to use resources to record its location and destination?
These predictors change their prediction (both direction and destination) based on the history of the last few hundred branches and whether each one was taken or not taken. So the predictor needs to know where those branches were, even the ones that weren't taken.
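A minimal sketch of why the not-taken outcomes matter, using a gshare-style predictor rather than TAGE (gshare is a much simpler scheme, swapped in here because it fits in a few lines). The class name, table sizes and the test pattern are assumptions for illustration. Note that `update` shifts every outcome into the history, taken or not:

```python
class GsharePredictor:
    """Toy gshare: 2-bit counters indexed by PC XOR global branch history."""

    def __init__(self, history_bits=8, table_bits=10):
        self.history = 0
        self.history_mask = (1 << history_bits) - 1
        self.table_mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)  # start weakly not-taken

    def _index(self, pc):
        return (pc ^ self.history) & self.table_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Every outcome, taken or not, becomes part of the history.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def mispredicts_on_alternating(predictor, pc=0x40, total=200, measured=100):
    """Run a branch that alternates taken/not-taken; count late mispredictions."""
    wrong = 0
    for i in range(total):
        taken = (i % 2 == 0)
        if i >= total - measured and predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


print(mispredicts_on_alternating(GsharePredictor()))
```

After a short warm-up the history alone distinguishes "about to be taken" from "about to be not taken", so the alternating branch stops mispredicting. Drop the not-taken outcomes from the history and the two cases become indistinguishable.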
Indirect TAGE predictors are very powerful. They can often correctly predict the targets of jump tables and virtual function calls.
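To give a feel for why history helps with indirect branches, here is a toy comparison between "predict the last target" and a table keyed on the branch address plus the last few targets. This is explicitly not ITTAGE: it uses one fixed history length and exact tuples as keys instead of several hashed, tagged tables. All names and the three-target call pattern are made up for illustration.

```python
class LastTargetPredictor:
    """Baseline: assume an indirect branch goes wherever it went last time."""

    def __init__(self):
        self.table = {}

    def predict(self, pc):
        return self.table.get(pc)

    def update(self, pc, target):
        self.table[pc] = target


class HistoryIndirectPredictor:
    """Toy history-based predictor: key on (pc, last few indirect targets)."""

    def __init__(self, history_len=4):
        self.history_len = history_len
        self.recent_targets = ()
        self.table = {}

    def predict(self, pc):
        return self.table.get((pc, self.recent_targets))

    def update(self, pc, target):
        self.table[(pc, self.recent_targets)] = target
        self.recent_targets = (self.recent_targets + (target,))[-self.history_len:]


def late_mispredicts(predictor, pc=0x400, total=60, measured=30):
    """One virtual call site cycling through three targets; count late misses."""
    targets = [0xA000, 0xB000, 0xC000]
    wrong = 0
    for i in range(total):
        actual = targets[i % 3]
        if i >= total - measured and predictor.predict(pc) != actual:
            wrong += 1
        predictor.update(pc, actual)
    return wrong


print(late_mispredicts(LastTargetPredictor()))
print(late_mispredicts(HistoryIndirectPredictor()))
```

The last-target baseline misses every call because consecutive targets always differ, while the history-keyed table settles once it has seen each history pattern once.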
In general, branch predictors don't utilise their tables very efficiently. Cheap and fast lookups are way more important than minimising size.