It looks like the next generation of *mont cores will be as big and capable as Skylakes. With E cores like this, who needs P cores?
Also, Intel was definitely onto something with the split decoders IMO. The x86 instruction set hurts to decode 8-wide in a single thread, but most code is branchy and loopy, so you only hurt in this configuration if loops are really big. Tight loops come from the uop cache, and branchy code gets 3-way decoding.
> Tight loops come from the uop cache, and branchy code gets 3-way decoding.
First, there is no uop cache on the "mont" cores.
Second, Intel aren't decoding both sides of the branch.
That wouldn't actually help much, as modern branch predictors are correct well over 99% of the time. It would be a waste of silicon and power to have an extra decoder producing work that simply gets discarded most of the time, and an even bigger waste to have two extra decoders.
Intel's actual approach is way more clever: they run the branch predictor ahead of the decoders by at least 3 branches (probably more). The branch predictor can spit out a new prediction every cycle, and it just plops them on a queue.
Each of the three decoders pops a branch prediction off the queue and starts decoding there. At any time, all three decoders will each be decoding a different basic block, one that the branch predictor has predicted the program counter is about to flow through. The three decoders are leapfrogging each other. The decoding of each basic block is limited to a throughput of three instructions per cycle, but Skymont is decoding three basic blocks in parallel.
The decoded uops get pushed onto three independent queues, and the re-namer/dispatcher merges these three queues back together in original program order before dispatching to the backend. Each decoder can only push three uops per cycle onto its queue, but the re-namer/dispatcher can pull them off a single queue at the rate of 9 uops per cycle. The other two queues will continue to fill up while one queue is being drained.
The branch prediction result will always land on an instruction boundary, so this design allows the three decoders to combine their efforts and maintain a throughput of 9 uops per cycle, as long as the code is branchy enough. It works on loops too; as far as I'm aware, Intel doesn't even have a loop stream buffer on this design: the three decoders will be decoding the exact same instructions in parallel for loop bodies.
But Intel have a neat trick to make this work even on code without branches or loops. The branch predictor actually inserts fake branches into the middle of long basic blocks. The branch predictor isn't actually checking an address to see if it has a branch; instead it predicts the gap between branches, and they simply cap the size of those gaps. It looks like that limit for Skymont is 64 bytes (it was previously 32 bytes for Crestmont).
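To make the leapfrogging concrete, here's a toy C model of the scheme as I understand it. Everything in it is illustrative (the block sizes, the widths, how predictions are handed out); it only counts decode throughput and doesn't model the rename-side merge or mispredicts:

    #include <stdio.h>

    #define DECODERS 3
    #define DECODE_WIDTH 3              /* instructions per decoder per cycle */

    int main(void) {
        /* Predicted basic blocks (instruction counts), in program order. */
        int block_len[] = {5, 2, 7, 3, 4};
        int nblocks = sizeof block_len / sizeof block_len[0];

        int next_block = 0;             /* next prediction to hand out */
        int assigned[DECODERS];         /* which block each decoder holds, -1 = idle */
        int done[DECODERS] = {0};       /* instructions decoded so far in that block */
        for (int d = 0; d < DECODERS; d++) assigned[d] = -1;

        int total = 0, decoded = 0, cycles = 0;
        for (int b = 0; b < nblocks; b++) total += block_len[b];

        while (decoded < total) {
            cycles++;
            for (int d = 0; d < DECODERS; d++) {
                if (assigned[d] < 0 && next_block < nblocks) {
                    assigned[d] = next_block++;          /* pop the next predicted block */
                    done[d] = 0;
                }
                if (assigned[d] >= 0) {
                    int left = block_len[assigned[d]] - done[d];
                    int n = left < DECODE_WIDTH ? left : DECODE_WIDTH;
                    done[d] += n;                        /* uops pushed onto this decoder's queue */
                    decoded += n;
                    if (done[d] == block_len[assigned[d]]) assigned[d] = -1;
                }
            }
        }
        printf("%d instructions in %d cycles (peak %d/cycle)\n",
               decoded, cycles, DECODERS * DECODE_WIDTH);
        return 0;
    }

With enough predicted blocks in flight, the three 3-wide decoders together sustain close to 9 instructions per cycle even though no single block is decoded faster than 3 per cycle.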
Thank you for that explanation, I was confused as to what was happening with the multiple decoders. That's a wild way to implement a processor front end.
Oh, I thought the uop queues were uop caches when I looked at the diagram. Not having loop handling does seem off, but I guess with long loops they will just alternate between the decoders.
The whole 99% branch prediction thing is sort of misleading - most branches are loop branches taken a constant number of times, so they are perfectly predictable, and most of the others are error checks, which are also easy to predict. However, a large amount of comparative wall time in code is spent on sequences of a short piece of code and a branch that is hard to predict. Without hyperthreading, I would assume that decoding both sides of the branch would actually help a lot in these circumstances. It sounds like Intel is possibly capable of doing that.
The synthetic basic blocks are also an interesting idea given how hard it is to figure out where an x86 instruction boundary is. It's easy to split a basic block when you have a branch going to that basic block, but if you just synthetically insert a split some distance down, you may be misaligned with the actual instruction stream. That can be self-synchronizing at points, but it's hard.
I do agree. The fact that those uop queues are already there and Intel isn't using them as a loop buffer does make me ask questions. Have they just not gotten around to it? Have they decided it's not worth the power savings? Maybe they are aiming for simplicity?
> However, a large amount of comparative wall time in code is spent on sequences of a short piece of code and a branch that is hard to predict.
The thing is, any time the branch predictor has at least one correct prediction, the decode throughput doubles to 6 IPC. And if it gets two correct predictions in a row, the IPC triples to 9.
I'm not sure how many cycles the "execute both sides of the branch" approach would save on a mispredict, but your basic blocks would need to be very short and the prediction accuracy would need to be very low (like 50% or lower) before it could actually outperform the leapfrogging decoder approach on those sequences of code.
> but if you just synthetically insert a split some distance down, you may be misaligned with the actual instruction stream
It only inserts the splits after decoding, so they will always be at the correct alignment.
Branches that are hard to predict should be “hand-optimized” through specially written code, compiler intrinsics/annotations, or profile-guided feedback to tell the compiler to emit the conditional as branchless code using instructions like cmov. Expecting a CPU to detect this at runtime may be asking it to do too much.
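For concreteness, here's the kind of rewrite I mean, using the classic branchless binary search (a lower_bound). The ternary is written so the compiler can turn the data-dependent choice into a cmov; whether it actually does depends on the compiler and flags:

    #include <stddef.h>

    /* Return the first index in sorted a[0..n-1] with a[i] >= key.
     * The data-dependent choice is a ternary the compiler can turn into a
     * cmov, so there is no branch whose direction depends on the data. */
    size_t lower_bound(const int *a, size_t n, int key) {
        if (n == 0) return 0;
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half] < key) ? base + half : base;   /* cmov candidate */
            n -= half;
        }
        return (size_t)(base - a) + (base[0] < key);
    }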
That necessitates executing both sides fully. Very often, those are "business logic" branches that are very long, and you would really only want to cover the branch-mispredict penalty.
Yes you would only do this for hot loops that have short basic block branches (binary search being the canonical example). That’s why I said hand annotate vs having the CPU try to detect and distinguish these situations at runtime.
Yes, we're talking about different things. Those can be optimized in software by doing what you mentioned. The branches I am talking about cannot; they often sit in "business logic" code.
I fancy myself as having a good understanding of modern uarch, but I have to agree with @Marthinwurer: this branch predictor structure, with the parallel predictor and fake branch addresses, is quite wild.
Do you know how this compares to what AMD / Apple M-series / Qualcomm are doing? This seems super effective, but also pretty power hungry compared to just increasing the cache size and predictor precision. Plus I would assume it makes the cost of a mispredict even higher.
I'm pretty sure the pattern of allowing the branch predictor to run ahead is pretty common.
At least, it's common to have multi-level branch predictors that take a variable number of cycles to return a result, and it makes a lot of sense to queue up predictions so they are ready when the decoder gets to that point.
But I doubt the idea of parallel decoders makes any sense outside of x86's complex variable-length instructions.
It (probably) makes sense on x86 because x86 cores were already spending a bunch of power on instruction decoding and the uop cache.
> Plus I would assume it makes the cost of a mispredict even higher.
It shouldn't increase the mispredict cost by too much.
The new fetch address will bypass the branch-prediction queue and feed directly into one of the three decoders. And previous implementations already have a uop queue between the decoder and re-name/dispatch. It gets flushed and the first three uops should be able to cross it in a single cycle.
Thanks, that's a nice explanation. I hadn't looked into the details of how the multiple decoders in the *monts worked. Relying on branches and prediction to find the instruction boundaries is quite a nifty trick.
If the branch predictor is to predict branches ahead, it needs to know where the branch instructions are. Is there a mini decoder tasked with decoding the instruction stream just enough to handle the variable-length instructions and figure out where the branches are? Or am I fundamentally misunderstanding how branch prediction works (which likely I am)?
There seems to be a very common misconception about branch prediction, that its only job is to predict the direction of the branch.
In reality, the problem is so much deeper. The instruction fetch stage simply can't see the branch at all. Not just conditional branches, but unconditional jumps, calls and even returns too.
Even a simple 5-stage "classic RISC" pipeline takes a full two cycles to load the instruction from memory and decode it before it can see that it's a branch, by which point the instruction fetch stage has already fetched two incorrect instructions (though many RISC implementations cheat with an instruction cache fetch that takes half a cycle, and then add a delay slot).
In one of these massive out-of-order CPUs, the icache fetch might take multiple cycles, (then length decoding on x86), so it might take 4 or 5 cycles before the instruction could possibly be decoded. And if you are decoding 4 instructions per cycle, that's 20 incorrect instructions fetched from icache.
To actually continue fetching without any gaps, the branch predictor needs to predict the following (a rough sketch of the fields involved comes after the list):
1. The location of the branch
2. The type of branch, and (for conditional branches) whether it's taken or not.
3. The target address it will redirect fetch to, if taken.
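Something like this has to come out of the predictor per predicted branch, before any instruction bytes have been decoded. The field names and types here are mine for illustration, not any real BTB layout:

    #include <stdint.h>

    enum branch_kind { COND_BRANCH, UNCOND_JUMP, CALL, RET, INDIRECT };

    struct branch_prediction {
        uint64_t branch_pc;     /* 1. where the branch instruction is          */
        enum branch_kind kind;  /* 2. what kind of branch it is                */
        int      taken;         /*    ...and, if conditional, its direction    */
        uint64_t target_pc;     /* 3. where fetch should continue if taken     */
    };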
Follow-up question: if the branch is predicted to not be taken, why does the predictor have to use resources to record its location and the destination?
These predictors change their prediction (both direction and destination) based on the history of the last few hundred branches and if they were taken or not-taken. So the predictor needs to know where those branches were, even if they aren't taken.
Indirect TAGE predictors are very powerful. They can correctly predict jump tables and virtual function calls.
In general, branch predictors don't utilise their tables very efficiently. Cheap and fast lookups are way more important than minimising size.
Classic disrupting yourself from below. A little late now that ARM chips from both Apple and now Qualcomm (maybe?) have caught up, but the second best time is now.
I give it 2 more releases max before Intel drops heterogeneous cores and only ships the -Mont architecture.
Because, presumably, the P-cores are even beefier.
Intel and AMD are still trying to buy time on the slow march of ARM (particularly Apple) catching up. Their long-term strategies seem to differ (AMD edging back into ARM itself, Intel being a little more close-lipped), but they can't lose their one major edge (raw performance), or potentially more users switch to an x86-excluded (and, more importantly, third-party-excluded) platform (Mac).
In all seriousness, the main advantage of the P cores is the wider vector datapath. They are much more set up for loopy "grunt work" like matrix math. Web serving probably doesn't need a P core, for example.
The ISA is much less important than many people seem to think. The RISC vs CISC debate is beyond outdated at this point because no modern architecture actually works strictly like either under the hood. Organizations who did x86 architectures historically had much more emphasis on performance while organizations who did ARM had more emphasis on low power devices. The lingering engineering consequences of that history and the experience of the organizations doing design are orders of magnitude more relevant than the difference in actual ISA.
ARM is special in that it's the only potential, realistic competitor to x86 left in the entire industry. RISC-V? Only if you're a zealot breathing fumes for life energy, at least as things stand today.
The interesting thing about RISC-V is that there are like 10 billion RISC-V microcontrollers out there that otherwise would have been ARM. So ARM has been moving into the PC/server space while RISC-V pushes in from behind, at the opposite end of the line from x86.
RISC-V is disrupting ARM's low end, the space it has occupied safely for decades. As RISC-V matures, it will move upstream along the same path that ARM did. The difference is that it will be easier to move, because the transition to ARM is demonstrating that companies don't need to be locked to a particular ISA.
It's so new that the Wikipedia page hasn't been written yet; it still redirects to the old usage of the codename as the previous name of Cannon Lake. Or it should redirect to a Lunar Lake page:
-ffast-math does a whole bunch of things, which I wish people didn't so commonly combine.
For example, I think the things it does which are sensible for most people are:
* Flushing subnormals to zero
* Disabling signed zeroes
* Disabling support for 'trapping' (throwing SIGFPE)
Then there are the 'middle' things, which annoy some people:
* Allowing reassociation of operations, and identities like sqrt(x*y) = sqrt(x)*sqrt(y) and exp(x)*exp(y) = exp(x+y)
However, it also assumes no operation will produce a NaN or an infinity, which I often find breaks code. These last two don't really help much, and they break code in confusing ways. This being gcc, they don't just turn things like std::isnan or std::isinf into an 'abort' (which would at least make sense, since under -ffast-math they don't make sense); they just return nonsense instead.
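A concrete example of the "returns nonsense" failure mode (exact behaviour depends on compiler version, but this is the typical effect of -ffinite-math-only, which -ffast-math implies):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        volatile double zero = 0.0;    /* keep the NaN from being folded at compile time */
        double x = zero / zero;        /* NaN at runtime */
        printf("isnan: %d  isinf: %d\n", isnan(x), isinf(x));
        /* Plain -O2:       typically prints "isnan: 1  isinf: 0".
         * -O2 -ffast-math: may print "isnan: 0  isinf: 0", because the
         * compiler is allowed to assume NaNs and infinities never occur. */
        return 0;
    }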
For me, the most important part is -fno-math-errno, which allows the compiler to ignore the fact that libm functions are allowed to set errno. This is perfectly safe (unless you rely on that little-known side effect) and is usually the one flag I explicitly set.
And the primary benefit of doing that is so the compiler can inline math functions like sqrt() as a tiny number of instructions (on modern CPUs) instead of having to call the standard C function, which is much slower.
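A small illustration of that; the exact codegen varies, but this is the usual shape on x86-64 with gcc/clang:

    #include <math.h>

    double root(double x) {
        return sqrt(x);
    }
    /* Default flags:    sqrtsd plus a compare and an out-of-line call to libm's
     *                   sqrt on the negative-input path, so errno can be set.
     * -fno-math-errno:  just the sqrtsd instruction. */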
For sqrt() on x86_64 and gcc/clang, yes. But functions like fmod() generally take more instructions. And as for trig functions like sin(), AFAIK most compilers will always use a function call, because the x86 trig instructions don't have good speed/accuracy compared to a modern stdlib.
And YMMV when it comes to other architectures and compilers (and the various math-related -f flags).
There's more. It also enables some limited (unsafe) rearrangement of your floating point expressions, as well as a few other settings inside the floating point system. Flushing denorms to 0 is only part of it.
It also will do things like using an approximate reciprocal square root instruction plus a refinement iteration instead of fsqrt then fdiv.
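For example, something along these lines for 1.0f/sqrtf(x); this is a sketch of the pattern, and the exact sequence the compiler emits depends on the target and flags:

    #include <xmmintrin.h>

    /* Approximate reciprocal square root: rsqrtss gives roughly 12 bits of
     * precision, and one Newton-Raphson step gets close to full single
     * precision while avoiding both sqrtss and a divide. */
    static inline float fast_rsqrt(float x) {
        float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
        return y * (1.5f - 0.5f * x * y * y);
    }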
I also learned from experience that -ffast-math only enabled the FTZ/DAZ optimization on the main thread, at least on Linux/x86. I don't know if that's universal or whether it has changed since I debugged it ~5-6 years ago, but it proved a bit hard to get to the bottom of, since I immediately suspected that the big CPU spike when the volume was set very low was caused by denormals, yet we were using the -ffast-math gcc flag.
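The FTZ/DAZ bits live in each thread's MXCSR register, so what worker threads end up with depends on what state they inherit or reset. One way to be sure is to set the bits explicitly at the start of each worker thread; a sketch (not the code -ffast-math's startup object uses):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    static void enable_ftz_daz(void) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* flush tiny results to zero */
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* treat denormal inputs as zero */
    }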
I would start with a college/university course on the topic; there should be tons of slides and lectures available for free. Only after learning the basic concepts and common approaches would I look for recent documents, which can probably be downloaded from Intel and other chip makers. But I would say it all depends on where you are starting from and how deep you want to go.
Wait, did they only fix subnormals on the E-cores or did they fix them on the P cores also? It would be really weird if they only fixed this on the E-cores, but I haven’t seen anything saying that Redwood Cove fixed this issue.
The competitor would be the efficiency cores on the M-series chips. I don't know how well they compare, though. Apple doesn't have any SKUs with only efficiency cores, AFAIK. If they did, it would be something like the Apple Watch, but since ARM has had the big.LITTLE architecture for many years, there was no need to have chips with only efficiency cores to achieve efficiency.