Intel details Skymont (chipsandcheese.com)
151 points by rbanffy 6 months ago | 60 comments



It looks like the next generation of *mont cores will be as big and capable as Skylakes. With E cores like this, who needs P cores?

Also, Intel was definitely onto something with the split decoders IMO. The x86 instruction set is painful to decode 8-wide in a single thread, but most code is branchy and loopy, so you only hurt in this configuration if loops are really big. Tight loops come from the uop cache, and branchy code gets 3-way decoding.


> Tight loops come from the uop cache, and branchy code gets 3-way decoding.

First, there is no uop cache on the "mont" cores.

Second, Intel aren't decoding both sides of the branch.

That wouldn't actually help much, as modern branch predictors are correct well over 99% of the time. It would be a waste of silicon and power to have an extra decoder producing work which is simply discarded most of the time, and an even bigger waste to have two extra decoders.

Intel's actual approach is way more clever; They run the branch predictor ahead of the decoders by at least 3 branches (probably more). The branch predictor can spit out a new prediction every cycle, and it just plops them on a queue.

Each of the three decoders pops a branch prediction off the queue and starts decoding there. At any time, all three decoders will each be decoding a different basic block, one that the branch predictor has predicted the program counter is about to flow through. The three decoders are leapfrogging each other. The decoding of each basic block is limited to a throughput of three instructions per cycle, but Skymont is decoding three basic blocks in parallel.

The decoded uops get pushed onto three independent queues, and the re-namer/dispatcher merges these three queues back together in original program order before dispatching to the backend. Each decoder can only push three uops per cycle onto its queue, but the re-namer/dispatcher can pull them off a single queue at the rate of 9 uops per cycle. The other two queues will continue to fill up while one queue is being drained.

The branch prediction result will always land on an instruction boundary, so this design allows the three decoders to combine their efforts and maintain a throughput of 9 uops per cycle, as long as the code is branchy enough. It works on loops too; as far as I'm aware, Intel doesn't even have a loop stream buffer on this design, so the three decoders will be decoding the exact same instructions in parallel for loop bodies.

But Intel have a neat trick to make this work even on code without branches or loops. The branch predictor actually inserts fake branches into the middle of long basic blocks. The branch predictor isn't actually checking an address to see if it has a branch. Instead it predicts the gap between branches, and they simply have a limit for the size of those gaps. Looks like that limit for Skymont is 64 bytes (it was previously 32 bytes for Crestmont).
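
For anyone who wants to see the shape of the idea, here is a toy model of the leapfrogging scheme as described above (all names, widths and queue depths are made up for illustration; this is not Intel's implementation):

    #include <algorithm>
    #include <cstdio>
    #include <deque>
    #include <vector>

    // A predicted basic block: just an id and how many instructions it contains.
    struct BasicBlock { int id; int num_insts; };

    int main() {
        // Output of the branch predictor: predicted blocks, in program order.
        std::deque<BasicBlock> prediction_queue = {
            {0, 7}, {1, 2}, {2, 5}, {3, 3}, {4, 9}, {5, 4},
        };

        const int kDecoders = 3, kDecodeWidth = 3;   // 3 decoders, 3 uops/cycle each
        struct Decoder { int block_id = -1; int remaining = 0; };
        std::vector<Decoder> decoders(kDecoders);

        int cycle = 0, blocks_left = 6;
        while (blocks_left > 0) {
            ++cycle;
            int uops = 0;
            for (auto& d : decoders) {
                // An idle decoder grabs the next predicted block (the leapfrog).
                if (d.block_id < 0 && !prediction_queue.empty()) {
                    d.block_id = prediction_queue.front().id;
                    d.remaining = prediction_queue.front().num_insts;
                    prediction_queue.pop_front();
                }
                if (d.block_id < 0) continue;
                // Each decoder handles at most 3 instructions of *its* block per cycle.
                int n = std::min(kDecodeWidth, d.remaining);
                d.remaining -= n;
                uops += n;
                if (d.remaining == 0) { d.block_id = -1; --blocks_left; }
            }
            // The rename/dispatch stage then merges the three per-decoder queues
            // back into program order at up to 9 uops/cycle (not modelled here).
            std::printf("cycle %d: %d uops decoded\n", cycle, uops);
        }
    }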


Thank you for that explanation, I was confused as to what was happening with the multiple decoders. That's a wild way to implement a processor front end.


Oh, I thought the uop queues were uop caches when I looked at the diagram. Not having loop handling does seem off, but I guess with long loops they will just alternate between the decoders.

The whole 99% branch prediction thing is sort of misleading - most branches are loop branches taken a constant number of times, so most are perfectly predictable, and most others are error checks which are also easy to predict. However, a large amount of comparative wall time in code is spent on sequences of a short piece of code and a branch that is hard to predict. Without hyperthreading, I would assume that decoding both sides of the branch would actually help a lot in these circumstances. It sounds like Intel is possibly capable of doing that.

The synthetic basic blocks are also an interesting idea given how hard it is to figure out where an x86 instruction boundary is. It's easy to split a basic block when you have a branch going to that basic block, but if you just synthetically insert a split some distance down, you may be misaligned with the actual instruction stream. That can be self-synchronizing at points, but it's hard.


> Not having loop handling does seem off

I do agree. The fact that those uop queues are already there and Intel isn't using them as a loop buffer does make me ask questions. Have they just not gotten around to it? Have they decided it's not worth the power savings? Maybe they are aiming for simplicity?

> However, a large amount of comparative wall time in code is spent on sequences of a short piece of code and a branch that is hard to predict.

The thing is, any time the branch predictor has at least one correct prediction, the decode throughput doubles to 6 IPC. And if it gets two correct predictions in a row, the IPC triples to 9.

I'm not sure how many cycles the "execute both sides of the branch" approach would save on a mispredict, but your basic blocks would need to be very short and the prediction accuracy would need to be very low (like, 50% or lower) before it could actually outperform the leapfrogging decoder approach on those sequences of code.

> but if you just synthetically insert a split some distance down, you may be misaligned with the actual instruction stream

It only inserts the splits after decoding, so they will always be at the correct alignment.


50% is about as bad as you can get without code that is specifically pathological - that is random guessing.


Branches that are hard to predict should be “hand-optimized” through specially written code, compiler intrinsics/annotations, or profile guided feedback to tell the compiler to emit the conditional using unconditional branchless instructions like cmov. Expecting a CPU to detect this at runtime may be asking it to do too much.


That necessitates executing both sides fully. Very often, those are "business logic" branches that are very long, and you would only prefer to cover branch mispredict penalty.


Yes you would only do this for hot loops that have short basic block branches (binary search being the canonical example). That’s why I said hand annotate vs having the CPU try to detect and distinguish these situations at runtime.
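
For reference, the hand-optimized version of that canonical example looks something like this sketch (the function name is mine, and whether the ternary actually lowers to a cmov depends on the compiler and flags):

    #include <cstddef>

    // Index of the first element >= key in a sorted array (lower_bound),
    // written so the inner comparison can compile to a conditional move
    // instead of a hard-to-predict branch.
    std::size_t branchless_lower_bound(const int* data, std::size_t n, int key) {
        const int* base = data;
        while (n > 1) {
            std::size_t half = n / 2;
            // Both outcomes cost one pointer update, so there is no branch to mispredict.
            base = (base[half - 1] < key) ? base + half : base;
            n -= half;
        }
        return (base - data) + (n == 1 && *base < key);
    }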


Yes, we're talking about different things. Those are able to be optimized in software doing what you mentioned. The branches I am talking about are not. They often sit in "business logic" code.


Yeah but business logic code like that isn’t generally bounded by the misprediction penalty so it doesn’t matter.


I fancy myself as having a good understanding of modern uarch, but I have to agree with @Marthinwurer: this branch predictor structure, with a parallel predictor and fake branch addresses, is quite wild.

Do you know how this compares to what AMD/Apple M/Qualcomm is doing? This seems super effective, but also pretty power hungry compared to just increasing the cache size and predictor precision. Plus I would assume it makes the cost of a mispredict even higher.


I'm pretty sure the pattern of allowing the branch predictor to run ahead is pretty common.

At least, it's common to have multi-level branch predictors that take a variable number of cycles to return a result, and it makes a lot of sense to queue up predictions so they are ready when the decoder gets to that point.

But I doubt the idea of parallel decoders makes any sense outside of x86's complex variable-length instructions.

It (probably) makes sense on x86 because x86 cores were already spending a bunch of power on instruction decoding and the uop cache.

> Plus I would assume it makes the cost of a mispredict even higher.

It shouldn't increase the mispredict cost by too much.

The new fetch address will bypass the branch-prediction queue and feed directly into one of the three decoders. And previous implementations already have a uop queue between the decoder and re-name/dispatch. It gets flushed and the first three uops should be able to cross it in a single cycle.


It is actually probably cheaper than the alternative of attempting to decode at all possible instruction boundaries in parallel!


Thanks, that's a nice explanation. I hadn't looked in details of how the multiple decoders in the *monts worked. Relying on branches and prediction to find the instruction boundaries is quite a nifty trick.


If the branch predictor is to predict branches ahead, it needs to know where the branch instructions are. Is there a mini decoder tasked with decoding the instruction stream just enough to handle the variable-length instructions and figure out where the branches are? Or am I fundamentally misunderstanding how branch prediction works (which I likely am)?


There seems to be a very common misconception about branch prediction, that its only job is to predict the direction of the branch.

In reality, the problem is so much deeper. The instruction fetch stage simply can't see the branch at all. Not just conditional branches, but unconditional jumps, calls and even returns too.

Even a simple 5-stage "classic RISC" pipeline takes a full two cycles to load the instruction from memory and decode it before it can see the branch, and by then the instruction fetch stage has already fetched two incorrect instructions (though many RISC implementations cheat with an instruction cache fetch that takes half a cycle, and then add a delay slot).

In one of these massive out-of-order CPUs, the icache fetch might take multiple cycles (then length decoding on x86), so it might take 4 or 5 cycles before the instruction could possibly be decoded. And if you are decoding 4 instructions per cycle, that's 20 incorrect instructions fetched from the icache.

To actually continue fetching without any gaps, the branch predictor needs to predict:

1. The location of the branch

2. The type of branch, and (for conditional branches) if it's taken or not.

3. The destination of the branch
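
Put together, each prediction the front end consumes is really a little record along these lines (field names and widths here are invented for illustration, not Intel's actual format):

    #include <cstdint>

    enum class BranchKind : std::uint8_t { Conditional, Jump, Call, Return };

    // One entry popped off the branch-prediction queue by a decoder.
    struct FetchPrediction {
        std::uint64_t block_start;  // where this fetch block begins
        std::uint16_t block_bytes;  // predicted distance to the next branch   (1)
        BranchKind    kind;         // predicted type of that branch           (2)
        bool          taken;        // predicted direction, for conditionals   (2)
        std::uint64_t target;       // predicted destination if taken          (3)
    };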


OK, that makes much more sense now; thanks!

Follow-up question: if the branch is predicted to not be taken, why does the predictor have to use resources to record its location and the destination?


Intel are probably using a TAGE-style predictor along the lines of ITTAGE or COTTAGE from http://www.irisa.fr/caps/people/seznec/JILP-COTTAGE.pdf

These predictors change their prediction (both direction and destination) based on the history of the last few hundred branches and if they were taken or not-taken. So the predictor needs to know where those branches were, even if they aren't taken.

Indirect TAGE predictors are very powerful. They can correctly predict jump tables and virtual function calls.

In general, branch predictors don't utilise their tables very efficiently. Cheap and fast lookups are way more important than minimising size.
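
A very stripped-down sketch of the TAGE lookup idea, for the direction side only (ITTAGE stores a predicted target in place of the taken/not-taken counter). Table sizes, hash functions and history lengths here are toy values, not anything from the paper:

    #include <array>
    #include <cstdint>

    struct TageEntry { std::uint16_t tag = 0; std::int8_t ctr = 0; };  // ctr >= 0 means "taken"

    constexpr int kTables  = 4;
    constexpr int kEntries = 1024;
    constexpr std::array<int, kTables> kHistLen = {8, 16, 32, 64};  // geometric history lengths

    std::array<std::array<TageEntry, kEntries>, kTables> tables;
    std::uint64_t global_history = 0;  // one bit per recent branch: taken / not-taken

    // Fold the most recent `bits` of history down to 32 bits for hashing.
    static std::uint32_t fold(std::uint64_t hist, int bits) {
        std::uint64_t m = (bits >= 64) ? hist : (hist & ((1ULL << bits) - 1));
        return static_cast<std::uint32_t>(m ^ (m >> 32));
    }

    // The longest tag-matching table wins; fall back to a simpler base predictor.
    bool predict(std::uint64_t pc, bool base_prediction) {
        bool pred = base_prediction;
        for (int t = 0; t < kTables; ++t) {
            std::uint32_t h   = fold(global_history, kHistLen[t]);
            std::uint32_t idx = (static_cast<std::uint32_t>(pc) ^ h ^ (h >> 5)) % kEntries;
            std::uint16_t tag = static_cast<std::uint16_t>(((pc >> 2) ^ (h * 0x9E37)) & 0x3FF);
            if (tables[t][idx].tag == tag)
                pred = tables[t][idx].ctr >= 0;
        }
        return pred;
    }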


Classic disrupting yourself from below. A little late now that ARM chips from both Apple and now Qualcomm (maybe?) have caught up, but the second best time is now.

I give it 2 more releases max before Intel drops heterogeneous cores and only ships the -Mont architecture.


Agreed. We've seen this before. Pentium M -> Core.


> With E cores like this, who needs P cores?

Because, presumably, the P-cores are even beefier.

Intel and AMD are still trying to gain time on the slow march to ARM (particularly Apple) catching up. Both of their long term strategies seem to differ (AMD edging back into ARM itself, Intel being a little more close-lipped), but they can't lose their one major edge (raw performance) or potentially more users switch to an x86-excluded (and, more importantly, third-party excluded) platform (Mac).


In all seriousness, the main advantage of the P cores is the wider vector datapath. They are much more set up for loopy "grunt work" like matrix math. Web serving probably doesn't need a P core, for example.


Is ARM really that special? Why do you believe this is the case?


The ISA is much less important than many people seem to think. The RISC vs CISC debate is beyond outdated at this point, because no modern architecture actually works strictly like either under the hood. Organizations that did x86 architectures historically put much more emphasis on performance, while organizations that did ARM put more emphasis on low-power devices. The lingering engineering consequences of that history and the experience of the organizations doing design are orders of magnitude more relevant than the difference in actual ISA.


ARM is simply the ISA the industry is pivoting to and where most forward investment is going.

I don't know where I implied it was special.


ARM is special in that it's the only potential, realistic competitor to x86 left in the entire industry. RISC-V? Only if you're a zealot breathing fumes for life energy, at least as things stand today.


The interesting thing about RISC-V is that there are like 10 billion RISC-V microcontrollers out there that otherwise would have been ARM. So ARM has been moving into the PC/server space while RISC-V pushes in from behind, at the opposite end of the line from x86.


RISC-V is disrupting ARM's low end, the space it's occupied safely for decades. As RISC-V matures, it will move upstream along the same path that ARM did. The difference is that it will be easier to move because the transition to ARM is demonstrating that companies don't need to be locked to a particular ISA.


> or potentially more users switch to an x86-excluded (and, more importantly, third-party excluded) platform (Mac).

Or much more likely, Windows ARM.


>With E cores like this, who needs P cores?

Based on the current performance, anyone who needs it.


The two-generation-old Gracemont already beat Skylake [0]. Skymont can beat Raptor Cove [1] (the big core that was paired with Gracemont).

[0]: https://www.anandtech.com/show/16881/a-deep-dive-into-intels...

[1]: https://www.anandtech.com/show/21425/intel-lunar-lake-archit...


I did the math and even on current 13900k/14900k chips each E-core is roughly equivalent to a stock Skylake 6700k core.


So Skymont is the architecture of the efficiency core of Lunar Lake:

https://www.anandtech.com/show/21425/intel-lunar-lake-archit...

It's so new that the Wikipedia page hasn't been written yet; it still redirects to the old usage of the codename as the previous name of Cannon Lake. Or it should redirect to a Lunar Lake page:

https://en.wikipedia.org/?title=Skymont_(microarchitecture)&...


TIL rounding denormals to zero is what -ffast-math actually does.


-ffast-math does a whole bunch of things, which I wish people didn't so commonly combine.

For example, I think the things it does which are sensible for most people are:

* Rounding subnormals to zero

* Disabling signed zeroes

* Disables support for 'trapping' (throwing SIGFPE)

Then there are the 'middle' things, which annoy some people:

* Allow associative operations, and things like sqrt(x*y)=sqrt(x)*sqrt(y), exp(x)*exp(y)=exp(x+y)

However, it also assumes no operation will produce a NaN or an Infinity (which I often find breaks code) -- these last two don't really help, and break code in confusing ways. This being gcc, it doesn't just change things like std::isnan or std::isinf into an 'abort' (which would make sense, since under -ffast-math they don't make sense); they just return nonsense instead.
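
A tiny example of that last point: code like this can silently lose its NaN guard, because -ffinite-math-only (part of -ffast-math) lets GCC/Clang assume the check can never be true. Exact behaviour varies by compiler and version, so treat this as illustrative:

    #include <cmath>
    #include <cstdio>
    #include <limits>

    // Returns -1.0 as an error sentinel when the computation produces NaN.
    double scaled(double raw) {
        double v = raw * 0.0;        // NaN if raw is +/-infinity (IEEE rules)
        if (std::isnan(v))           // may be folded to 'false' under -ffast-math
            return -1.0;
        return v;
    }

    int main() {
        double inf = std::numeric_limits<double>::infinity();
        // Normally prints -1.000000; under -ffast-math the NaN guard may be
        // optimized out and you get nan (or 0) instead.
        std::printf("%f\n", scaled(inf));
    }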


For me, the most important part is -fno-math-errno which allows the compiler to ignore that libm functions are allowed to set errno. This is perfectly safe (unless you rely on that rarely known side effect) and is usually the one flag I explicitly set.


And the primary benefit of doing that is so the compiler can inline math functions like sqrt() as a tiny number of instructions (on modern CPUs) instead of having to call the standard C function, which is much slower.
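
As a concrete illustration (codegen details vary by compiler and target): with -fno-math-errno, GCC and Clang can turn a loop like this into straight sqrtsd/sqrtpd instructions and even vectorize it; without it, they have to keep a path that calls the libm sqrt so that errno can be set for negative inputs.

    #include <cmath>
    #include <cstddef>

    void sqrt_all(double* x, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            x[i] = std::sqrt(x[i]);  // inlined (and vectorizable) once errno is off the table
    }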


> as a tiny number of instructions

specifically 1


For sqrt() on x86_64 and gcc/clang, yes. But functions like fmod() are generally more instructions. And as for trig functions like sin(), AFAIK most compilers will always use a function call, because the x86 trig instructions don't have good speed/accuracy compared to a modern stdlib.

And YMMV when it comes to other arch's and compilers (and -fmath settings).


You are absolutely right, I forgot one of the most important things piled into -ffast-math!


Not sure if I should be relieved or concerned that even Ted Unangst doesn't know what -ffast-math actually does.

(For the record, it does a whole lot of terrible things: https://stackoverflow.com/questions/7420665/what-does-gccs-f... )


There's more. It also enables some limited (unsafe) rearrangement of your floating point expressions, as well as a few other settings inside the floating point system. Flushing denorms to 0 is only part of it.

It also will do things like using an approximate reciprocal square root instruction plus a refinement iteration instead of fsqrt then fdiv.
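
Written out by hand with SSE intrinsics, that substitution looks roughly like this (the compiler does it for you under the relevant -ffast-math/-mrecip settings; this is just what the pattern amounts to):

    #include <xmmintrin.h>

    // Approximate 1/sqrt(x): low-precision hardware estimate plus one
    // Newton-Raphson refinement step, instead of a full sqrt + divide.
    __m128 fast_rsqrt(__m128 x) {
        __m128 y = _mm_rsqrt_ps(x);                   // ~12-bit accurate estimate
        // y = y * (1.5 - 0.5 * x * y * y)
        __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
        __m128 y2     = _mm_mul_ps(y, y);
        return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(half_x, y2)));
    }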


I also learned from experience that -ffast-math only enables the FTZ/DAZ optimization on the main thread, at least on Linux/x86. I don't know if it's universal or if this has changed since I debugged it ~5-6 years ago, but it proved a bit hard to get to the bottom of, since I immediately suspected the big CPU spike when the volume was set very low was caused by denormals, yet we were using the -ffast-math gcc flag.


With clang on my Intel Mac it doesn't work at all; the only solution is to set those flags using _mm_setcsr().
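
For reference, the explicit per-thread version looks something like this (MXCSR is per-thread state, so every thread doing the FP work has to run it):

    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE

    void enable_ftz_daz_on_this_thread() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results are flushed to 0
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs are treated as 0
    }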


I think the linking with the flush-to-zero code nonsense has been moved to a separate flag in recent GCCs.



Slightly offtopic. What would you suggest as an introductory text on modern CPU architectures?


Computer Architecture by Hennessy and Patterson is the classic introductory text.

It's dated, but Inside the Machine by Jon Stokes is also a good read if you're after something a bit lighter.


Thank you!


I would start with a college/university course on the topic, there should be tons of slides and lectures available for free. And only after knowing basic concepts and common approaches I would look for some recent documents, probably they can be downloaded from Intel and other chip makers. But I would say that it all depends on where you are starting from and how deep you want to go.


Slightly old, but a relatively quick read and a good overview! https://www.lighterra.com/papers/modernmicroprocessors/


Wait, did they only fix subnormals on the E-cores or did they fix them on the P cores also? It would be really weird if they only fixed this on the E-cores, but I haven’t seen anything saying that Redwood Cove fixed this issue.


How do these modern Atoms compare to the Apple ARM chips? Does Apple make something comparable in terms of power/performance?


The competitor would be the efficiency cores on the M-series chips. I don't know how well they compare though. Apple doesn't have any SKUs with only efficiency cores, AFAIK. If they did it would be something like the Apple Watch, but since ARM has had the big.LITTLE architecture for many years, there was no need to have chips with only efficiency cores to achieve efficiency.


Apple Watch SoCs are "just" the efficiency cores of their iPhone cousins


Could these processors help Intel move further into AI inference?


Intel has the NPU for inference so it doesn't need CPU cores at all.


They already have the Gaudi chip for AI; Skymont is a different kind of chip.


More cores and memory channels+bandwidth would have been nice.



