I've found SMT to give a 20-25 percent performance improvement on parallel workloads using a lot of the same data, so the threads shouldn't be polluting each other's caches. Different users would obviously not get the same improvement.
I'm not sure 20-25 percent is even worth having SMT given that it's like going from 4 cores to 5.
> I'm not sure 20-25 percent is even worth having SMT given that it's like going from 4 cores to 5.
Can you think of another 25% bonus that costs nothing in terms of chip area? SMT doesn't add any new registers (the register files and retirement buffers are the same). SMT doesn't add cache, or pipelines, or anything.
SMT just requires the chip's resources to be partitioned up in some cases, so it's a complication to add to a chip, but it doesn't seem to take up much area at all.
Normally, to get a 25% bonus you'd have to double the L2 cache or something similar. And as Microsoft's 3x L3 cache tests for Milan-X prove, even 300% of the L3 cache isn't anywhere close to 300% faster, despite taking up significantly more die area... albeit cheap die area in the form of 3D-stacked chiplets. Still, a lot more silicon is being thrown at the problem there, compared to nearly zero silicon in the case of SMT.
AMD Zen 2 has 180 registers in its 64-bit register file (a separate set of registers, also nearly 200-ish, backs the 256-bit YMM vector registers). The 16 architectural registers (RAX, RBX, etc.) are mapped onto these 180 registers for out-of-order execution.
SMT is simply two threads splitting this 180-register file between each other.
---------
That's how out-of-order execution is accomplished. The older versions of each register are remembered so that later code can start executing before earlier code has finished.
In practice, instructions like "xor rax, rax" are implemented as "malloc a new register from the file, then set that new register to zero".
As such, all you need is the "architectural file", saying "Thread#0's RAX at instruction pointer 0x80084412 is really RegisterFile[25]", "Thread#0's RAX at instruction pointer 0x80084418 is RegisterFile[26]", and finally "Thread#1's RAX at instruction pointer 0x9001beef is RegisterFile[27]".
That allows all 3 contexts to execute out-of-order and concurrently. (In this case, the decoder has decided that Thread#0 has two things to execute in parallel for some reason, while Thread#1 only has one RAX allocated in its stream.)
The retirement queue puts everything back into the original, programmer-intended order before anyone notices what's up. Different threads, running different sections of code, will in practice use up different numbers of registers.
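The renaming scheme above can be sketched in a few lines of Python. This is a toy model with hypothetical names (real rename hardware is free lists and content-addressable memories, not dicts), but it shows the core idea: a shared physical file, plus a per-thread map from architectural register to physical register.

```python
# Toy register-rename sketch (hypothetical names, not a real CPU model).
PHYS_REGS = 180                     # Zen 2-sized physical file, per the text
free_list = list(range(PHYS_REGS))  # physical registers not yet allocated
rename_map = {}                     # (thread_id, arch_reg) -> physical index

def rename(thread_id, arch_reg):
    """Allocate a fresh physical register for a new write to arch_reg."""
    phys = free_list.pop(0)
    rename_map[(thread_id, arch_reg)] = phys
    return phys

# Thread#0 writes RAX twice (two in-flight versions); Thread#1 writes it once.
p0 = rename(0, "RAX")   # oldest version of Thread#0's RAX
p1 = rename(0, "RAX")   # a second, independent copy for the younger write
p2 = rename(1, "RAX")   # Thread#1 gets its own mapping in the same file

# All three "RAX"s live in distinct physical registers, so all three
# contexts can execute concurrently without stepping on each other.
assert len({p0, p1, p2}) == 3
```

Note that "xor rax, rax" in this model really is just the `rename(...)` call plus zeroing the freshly allocated register, which is why SMT threads can share one file: they just get separate map entries.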
A "linked-list" loop like "while(node) node = node->next;" will be sequential and use almost none of the register file (it's impossible to extract parallelism from it, even with all the out-of-order tricks we have today), meaning the other SMT thread can grab maybe all 180 registers in the file for itself.
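As a tiny illustration of why that loop is serial, here is the same traversal in Python (my own stand-in node class, not from the original C): the *address* of each load is the *result* of the previous load, so the dependency chain is exactly as long as the list, and no amount of renaming can break it.

```python
# Sketch: why "while(node) node = node->next" can't go wide.
class Node:
    def __init__(self, nxt=None):
        self.next = nxt  # the pointer we must load before the next step

# Build a 5-node list: head -> ... -> None
node = None
for _ in range(5):
    node = Node(node)

chain_length = 0
while node:              # every iteration waits on the prior load's result
    node = node.next     # this load's address came from the last load
    chain_length += 1

# 5 loads, all strictly in sequence: a dependency chain of length 5.
assert chain_length == 5
```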
-----
Side note: this is why I think Apple's M1 design is kind of insane/terrible. Apple's register file is something like 380+ registers or thereabouts (on an ARM architecture that only exposes 31 registers to the assembly language), but the M1 does NOT use SMT.
That is to say: Apple expects a single thread to actually use those 380+ registers (!!!), and to not really benefit from additional threads splitting up that absurdly huge register file. I'm sure Apple has done their research / simulations or whatever, and maybe it works out for the programs they've benchmarked / studied. But... I have to imagine that implementing SMT on the Apple chips would be one of the easier ways to improve multithreaded performance.
Great summary, though given the overall opinion of the M1's performance, they must've done something right. Maybe it's due in part to ARM being less strict on memory ordering, which would allow the M1 to actually utilize those registers.
It's no surprise that a core with a 300+ entry register file would outperform a core with a 180-entry register file. Significantly more parallelism can be found.
Apple's main risk was deciding that such large register files were in fact useful to somebody.
For highly optimized workloads (e.g. x264 video encoding or raytracers) I've seen about the same 20-25% increase at most. On general heavy and diverse workloads the number is rarely ever above 5%, while the power consumption and the security concerns remain all the same. I personally disable SMT on everything I can.