The x64 cores putting more hardware into the vector units and amdgpu changing fr...

adrian_b · 2024-09-07T17:58:25 1725731905

There is little relationship between the reasons that determine the width of SIMD in CPUs and GPUs, so there is no convergence between them.

In the Intel/AMD CPUs, the 512-bit width, i.e. 64 bytes or 16 FP32 numbers, matches the width of the cache line and the width of a DRAM burst transfer, which simplifies the writing of optimized programs. This SIMD width also provides a good ratio between the power consumed in the execution units and the power wasted in the control part of the CPU (around 80% of the total power consumption goes to the execution units, which is much more than when using narrower SIMD instructions).

Increasing the SIMD width more than that in CPUs would complicate the interaction with the cache memories and with the main memory, while providing only a negligible improvement in the energy efficiency, so there is no reason to do this. At least in the following decade it is very unlikely that any CPU would increase the SIMD width beyond 16 FP32 numbers per operation.

On the other hand, the AMD GPUs before RDNA had a SIMD width of 64 FP32 numbers, but the operations were pipelined and executed in 4 clock cycles, so only 16 FP32 numbers were processed per clock cycle.

RDNA has doubled the width of the SIMD execution, processing 32 FP32 numbers per clock cycle. For this, SIMD instructions with a reduced width of 32 FP32 have been introduced, but they are executed in one clock cycle versus the old 64 FP32 instructions that were executed in four clock cycles. For backwards compatibility, RDNA has kept 64 FP32 instructions, which are executed in two clock cycles, but these were not recommended for new programs.

RDNA 3 has changed again all this, because now sometimes the 64 FP32 instructions can be executed in a single clock cycle, so they may be again preferable instead of the 32 FP32 instructions. However it is possible to take advantage of the increased width of the RDNA 3 SIMD execution units also when using 32 FP32 instructions, if certain new instructions are used, which encode double operations.

So the AMD GPUs have continuously evolved towards wider SIMD execution units, from 16 FP32 before RDNA, to 32 FP32 in RDNA and finally to 64 FP32 in RDNA 3.

The distance from CPUs has been steadily increasing, there is no convergence.