> Not sure what you mean here. Are there places where different groups of kernels can simultaneously execute different code?
Yes. Let's take AMD's GCN / Vega, since I'm most familiar with it.
The Vega 64 has 64 "compute units" (CU for short), where a CU is the closest thing to a "core" that the Vega GPU has. So let's really look at how one of these CUs functions.
1. A kernel in GCN executes 64-wide SIMD assembly language. This is the programming model, but it is not what's going on under the hood. Each 64-wide SIMD instruction is executed over 4 clock ticks, 16 lanes at a time.
2. The CU has 4 vALUs, each 16 lanes wide (so 4x16 lanes total), where a vALU is a set of 64-wide vector registers feeding 16 arithmetic units. The CU also has one sALU ("scalar ALU", which has 32-bit registers). A kernel has access to up to 256 vGPRs (each executed in SIMD fashion: 16 lanes at a time over 4 clock ticks to cover all 64 lanes) + 103 sGPRs, which are shared across the whole wavefront. (For example: the return address of a function call is kept in sGPRs. Because a function call is "shared" between all 64 lanes, it's more efficiently implemented in sGPRs than in vGPRs.)
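To make that split concrete, here's a rough CUDA/HIP-flavored sketch (my own toy example, names and all; the compiler is of course free to allocate registers however it likes). Values that are uniform across the whole wavefront (scale, n, the branch bookkeeping) can sit in sGPRs and be handled once by the scalar unit, while per-lane values (i, x[i]) live in vGPRs and go through the 16-wide vALUs:

    #include <cstdio>

    // 'scale' and 'n' are the same for every lane of the wavefront, so a GCN
    // compiler would typically keep them in sGPRs and let the scalar ALU do
    // the bookkeeping once. 'i' and 'x[i]' differ per lane, so they live in
    // vGPRs and are processed by the 16-wide vALUs (64 lanes over 4 ticks).
    __global__ void scale_array(float *x, float scale, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-lane value -> vGPR
        if (i < n)                                      // per-lane compare -> vector condition mask
            x[i] = x[i] * scale;                        // per-lane multiply -> vALU work
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));
        for (int i = 0; i < n; ++i) x[i] = 1.0f;
        scale_array<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
        cudaDeviceSynchronize();
        printf("x[0] = %f\n", x[0]);   // 2.0
        cudaFree(x);
        return 0;
    }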
------------
There can be up to 40 simultaneous instruction pointers in flight per CU, aka: occupancy 40. (Really: each of the 4 vALUs has up to 10 instruction pointers that it's tracking.) The exact instructions will be switching between the sALU (ex: branching instructions, function calls, push / pop from the stack) and the vALUs (ex: multiply-and-accumulate, which will be SIMD-executed across 64 lanes over 4 clock ticks).
As we can see, the modern GPU isn't SIMD at all. It's executing multiple instructions, from multiple different instruction pointers (possibly different kernels, even) in parallel.
Even at a minimum number of threads... there are 4 wavefronts per CU getting executed (4 instruction pointers, one for each vALU). That's MIMD in my opinion, since there's no smaller unit than the CU in the whole of AMD Vega.
Just because it's executing SIMD kernels / SIMD assembly code doesn't mean that the underlying machine is SIMD.
---------
NVidia is similar, except with a 32-wide SIMD programming model and an occupancy of 32 per SM (though these magic numbers seem to change every generation). NVidia is also beginning to implement superscalar SIMD (2 instructions per clock tick, if those instructions go to different execution units).
So I'd also classify modern NVidia as a MIMD machine.
--------
CPUs of course have superscalar units: something like 4-way decode, and something like 6+ instructions per clock tick if executing out of the uop cache (4 instructions per clock tick otherwise).
Your description fits with my understanding, but I'd draw a slightly different conclusion out of it. As I understand it, the terms SIMD/MIMD/etc. are meant to refer to the assembly instructions used to program the chip, and from that perspective, AMD GCN is very clearly SIMD. It's just that each "core" (CU) runs up to 40x SMT...
Each "core" runs 4-wavefronts (aka: instruction pointers) per clock tick, which is MIMD.
If you have fewer than 4 wavefronts running at any given time, you've got an underutilized CU. That's 256 SIMD lanes (4 wavefronts x 64 lanes) you need at a minimum to fully utilize any single CU on an AMD Vega.
They happen to scale up to 10 wavefronts per vALU (aka: 40 wavefronts maximum), at least if your kernels use few enough registers / __shared__ memory. But even if we remove all the SMT-like features of an AMD Vega CU, we still have 4 instruction pointers being juggled by the underlying system.
Which should be MIMD by any sane definition. (4 instruction pointers gives you the MI, "multiple instruction"; each of those instructions is a SIMD instruction working on its own data, so the MD, "multiple data", is also happening.)
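As an aside on that "few enough registers / __shared__ memory" caveat: on the NVidia/CUDA side you can literally ask the runtime how many blocks of a given kernel will fit on one SM, which is exactly the resource trade-off in play. A minimal sketch with a made-up kernel (HIP exposes a similar occupancy query for AMD hardware):

    #include <cstdio>

    // A deliberately shared-memory-hungry kernel (hypothetical), just to have
    // something whose resource usage limits occupancy.
    __global__ void heavy_kernel(float *out) {
        __shared__ float tile[48 * 32];   // 6 KB of __shared__ per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (float)i;
        __syncthreads();
        out[i] = tile[threadIdx.x] * 2.0f;
    }

    int main() {
        int blockSize = 256;
        int maxBlocksPerSM = 0;
        // Ask the runtime how many blocks of this kernel can be resident on
        // one SM, given its register and shared-memory footprint.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &maxBlocksPerSM, heavy_kernel, blockSize, 0);
        printf("resident blocks per SM: %d (%d warps)\n",
               maxBlocksPerSM, maxBlocksPerSM * blockSize / 32);
        return 0;
    }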
---------
Note that SMT / Hyperthreading seems to be considered MISD in Flynn's original 1966 paper. So SMT + SIMD == MIMD, in my opinion. These modern GPUs didn't exist back in 1966, so we don't know for sure how Flynn would have categorized today's computers.
But honestly, I'd say it's somewhat insane to be reading a paper/organization scheme from 1966 and trying to apply its labels to computers and architectures invented 50 years later.
Well, for the purpose of designing and running otherwise arbitrary algorithms/code, it seems like four independent SIMD cores/chips (with 1000/many kernels each) is quite different from a thousand cores each running instructions independently. The latter offers more options.
There are 64 CUs on an AMD Vega 64 (a GPU from 2017, mind you: top end back then but just kinda middle-of-the-road these days).
So each CU natively runs 4 wavefronts (and scales to 40 wavefronts as resources allow). Each wavefront is 64-wide SIMD, with 256 SIMD lanes total running per CU.
That's 16,384 "threads" of execution on a 4-year old GPU before all cores are "lit up" and utilized... with the option to have up to 163,840 "threads" at max 40-way occupancy (useful if you have low register usage per kernel and a dependency on something that's high-latency for some reason). These are mapped into 4096 physical SIMD-lanes / "threads" that physically execute per clock tick.
---------
At any given time, there are only 256 instruction pointers (64 CUs x 4 vALUs) actually being executed, which is where and how a GPU manages to be efficient (but also the weakness of a GPU: why it has issues with "branch divergence").
EDIT: The general assumption of the SIMD model of computation (as on GPUs) is that line #250 of your program is probably going to be executed by many, many "threads". So these "threads" instead become SIMD lanes, and are batched together to execute line #250 all at once, saving power on decoding, instruction-pointer tracking, the stack, etc. etc.
As long as enough threads are doing similar things all together, it's more efficient to batch them up as a 64-wide AMD wavefront or a 32-wide NVidia warp. The programmer must be aware of this assumption. However, the underlying machine still has gross amounts of flexibility in terms of how to implement it. So it could be an MIMD machine under the hood, even if the programming model is SIMD.
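Here's a tiny CUDA illustration of that assumption (my own example). Within one warp/wavefront there is only a single instruction pointer, so if the lanes disagree at a branch, the hardware runs both sides with the non-participating lanes masked off; if they all agree, only one side runs and nothing is wasted. That's the "branch divergence" cost mentioned above:

    #include <cstdio>

    // If some lanes of a 32-wide warp (or 64-wide wavefront) take the 'if'
    // and others take the 'else', both paths execute serially with lanes
    // masked off, because the warp shares one instruction pointer.
    __global__ void divergent(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] > 0.0f)
            out[i] = in[i] * 2.0f;   // path A: lanes with positive input
        else
            out[i] = -in[i];         // path B: lanes with non-positive input
    }

    int main() {
        const int n = 64;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i)
            in[i] = (i % 2 == 0) ? 1.0f : -1.0f;   // worst case: every warp diverges
        divergent<<<1, n>>>(out, in, n);
        cudaDeviceSynchronize();
        printf("out[0]=%.1f out[1]=%.1f\n", out[0], out[1]);  // 2.0 and 1.0
        cudaFree(in); cudaFree(out);
        return 0;
    }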
-----
There's also the issue that when you KNOW 64 lanes are working together, you have assurances of where the data is. Things like bpermute and permute instructions can exist (aka: shuffle values between the lanes) because you know all 64 lanes are on the same line of code. So in practice, cross-thread collaboration (such as prefix-sum, scan operations, compress, expand...) is more efficient in the SIMD model than the equivalent mutex/atomic/compare-and-swap style programming of CPUs.
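For instance, here's the standard warp-level prefix-sum idiom in CUDA, using cross-lane shuffles (__shfl_up_sync on NVidia; AMD's ds_permute / ds_bpermute instructions play the same role across a 64-wide wavefront). This is my own sketch, but it's the textbook pattern: no shared memory and no atomics, because all the lanes are guaranteed to be executing the same instruction together:

    #include <cstdio>

    // Inclusive prefix-sum across one 32-wide warp using cross-lane shuffles.
    __global__ void warp_scan(int *data) {
        int lane = threadIdx.x & 31;
        int x = data[threadIdx.x];
        for (int offset = 1; offset < 32; offset <<= 1) {
            int y = __shfl_up_sync(0xffffffff, x, offset);  // grab value from 'offset' lanes below
            if (lane >= offset) x += y;                     // lanes below 'offset' keep their value
        }
        data[threadIdx.x] = x;   // data[i] now holds data[0] + ... + data[i]
    }

    int main() {
        int *d;
        cudaMallocManaged(&d, 32 * sizeof(int));
        for (int i = 0; i < 32; ++i) d[i] = 1;
        warp_scan<<<1, 32>>>(d);
        cudaDeviceSynchronize();
        printf("d[0]=%d ... d[31]=%d\n", d[0], d[31]);   // 1 ... 32
        cudaFree(d);
        return 0;
    }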
Something like a raytracer (bounce these 64 simulated light paths around), with its need to organize which rays go where (compress into hit_array vs compress into miss_array), is fundamentally more efficient in the GPU/SIMD style of programming than in the heavy CPU-based threaded model.
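The "compress" step itself maps onto a single ballot at the warp level. A hedged sketch (hypothetical t / hit_array names, and a real raytracer obviously does far more work per ray), just to show the shape of it:

    #include <cstdio>

    // Warp-level stream compaction: each lane decides hit/miss, one ballot
    // tells every lane who else hit, and a popcount over the lanes below
    // gives each hitting lane its output slot -- no atomics inside the warp.
    __global__ void compact_hits(const float *t, float *hit_array, int *hit_count) {
        int lane = threadIdx.x & 31;
        bool hit = t[threadIdx.x] >= 0.0f;                   // "did my ray hit?"

        unsigned ballot = __ballot_sync(0xffffffff, hit);    // 1 bit per lane
        int my_slot = __popc(ballot & ((1u << lane) - 1));   // hits in lanes below me
        if (hit)
            hit_array[my_slot] = t[threadIdx.x];             // densely packed output
        if (lane == 0)
            *hit_count = __popc(ballot);                     // total hits in the warp
    }

    int main() {
        float *t, *hits; int *count;
        cudaMallocManaged(&t, 32 * sizeof(float));
        cudaMallocManaged(&hits, 32 * sizeof(float));
        cudaMallocManaged(&count, sizeof(int));
        for (int i = 0; i < 32; ++i)
            t[i] = (i % 3 == 0) ? (float)i : -1.0f;          // every 3rd "ray" hits
        compact_hits<<<1, 32>>>(t, hits, count);
        cudaDeviceSynchronize();
        printf("%d hits, first three: %.0f %.0f %.0f\n", *count, hits[0], hits[1], hits[2]);
        cudaFree(t); cudaFree(hits); cudaFree(count);
        return 0;
    }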
I should make it clearer - I'm looking to run code that's data-intensive, structurally MIMD but with some SIMD aspects. I'm trying to figure out the most cost-effective chip with which to do this.
The other question is whether chips like Esperanto's product have effective primitives for reduce operations, and how much general memory throughput they have compared to a GPU.