> Are there guidelines on how complex a program you can run on GPU cores and still reap benefit?
Arbitrarily complex. Many real-world GPGPU programs are split over multiple compute shaders / kernels. When implemented properly, they even run in parallel. This is critically important on Windows with DirectCompute because of the 2-second TDR timeout: when a single compute shader takes more than 2 seconds, the OS concludes the GPU has hung, resets the hardware, and reloads the driver.
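For example, a minimal CUDA sketch of the same idea (kernel body and slice size are made up): instead of one dispatch that grinds for seconds, split the work into slices and launch one kernel per slice, so each launch finishes well under the watchdog limit while the queued launches keep the GPU busy.

    #include <algorithm>
    #include <cuda_runtime.h>

    __global__ void process_slice(float *data, int slice_begin, int slice_end) {
        int i = slice_begin + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < slice_end)
            data[i] = data[i] * 2.0f + 1.0f;   // stand-in for the real per-element work
    }

    void process_all(float *d_data, int n) {
        const int slice = 1 << 20;             // tune so a single launch stays far below ~2 seconds
        for (int begin = 0; begin < n; begin += slice) {
            int end = std::min(begin + slice, n);
            int count = end - begin;
            process_slice<<<(count + 255) / 256, 256>>>(d_data, begin, end);
        }
        cudaDeviceSynchronize();               // launches queue up; one sync at the end is enough
    }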
> What are (vague, order of magnitude) limits on program size and size of the different memories you have access to?
Technically unlimited, because you can stream data from system memory, disk, or even the internet. Practically, most modern GPUs come with 6-12 GB of VRAM, and for optimal performance you want your data to fit there. When doing micro-optimizations, another important number is the amount of on-chip memory; the order of magnitude is 64 KB per core.
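If you want the exact numbers for your own card rather than the order-of-magnitude figures above, the CUDA runtime will report them (small sketch; the D3D side has its own caps queries):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("VRAM:                    %.1f GB\n", prop.totalGlobalMem / 1e9);
        printf("Shared memory per SM:    %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
        printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
        printf("SM (core) count:         %d\n", prop.multiProcessorCount);
        return 0;
    }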
> Can you do monte carlo simulations, decision trees etc?
Monte Carlo is a great fit for GPUs, just don’t forget to seed your RNG with SV_DispatchThreadID or an equivalent.
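A minimal sketch of what that looks like in CUDA (cuRAND here; names and sample counts are arbitrary). The global thread index plays the role of SV_DispatchThreadID and keys each thread to its own RNG subsequence:

    #include <curand_kernel.h>

    __global__ void estimate_pi(unsigned long long seed, int samples_per_thread,
                                unsigned long long *hits) {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

        curandState state;
        curand_init(seed, tid, 0, &state);     // same seed, distinct subsequence per thread

        unsigned long long local_hits = 0;
        for (int i = 0; i < samples_per_thread; ++i) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) ++local_hits;
        }
        atomicAdd(hits, local_hits);           // pi ≈ 4 * hits / (total_threads * samples_per_thread)
    }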
Trees are tricky. GPUs share an instruction decoder and instruction pointer across 32-64 hardware threads, so a straightforward implementation of binary trees is going to be suboptimal due to divergence of those threads. It is often possible to do something else instead.
> Trees are tricky. GPUs share an instruction decoder and instruction pointer across 32-64 hardware threads, so a straightforward implementation of binary trees is going to be suboptimal due to divergence of those threads. It is often possible to do something else instead.
The GPU-straightforward implementation is to use a SIMD-stack / SIMD-queue. I don't know its proper term, but it's "obvious" to anyone who programs GPUs.
The following is probably wrong, but hopefully correct enough to demonstrate the idea...
active_lane_workgroup() {
    // execution_mask is 1 for lanes currently executing in this wavefront, 0 otherwise.
    // An exclusive prefix sum of it gives each active lane its rank inside the wavefront;
    // it is available at assembly level in one clock tick in both NVidia and AMD GPUs.
    wavefront_activelane = exclusive_prefix_sum(execution_mask);
    lanes_in_wavefront = wavefront_reduce_max(wavefront_activelane) + 1; // active lanes in this wavefront

    // Each wavefront publishes its active-lane count, then an exclusive prefix sum
    // over those counts gives every wavefront its base offset within the workgroup.
    __shared__ int active_per_wavefront[wavefronts_in_workgroup];
    if (wavefront_activelane == 0) {
        active_per_wavefront[wavefront_id] = lanes_in_wavefront;
    }
    __syncthreads();
    workgroup_base = exclusive_prefix_sum_over(active_per_wavefront, wavefront_id); // counts of preceding wavefronts
    return workgroup_base + wavefront_activelane;
}

SIMD_push(stack, data):
    my_slot = active_lane_workgroup();
    pushed = workgroup_reduce_max(my_slot) + 1;   // how many lanes take part in this push
    stack[stack.ptr + my_slot] = data;
    __syncthreads();
    if (my_slot == 0) {
        stack.ptr += pushed;
    }
    __syncthreads();

SIMD_pop(stack):
    my_slot = active_lane_workgroup();
    popped = workgroup_reduce_max(my_slot) + 1;   // how many lanes take part in this pop
    toReturn = stack[stack.ptr - 1 - my_slot];
    __syncthreads();
    if (my_slot == 0) {
        stack.ptr -= popped;                      // pop moves the pointer down, not up
    }
    __syncthreads();
    return toReturn;
------
Now that you have a stack, it's simply a matter of pushing your DFS frontier onto the SIMD-stack and popping it off to traverse.
The traversal is somewhat BFS-like, because on a 1024-wide workgroup your threads will visit the top 1024 items of the stack each step. But overall the stack behaves in a DFS manner.
The CUDA cub library (and the ROCm hipCUB / rocPRIM libraries) implements these horizontal operations (glorified prefix sums). But it's not too hard to write a workgroup prefix-sum or prefix-max yourself. (Indeed, I suggest beginners write their own prefix-sum to get a feel for how simple and efficient the prefix-sum / scan operations can be.)
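For reference, here's roughly what the library version of a workgroup prefix-sum looks like with cub (a sketch; block size and data layout are arbitrary):

    #include <cub/cub.cuh>

    __global__ void block_prefix_sum(const int *in, int *out) {
        // One scan across the 256 threads of the block ("workgroup" in compute-shader terms).
        typedef cub::BlockScan<int, 256> BlockScan;
        __shared__ typename BlockScan::TempStorage temp_storage;

        int value = in[blockIdx.x * blockDim.x + threadIdx.x];
        BlockScan(temp_storage).ExclusiveSum(value, value);   // value becomes this thread's exclusive prefix sum
        out[blockIdx.x * blockDim.x + threadIdx.x] = value;
    }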
--------
The "pattern" is that you use horizontal operations and __syncthreads() for thread synchronization and communication. In effect, this SIMD-stack is performing load-balancing simultaneously with the DFS. That is to say, it "reaches breadth-wise" and pulls extra nodes to visit when more lanes are available, while it prefers depth to minimize memory usage.
> The GPU-straightforward implementation is to use a SIMD-stack / SIMD-queue
I agree. I haven't personally done that exact thing because wave intrinsics require D3D 12.2 (or new enough CUDA, as in your example), but I did similar stuff with group shared memory, which is more compatible and only requires feature level 11.0 hardware.
However, that GPU-straightforward implementation is not that straightforward for programmers with a CPU background who are starting to use GPUs. Even simpler things like reductions (a dot product, or matrix*vector in linear algebra) are rather tricky to implement efficiently on GPUs, and very different from even manually-vectorized SIMD CPU code.
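For a flavour of what "tricky" means here, this is the plain shared-memory version of a dot product in CUDA (not the fastest possible one, and the names are mine):

    __global__ void dot_kernel(const float *a, const float *b, int n, float *result) {
        __shared__ float partial[256];                 // assumes blockDim.x == 256

        // Grid-stride loop: each thread accumulates its own strided slice of the vectors.
        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            sum += a[i] * b[i];
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction in shared memory: halve the number of active threads each step.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(result, partial[0]);             // one atomic per block combines the partial sums
    }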
Oh, and CUDA's "cooperative groups" are a nice abstraction for this "Execution-mask" handling.
But if you read older PRAM stuff from the 1980s or 1990s, they talk about "execution masks" and manipulating them directly like this. So it's helpful to know how this older technique relates to modern APIs like cooperative groups.
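Concretely, the cooperative-groups way to get at "the active lanes" looks something like this (a sketch of the usual aggregated-atomic stream compaction; function and variable names are mine):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void compact_positive(const int *in, int *out, int *out_count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0) {                       // divergent branch: only some lanes get here
            // coalesced_threads() wraps exactly the lanes that took the branch,
            // i.e. the current execution mask, without touching it by hand.
            cg::coalesced_group active = cg::coalesced_threads();
            int base = 0;
            if (active.thread_rank() == 0)              // one lane reserves space for the whole group...
                base = atomicAdd(out_count, (int)active.size());
            base = active.shfl(base, 0);                // ...and broadcasts the offset to the others
            out[base + active.thread_rank()] = in[i];   // thread_rank() is the "active lane index"
        }
    }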
Pretty much everything compute-bound.
SpaceX presentation about rocket engines: https://www.youtube.com/watch?v=vYA0f6R5KAI
Unreal Engine 5 presentation about rendering many millions of small triangles using compute shaders: https://www.youtube.com/watch?v=TMorJX3Nj6U