> Are there guidelines on how complex a program you can run on GPU cores and still reap benefit?
Arbitrarily complex. Many real-world GPGPU programs are split over multiple compute shaders / kernels. When implemented properly, they even run in parallel. This is critically important on Windows with DirectCompute because of the 2-second TDR timeout: when a single compute shader takes more than 2 seconds, the OS concludes the GPU has hung, resets the hardware, and reloads the driver.
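For example, a minimal CUDA sketch of the same idea (kernel body and slice size are made up): instead of one dispatch that grinds for seconds, split the work into slices and launch one kernel per slice, so each launch finishes well under the watchdog limit while the queued launches keep the GPU busy.

    #include <algorithm>
    #include <cuda_runtime.h>

    __global__ void process_slice(float *data, int slice_begin, int slice_end) {
        int i = slice_begin + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < slice_end)
            data[i] = data[i] * 2.0f + 1.0f;   // stand-in for the real per-element work
    }

    void process_all(float *d_data, int n) {
        const int slice = 1 << 20;             // tune so a single launch stays far below ~2 seconds
        for (int begin = 0; begin < n; begin += slice) {
            int end = std::min(begin + slice, n);
            int count = end - begin;
            process_slice<<<(count + 255) / 256, 256>>>(d_data, begin, end);
        }
        cudaDeviceSynchronize();               // launches queue up; one sync at the end is enough
    }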
> What are (vague, order of magnitude) limits on program size and size of the different memories you have access to?
Technically unlimited, because you can stream data from system memory, disk, or even the internet. Practically, most modern GPUs come with 6-12 GB of VRAM, and for optimal performance you want your data to fit there. When doing micro-optimizations, another important number is the amount of on-chip memory; the order of magnitude is 64 KB per core.
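If you want the exact numbers for your own card rather than the order-of-magnitude figures above, the CUDA runtime will report them (small sketch; the D3D side has its own caps queries):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("VRAM:                    %.1f GB\n", prop.totalGlobalMem / 1e9);
        printf("Shared memory per SM:    %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
        printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
        printf("SM (core) count:         %d\n", prop.multiProcessorCount);
        return 0;
    }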
> Can you do monte carlo simulations, decision trees etc?
Monte Carlo is a great fit for GPUs, just don’t forget to seed your RNG with SV_DispatchThreadID or an equivalent.
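A minimal sketch of what that looks like in CUDA (cuRAND here; names and sample counts are arbitrary). The global thread index plays the role of SV_DispatchThreadID and keys each thread to its own RNG subsequence:

    #include <curand_kernel.h>

    __global__ void estimate_pi(unsigned long long seed, int samples_per_thread,
                                unsigned long long *hits) {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

        curandState state;
        curand_init(seed, tid, 0, &state);     // same seed, distinct subsequence per thread

        unsigned long long local_hits = 0;
        for (int i = 0; i < samples_per_thread; ++i) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) ++local_hits;
        }
        atomicAdd(hits, local_hits);           // pi ≈ 4 * hits / (total_threads * samples_per_thread)
    }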
Trees are tricky. GPUs share an instruction decoder and instruction pointer across 32-64 hardware threads, so a straightforward implementation of binary trees is going to be suboptimal due to divergence of those threads. It is often possible to do something else instead.
> Trees are tricky. GPUs share an instruction decoder and instruction pointer across 32-64 hardware threads, so a straightforward implementation of binary trees is going to be suboptimal due to divergence of those threads. It is often possible to do something else instead.
The GPU-straightforward implementation is to use a SIMD-stack / SIMD-queue. I don't know its proper term, but it's "obvious" to anyone who programs GPUs.
The following is probably wrong, but hopefully correct enough to demonstrate the idea...
active_lane_workgroup() {
    // execution_mask is 1 for lanes currently executing in this wavefront, 0 otherwise.
    // An exclusive prefix sum of it gives each active lane its rank inside the wavefront;
    // it is available at assembly level in one clock tick in both NVidia and AMD GPUs.
    wavefront_activelane = exclusive_prefix_sum(execution_mask);
    lanes_in_wavefront = wavefront_reduce_max(wavefront_activelane) + 1; // active lanes in this wavefront

    // Each wavefront publishes its active-lane count, then an exclusive prefix sum
    // over those counts gives every wavefront its base offset within the workgroup.
    __shared__ int active_per_wavefront[wavefronts_in_workgroup];
    if (wavefront_activelane == 0) {
        active_per_wavefront[wavefront_id] = lanes_in_wavefront;
    }
    __syncthreads();
    workgroup_base = exclusive_prefix_sum_over(active_per_wavefront, wavefront_id); // counts of preceding wavefronts
    return workgroup_base + wavefront_activelane;
}

SIMD_push(stack, data):
    my_slot = active_lane_workgroup();
    pushed = workgroup_reduce_max(my_slot) + 1;   // how many lanes take part in this push
    stack[stack.ptr + my_slot] = data;
    __syncthreads();
    if (my_slot == 0) {
        stack.ptr += pushed;
    }
    __syncthreads();

SIMD_pop(stack):
    my_slot = active_lane_workgroup();
    popped = workgroup_reduce_max(my_slot) + 1;   // how many lanes take part in this pop
    toReturn = stack[stack.ptr - 1 - my_slot];
    __syncthreads();
    if (my_slot == 0) {
        stack.ptr -= popped;                      // pop moves the pointer down, not up
    }
    __syncthreads();
    return toReturn;
------
Now that you have a stack, it's simply a matter of pushing your DFS frontier onto the SIMD-stack and popping it off to traverse.
The traversal is somewhat BFS-like, because on a 1024-wide workgroup your threads will visit the top 1024 items of the stack each step. But overall the stack behaves in a DFS manner.
The CUDA cub library (and the ROCm hipCUB / rocPRIM libraries) implements these horizontal operations (glorified prefix sums). But it's not too hard to write a workgroup prefix-sum or prefix-max yourself. (Indeed, I suggest beginners write their own prefix-sum to get a feel for how simple and efficient the prefix-sum / scan operations can be.)
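For reference, here's roughly what the library version of a workgroup prefix-sum looks like with cub (a sketch; block size and data layout are arbitrary):

    #include <cub/cub.cuh>

    __global__ void block_prefix_sum(const int *in, int *out) {
        // One scan across the 256 threads of the block ("workgroup" in compute-shader terms).
        typedef cub::BlockScan<int, 256> BlockScan;
        __shared__ typename BlockScan::TempStorage temp_storage;

        int value = in[blockIdx.x * blockDim.x + threadIdx.x];
        BlockScan(temp_storage).ExclusiveSum(value, value);   // value becomes this thread's exclusive prefix sum
        out[blockIdx.x * blockDim.x + threadIdx.x] = value;
    }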
--------
The "pattern" is that you use horizontal operations and __syncthreads() for thread synchronization and communication. In effect, this SIMD-stack is performing load-balancing simultaneously with the DFS. That is to say, it "reaches breadth-wise" and pulls extra nodes to visit when more lanes are available, while it prefers depth to minimize memory usage.
> The GPU-straightforward implementation is to use a SIMD-stack / SIMD-queue
I agree. I haven't personally done that exact thing because wave intrinsics require D3D 12.2 (or new enough CUDA, as in your example), but I did similar stuff with group shared memory, which is more compatible and only requires feature level 11.0 hardware.
However, that GPU-straightforward implementation is not that straightforward for programmers with a CPU background who are starting to use GPUs. Even simpler things like reductions (a dot product, or matrix*vector in linear algebra) are rather tricky to implement efficiently on GPUs, and very different from even manually-vectorized SIMD CPU code.
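For a flavour of what "tricky" means here, this is the plain shared-memory version of a dot product in CUDA (not the fastest possible one, and the names are mine):

    __global__ void dot_kernel(const float *a, const float *b, int n, float *result) {
        __shared__ float partial[256];                 // assumes blockDim.x == 256

        // Grid-stride loop: each thread accumulates its own strided slice of the vectors.
        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            sum += a[i] * b[i];
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction in shared memory: halve the number of active threads each step.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(result, partial[0]);             // one atomic per block combines the partial sums
    }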
Oh, and CUDA's "cooperative groups" are a nice abstraction for this "Execution-mask" handling.
But if you read older PRAM stuff from the 1980s or 1990s, they talk about "execution masks" and manipulating them directly like this. So it's helpful to know how this older technique relates to modern APIs like cooperative groups.
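Concretely, the cooperative-groups way to get at "the active lanes" looks something like this (a sketch of the usual aggregated-atomic stream compaction; function and variable names are mine):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void compact_positive(const int *in, int *out, int *out_count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0) {                       // divergent branch: only some lanes get here
            // coalesced_threads() wraps exactly the lanes that took the branch,
            // i.e. the current execution mask, without touching it by hand.
            cg::coalesced_group active = cg::coalesced_threads();
            int base = 0;
            if (active.thread_rank() == 0)              // one lane reserves space for the whole group...
                base = atomicAdd(out_count, (int)active.size());
            base = active.shfl(base, 0);                // ...and broadcasts the offset to the others
            out[base + active.thread_rank()] = in[i];   // thread_rank() is the "active lane index"
        }
    }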
Pretty much everything compute-bound.
SpaceX presentation about rocket engines: https://www.youtube.com/watch?v=vYA0f6R5KAI
Unreal Engine 5 presentation about rendering many millions of small triangles using compute shaders: https://www.youtube.com/watch?v=TMorJX3Nj6U