Maybe I’m misunderstanding why branching is slow on a GPU. My understanding was ...

dahart · 2025-02-10T15:44:10 1739202250

The primary concern is usually the masking you’re talking about: it proportionally cuts down the number of threads doing useful work. Using Nvidia terminology, if only one thread in a warp is active during a branch, that warp runs at 1/32 of the throughput it would get with all 32 threads active.
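To put rough numbers on the masking cost, here is a toy model in Python. It is a sketch, not a description of real hardware scheduling: it assumes a 32-wide warp, assumes both sides of a divergent if/else run back to back with inactive threads masked off, and uses made-up cycle costs. The function name `divergent_branch_cost` is hypothetical.

```python
WARP_SIZE = 32  # Nvidia warp width; other vendors use different widths


def divergent_branch_cost(takes_if, if_cost, else_cost):
    """Toy model of one warp executing an if/else under thread masking.

    takes_if  -- one bool per thread: True if that thread takes the 'if' side
    if_cost   -- assumed cycles to run the 'if' side once for the warp
    else_cost -- assumed cycles to run the 'else' side once for the warp

    Returns (cycles, utilization), where utilization is the fraction of
    thread-cycles that did useful work.
    """
    assert len(takes_if) == WARP_SIZE
    n_if = sum(takes_if)
    n_else = WARP_SIZE - n_if

    cycles = 0
    useful = 0
    # A side is only executed if at least one thread in the warp needs it.
    if n_if:
        cycles += if_cost
        useful += n_if * if_cost
    if n_else:
        cycles += else_cost
        useful += n_else * else_cost

    if cycles == 0:
        return 0, 1.0  # nothing executed, so nothing was wasted
    return cycles, useful / (cycles * WARP_SIZE)


# Uniform branch: every thread takes the same side, no lanes are wasted.
uniform = divergent_branch_cost([True] * WARP_SIZE, if_cost=100, else_cost=100)

# Worst case: one thread does the work while the other 31 lanes sit masked.
lonely = divergent_branch_cost([True] + [False] * 31, if_cost=100, else_cost=0)

# Even split: both sides run, so the warp pays for both and half the lanes idle.
split = divergent_branch_cost([True] * 16 + [False] * 16, if_cost=100, else_cost=100)
```

In this model the uniform case has utilization 1.0, the one-thread case has 1/32, and the even split takes twice the cycles at 0.5 utilization.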

Not all GPU branches are compiled into straight-line code without jumps, so branching on a GPU does sometimes incur the same instruction cache churn that a CPU has. That might be less of a big deal than thread masking, but GPU stalls still take dozens of cycles. And a GPU waiting on a memory load, whether it’s to fill the icache or anything else, can be up to 32x more costly than a CPU stall, since all threads in the warp stall together.

Loops are just normal branches if they’re not unrolled. The biggest question with a loop is whether all threads will iterate the same number of times, because if not, the threads that exit the loop early have to wait until the last thread is done. You can imagine what that might do for perf if there’s a small number of long-tail threads.
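The long-tail effect can be sketched the same way. This is a toy model with assumptions of my own: every iteration costs the same, the warp keeps issuing the loop body until its slowest thread finishes, and the trip counts are invented for illustration. The function name `warp_loop_cost` is hypothetical.

```python
def warp_loop_cost(trip_counts):
    """Toy model of a non-unrolled loop with per-thread trip counts.

    trip_counts -- number of iterations each thread in the warp needs

    Returns (warp_iterations, utilization). The warp runs as many
    iterations as its slowest thread; threads that finished early are
    masked off and contribute no useful work for the remainder.
    """
    warp_iterations = max(trip_counts)
    if warp_iterations == 0:
        return 0, 1.0  # the loop body never ran
    useful = sum(trip_counts)
    return warp_iterations, useful / (warp_iterations * len(trip_counts))


# Every thread loops 10 times: no waiting, full utilization.
balanced = warp_loop_cost([10] * 32)

# 31 threads loop 10 times, one long-tail thread loops 1000 times.
long_tail = warp_loop_cost([10] * 31 + [1000])
```

Under these assumptions the long-tail warp runs 1000 iterations instead of 10, and only about 4% of its thread-cycles (1310 out of 32000) do useful work.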