It's incredibly useful if you have many threads that produce a variable number of outputs. Imagine you're implementing a filtering operation on the GPU: each thread takes on a fixed workload and then produces some variable number of results. Unless we take precautions, we have a huge synchronization problem when all threads try to append their results to the output. Note that GPUs didn't have atomics for the first couple of generations that supported CUDA, so you couldn't simply getAndIncrement a shared index and append to an array. We could store the outputs in a dense structure by allocating a fixed number of output slots per thread, but that would leave many gaps between the results. Instead, once we know the number of outputs per thread, we can use a prefix sum to tell every thread exactly where it can write its results in the output array.
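As a small sketch of the idea (names and data here are illustrative, not from any particular GPU API): an exclusive prefix sum over the per-thread output counts gives every thread its starting write index in the packed output array, with no gaps and no contention.

```python
from itertools import accumulate

# Outputs produced per thread (example data)
counts = [3, 0, 2, 1, 4]

# Exclusive prefix sum: offsets[i] = sum(counts[:i])
offsets = [0] + list(accumulate(counts))[:-1]
total = sum(counts)  # size of the packed output array

print(offsets)  # [0, 3, 3, 5, 6] -> each thread's first write slot
print(total)    # 10
```

Thread `i` then writes its `counts[i]` results to slots `offsets[i]` through `offsets[i] + counts[i] - 1`, so no two threads ever touch the same slot.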
The outcome of a prefix sum exactly corresponds with the "row starts" part of the CSR sparse matrix notation. So they are also essential when creating sparse matrices.
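The same computation, sketched for CSR construction (variable names are illustrative): a prefix sum over the per-row nonzero counts, with a leading zero, yields the row-starts array directly.

```python
from itertools import accumulate

# Nonzeros in each row of the matrix (example data)
nnz_per_row = [2, 0, 3, 1]

# Prefix sum with a leading 0 gives the CSR row starts;
# row i's entries occupy values[row_starts[i]:row_starts[i+1]]
row_starts = [0] + list(accumulate(nnz_per_row))

print(row_starts)  # [0, 2, 2, 5, 6]
```

The final entry equals the total number of nonzeros, which is why the row-starts array has one more element than the matrix has rows.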