This is brilliant stuff, and has held up very well. The PRAM model seems a little dated by modern standards (it doesn't accurately model any actual parallel computer we'd want to build today), but the insight that there is a lot of parallelism, and a lot of flexibility in how to extract it, remains solid. Prefix sum (or parallel scan) is one of the core algorithms that works well on GPUs.
Another great thing about this work is showing how flexible the generalizations of "sum" can be. A lot of problems can be restated in terms of an associative operator, and doing so unlocks all that parallel goodness.
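For a toy illustration of that generalization (not code from the paper), C++17's std::inclusive_scan accepts an arbitrary binary operator; swap + for max and the scan computes a running maximum instead. Associativity is the only property the parallel schedules depend on:

    // Sketch: "sum" generalized to any associative operator.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> a{3, 1, 7, 0, 4, 1, 6, 3};
        std::vector<int> out(a.size());
        // Running maximum: max is associative, so it scans just like +.
        std::inclusive_scan(a.begin(), a.end(), out.begin(),
                            [](int x, int y) { return std::max(x, y); });
        for (int v : out) printf("%d ", v);  // 3 3 7 7 7 7 7 7
        printf("\n");
    }

The same machinery handles min, logical OR, matrix multiplication, and so on, which is what makes the restatement trick so broadly useful.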
Basically a GPU, as that's the highly parallel machine you're most likely to program any time soon. I'm not sure there's anything quite like PRAM in the sense that it's a mathematically abstract model that nonetheless captures something interesting about performance, but Volkov's thesis is a pretty good start for modeling both the memory and ALU cost of a GPU computation:
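To give a flavor of that style of modeling (the hardware numbers below are invented, not Volkov's): a scan reads and writes each element once while doing roughly one add per element, so on any plausible GPU the memory term dominates:

    // Back-of-envelope roofline-style estimate with made-up GPU specs.
    #include <cstdio>

    int main() {
        const double n     = 1e8;        // elements scanned
        const double bytes = n * 2 * 4;  // one read + one write of 4-byte ints
        const double flops = n;          // ~1 add per element
        const double bw    = 900e9;      // hypothetical bytes/s of device memory
        const double alu   = 10e12;      // hypothetical adds/s
        // Whichever limit is slower dominates; here the scan is memory-bound.
        printf("mem-limited: %.3f ms, alu-limited: %.3f ms\n",
               bytes / bw * 1e3, flops / alu * 1e3);
    }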
Parallel prefix sum is the most underappreciated parallel algorithm in my opinion, and this paper is the best explanation and visualization of the concept I've seen.
A few years ago I worked on a deep learning project using parallel prefix sum as a new way to accelerate recurrent neural nets on GPUs[0]. The paper in this post was the most important reference and source of inspiration. I'm happy to see this paper shared on HN in hopes that it also sparks ideas in others.
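The core trick, in sketch form (this is the general idea, much simplified relative to a real RNN kernel): a step of a linear recurrence h_t = a_t * h_{t-1} + b_t is an affine function of h_{t-1}, and composing affine functions is associative, so the whole recurrence becomes a scan over (a, b) pairs:

    #include <cstdio>
    #include <numeric>
    #include <vector>

    // One recurrence step h -> a*h + b, stored as the pair (a, b).
    struct Step { float a, b; };

    // Compose "f then g": g.a*(f.a*h + f.b) + g.b. Associative, hence scannable.
    Step combine(Step f, Step g) {
        return {f.a * g.a, f.b * g.a + g.b};
    }

    int main() {
        std::vector<Step> steps{{0.5f, 1.0f}, {2.0f, -1.0f}, {1.0f, 0.5f}};
        std::vector<Step> prefix(steps.size());
        std::inclusive_scan(steps.begin(), steps.end(), prefix.begin(), combine);
        float h0 = 0.0f;
        for (auto s : prefix)
            printf("%g ", s.a * h0 + s.b);  // hidden state after each step: 1 1 1.5
        printf("\n");
    }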
I used to program on the Connection Machine. I tried to do some work recently with the Intel vector instructions and was quite frustrated by the lack of scans. We used them for _everything_.
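For what it's worth, a 4-lane inclusive prefix sum can be emulated in SSE with two shift-and-add steps (a log-step scan within one register), though that's a poor substitute for a real scan instruction:

    #include <immintrin.h>
    #include <cstdio>

    // Hillis-Steele-style scan inside a single __m128i: each step adds the
    // register to a copy of itself shifted up by 1, then 2, lanes.
    __m128i prefix_sum_epi32(__m128i x) {
        x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  // shift by one 32-bit lane
        x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  // shift by two lanes
        return x;
    }

    int main() {
        int out[4];
        __m128i v = prefix_sum_epi32(_mm_setr_epi32(3, 1, 7, 0));
        _mm_storeu_si128((__m128i*)out, v);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 3 4 11 11
    }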
I implemented NVIDIA’s parallel prefix scan as an Apple Metal compute shader a while back, for this GPU-based ‘blobby’ thing. [1]
The marching cubes algorithm I was implementing generated a lot of sparse datasets, and I wanted to send only the vertices that were going to be rendered. Got a lot of speedup when all was said and done, and learning this algorithm (and the parallel way of thinking in general) was eye-opening.
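The general pattern is stream compaction; a sequential sketch of the idea (not the actual Metal shader): flag the elements to keep, exclusive-scan the flags to get each survivor's write index, then scatter. On the GPU every one of these steps is data-parallel:

    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        // Keep only the nonzero entries of a sparse array.
        std::vector<int> data{5, 0, 0, 8, 0, 3, 0, 9};
        std::vector<int> flags(data.size()), offsets(data.size());
        for (size_t i = 0; i < data.size(); ++i) flags[i] = data[i] != 0;
        // Exclusive scan of the keep-flags yields each survivor's output slot.
        std::exclusive_scan(flags.begin(), flags.end(), offsets.begin(), 0);
        std::vector<int> packed(offsets.back() + flags.back());
        for (size_t i = 0; i < data.size(); ++i)
            if (flags[i]) packed[offsets[i]] = data[i];
        for (int v : packed) printf("%d ", v);  // 5 8 3 9
        printf("\n");
    }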
As noted in a previous discussion [0], the first example contains a typo: the sequence "[3 4 11 11 14 16 22 25]" should be "[3 4 11 11 15 16 22 25]". Not a big deal, but it was enough to make me wonder whether I was misunderstanding the algorithm.
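Easy to check, since the corrected output pins down the input (taking successive differences gives [3 1 7 0 4 1 6 3]):

    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> in{3, 1, 7, 0, 4, 1, 6, 3};
        std::vector<int> out(in.size());
        std::inclusive_scan(in.begin(), in.end(), out.begin());
        for (int v : out) printf("%d ", v);  // 3 4 11 11 15 16 22 25
        printf("\n");
    }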
Thanks to the timeless nature of the PDF format (and the supporting applications), I'm able to read digital content created three decades ago. What also blows my mind is that virtually everything about computers has changed over those decades, and yet here I am, reading this brilliant paper.
There is something about the PDF format that gives me real joy just looking at the beautifully rendered content -- the letters and the pictures. For me, no other format comes close to PDF. In the physical world, Springer books are a good match, but that's probably because they are typeset for PDF too.