Cool, this definitely seems like a good enumeration of techniques, nice to see that they discuss stuff like kernel fission as well. Having a good understanding of loop nest optimization transformations (tiling, fission, fusion, strip mining/sinking, iteration order changes, etc) provides a good vocabulary for talking about this stuff too.
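To make that vocabulary concrete, here's a minimal sketch (mine, with hypothetical arrays a, b, c) of what fusion and strip mining look like on a simple loop:

    void transform_examples(float* a, const float* b, float* c, int n) {
      // Two separate loops over the same range (candidates for fusion):
      for (int i = 0; i < n; ++i) a[i] = b[i] * 2.0f;
      for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

      // After loop fusion: a single pass over the data, better locality for a and b.
      for (int i = 0; i < n; ++i) {
        a[i] = b[i] * 2.0f;
        c[i] = a[i] + b[i];
      }

      // Strip mining: split one loop into fixed-size chunks; this is also the
      // first step of tiling (the chunk becomes the unit mapped to a cache
      // tile or a thread block).
      const int kTile = 256;
      for (int ii = 0; ii < n; ii += kTile)
        for (int i = ii; i < ii + kTile && i < n; ++i)
          c[i] = a[i] + b[i];
    }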
As someone who has spent 80%+ of their time CUDA programming for the past 9 years (I wrote the original GPU PyTorch tensor library, the Faiss GPU library, and several things that Nvidia took and put into cuDNN), I found the most instructive, short yet "advanced" education on the subject to be Paulius Micikevicius' various slide decks on "Performance Optimization"; e.g.:
https://on-demand.gputechconf.com/gtc/2013/presentations/S34...
(there are some other outstanding ones; I think one was for the Volta architecture as well)
They're old but still very relevant to today's GPUs.
Do you have any career advice for someone deeply interested in breaking into high performance GPU programming? I find resources like these, and projects like OpenAI's Triton compiler or MIMD-on-GPU, so incredibly interesting.
But I have no idea who employs those skills! Beyond scientific HPC groups or ML research teams anyway - I doubt they’d accept someone without a PhD.
My current gameplan is getting through “Professional CUDA C programming” and various computer architecture textbooks, and seeing if that’s enough.
Given that CUDA's main focus has been C++ since CUDA 3.0 (ignoring the other languages that target PTX for now), I'm not sure that 2014 book is the right approach to learn CUDA.
Can you elaborate a bit on how C++ affects the programming model? Isn't CUDA just a variant of C? I presume the goal is not to run standard C++? Also, as I understand it, PTX is an IR, so I'm not sure how it can be compared to C/C++.
Not at all, unless we are speaking of CUDA versions before 3.0.
CUDA is a polyglot programming model for NVidia GPUs, with first-party support for C, C++, Fortran, and anything else that can target PTX bytecode.
PTX also lets many other languages target CUDA in some form via their own toolchains, with .NET, Java, Haskell, Julia, and Python all having some kind of NVidia-sponsored implementation.
While originally CUDA had its own hardware memory model, NVidia decided to make it follow C++11 memory semantics and went through a decade of hardware redesign to make it possible.
- CppCon 2017: Olivier Giroux "Designing (New) C++ Hardware"
This is why OpenCL kind of lost the race: it focused too much on its C dialect and only went polyglot when it was too late for the research community to care.
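To make the C++11-memory-semantics point concrete, here is a minimal sketch (mine, not from the talk) using libcu++'s <cuda/atomic>, which exposes C++-style atomics with explicit memory orderings and GPU thread scopes:

    // Toy release/acquire handshake between two blocks (assumes both blocks
    // are co-resident; a sketch, not production code).
    #include <cuda/atomic>
    #include <cstdio>

    __global__ void producer_consumer(
        cuda::atomic<int, cuda::thread_scope_device>* flag, int* data) {
      if (blockIdx.x == 0 && threadIdx.x == 0) {
        *data = 42;                                        // plain store
        flag->store(1, cuda::std::memory_order_release);   // publish it
      } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        // Acquire pairs with the release above, so the write to *data
        // is guaranteed to be visible once the flag reads as 1.
        while (flag->load(cuda::std::memory_order_acquire) != 1) { }
        printf("consumer saw %d\n", *data);
      }
    }

    int main() {
      cuda::atomic<int, cuda::thread_scope_device>* flag;
      int* data;
      cudaMalloc(&flag, sizeof(*flag));
      cudaMalloc(&data, sizeof(*data));
      cudaMemset(flag, 0, sizeof(*flag));   // zero-init; the atomic wraps a plain int
      producer_consumer<<<2, 32>>>(flag, data);
      cudaDeviceSynchronize();
      cudaFree(flag);
      cudaFree(data);
    }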
To those who are interested: "Programming Massively Parallel Processors: A Hands-on Approach" is a great book to learn CUDA programming, and it talks mostly about performance because, after all, GPUs are about speed.
Unlike typical programming books, it spends a lot of time on how GPUs work and how the techniques it introduces fit into that picture. It's interesting even if you are just curious how an (NVIDIA) GPU works at the code level. Strongly recommended.
I bought the first edition when it came out, and it was definitely a gold mine of information on the subject. I wonder, though, whether the fourth edition is worth buying another copy. Nvidia has been advancing CUDA, in particular moving more towards C++ in the kernel language, but none of that was present when this book came out in 2007. Now more and more is happening at the thread block level with the cooperative groups C++ API and at the warp level for tensor cores. It would be great if the authors revisited all the early chapters to modernize that content, but that's a lot of work, so I don't usually count on authors making such an effort for later editions.
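(As an aside, for anyone who hasn't seen the cooperative groups style mentioned above, here is a rough sketch of mine, not from the book: a warp-level sum using a 32-thread tile.)

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void block_sum(const float* in, float* out, int n) {
      cg::thread_block block = cg::this_thread_block();
      cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

      int i = blockIdx.x * blockDim.x + threadIdx.x;
      float v = (i < n) ? in[i] : 0.0f;

      // Shuffle-based reduction within the 32-thread tile.
      for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        v += warp.shfl_down(v, offset);

      // Lane 0 of each warp accumulates into the global result.
      if (warp.thread_rank() == 0)
        atomicAdd(out, v);
    }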
I also read the older edition and got the 4th for the second read recently.
I felt that the updated coverage is more on the GPU side than the language side.
It covers new GPU features and architectures well. I don't think it covers Tensor core things. But I might be wrong.
So it's worth the update if you're interested in general NVIDIA GPU evolution.
it's true - out of all of the "LEARN CUDA IN 24 HOURS" books, this is the best one. to be fair, it isn't really one of those books - it's a textbook - but at first glance it resembles them (at least the color scheme and the title led me astray when i first found it).
Does anybody have an idea on how to get in to Metal programming (as in Apple Metal)? I'd love to mess around a little with this on iOS and macOS while learning about tile-based rendering, but I have trouble locating educational written material.
There's a book (https://metalbyexample.com/the-book/), but the author has put up a note that it's quite out of date. It seems the most up-to-date information is available in the WWDC videos (regarding e.g. Metal 3), but I'd really prefer something written. And Apple's documentation reads more like a reference material and is quite confusing when starting out.
(+1) I'm a newb to Metal myself, and I wanted to use Swift as the driving language (which was a main selling point for me). Unfortunately, almost all the material is in Objective-C.
If people like GPU programming, I wrote a blog post this week about GPU-accelerated hashmaps, semi-provocatively titled "Can we 10x Rust hashmap throughput?".
I've been looking into getting into GPU programming, starting with CS344 (https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...) on Udacity. I'm curious to hear from some of the more seasoned GPU veterans out there: what other resources would be good to take a look at after finishing the videos and assignments?
If you want to go really in-depth, I can recommend GTC on demand. It's Nvidia's streaming platform with videos from past GTC conferences. Tony Scuderio had a couple of videos on there called "GPU Memory Bootcamp" that are among the best advanced GPU programming learning material out there.
100% this. You can find all kinds of detailed topics, like CUDA graphs, memory layout optimization, optimizing storage access, etc. https://www.nvidia.com/en-us/on-demand/. They have "playlists" for things like HPC or development tools that collect the most popular videos on those topics.
Partly related, I believe, so perhaps someone can help. Whole theses have been written on prefix sum algorithms, and I never got it. Perhaps someone kind can give some convincing examples of their advantages.
Not speaking to their implementation, but prefix sums/scans are simply a very useful primitive tool for parallelizing many otherwise sequential operations. For instance, appending a variable number of items per worker to a shared coalesced buffer uses an exclusive prefix sum. This is probably the most common use case for them in practical programming. They can also be used to partition work across parallel workers (segmented prefix scans).
In lieu of pointer chasing, hashing and the like, parallel operations on flat arrays are the way to maximize GPU utilization.
It's used in one of the fastest sorting approaches - counting sort / binning - to compute the location of where to store the sorted/binned items. First you count the number of items per bin, then you use prefix-sums to compute the memory location of each bin, then you insert the items into the respective bins. Some radix-sort implementations also utilize counting sort under the hood, and therefore prefix-sums. (Not sure if all radix-sort implementations need it)
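A sketch of mine (not the parent's code) of that middle step using CUB's device-wide scan: per-bin counts go in, the starting offset of each bin comes out.

    // E.g. counts  = [3, 1, 0, 4]
    //      offsets = [0, 3, 4, 4]   (bin 0 starts at index 0, bin 1 at 3, bin 3 at 4)
    #include <cub/cub.cuh>

    void bin_offsets(const int* d_counts, int* d_offsets, int num_bins) {
      void*  d_temp = nullptr;
      size_t temp_bytes = 0;
      // First call only queries the required temporary storage size.
      cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, num_bins);
      cudaMalloc(&d_temp, temp_bytes);
      cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, num_bins);
      cudaFree(d_temp);
    }
    // Each item is then scattered to d_offsets[bin] + its rank within the bin.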
It's incredibly useful if you have many threads that produce a variable number of outputs. Imagine you're implementing some filtering operation on the GPU: many threads each take on a fixed workload and then produce some number of results. Unless we take precautions, we have a huge synchronization problem when all threads try to append their results to the output. Note that GPUs didn't have atomics for the first couple of generations that supported CUDA, so you couldn't just getAndIncrement an index and append to an array. We could store those outputs in a fixed-size layout, allocating a fixed number of output slots per thread, but that would leave many blanks in between the results. Once we know the number of outputs per thread, we can use a prefix sum to let every thread know where it can write its results in the array.
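A hedged sketch of mine of that pattern (with a made-up keep-positive predicate): each thread counts how many results it will emit, a block-wide exclusive scan turns the counts into write offsets, and only then does everyone write, densely and without gaps.

    #include <cub/cub.cuh>

    template <int BLOCK_THREADS>
    __global__ void compact_positive(const float* in, float* out, int* out_count, int n) {
      using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
      __shared__ typename BlockScan::TempStorage temp;
      __shared__ int block_base;

      int i = blockIdx.x * BLOCK_THREADS + threadIdx.x;
      float v = (i < n) ? in[i] : -1.0f;
      int my_count = (v > 0.0f) ? 1 : 0;   // 0 or 1 outputs per thread in this toy

      int my_offset, block_total;
      BlockScan(temp).ExclusiveSum(my_count, my_offset, block_total);

      // One thread reserves a contiguous slice of the output for the whole block;
      // out_count must be zero-initialized by the host before the launch.
      if (threadIdx.x == 0)
        block_base = atomicAdd(out_count, block_total);
      __syncthreads();

      if (my_count)
        out[block_base + my_offset] = v;
    }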
The output of a prefix sum corresponds exactly to the "row starts" part of the CSR sparse matrix format, so they are also essential when creating sparse matrices.
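A tiny worked example (my numbers) of that correspondence:

    nonzeros per row     = [2, 0, 3, 1]
    exclusive prefix sum = [0, 2, 2, 5]   <- exactly the CSR row starts
    (usually stored with the total appended, [0, 2, 2, 5, 6], so the nonzeros
     of row i live in the index range [row_starts[i], row_starts[i+1]))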
Interesting timing on posting this to HN; I've recently been optimizing my WebGPU LSD radix sort. Today I measured it against the Thrust CUDA version, and mine is about 10x slower (15 ms vs 1.5 ms). My goal was to get to 10 million elements in 1 ms, but now that I know even Thrust can't do better than 3 million in 1.5 ms, I know I won't be able to beat that.
I haven't tried WebGPU yet, is there an overall performance hit compared to direct CUDA programming?
AFAIK Thrust is intended to simplify GPU programming. It could well be that for specific use cases, in particular when it is possible to fuse multiple operations into single kernels, you could outperform Thrust.
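(For what it's worth, Thrust itself can fuse some things through its fancy iterators and fused algorithms; a sketch of mine, computing a sum of squares in a single kernel instead of writing out the squared values and reducing them in a second pass:)

    #include <thrust/device_vector.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>

    struct square {
      __host__ __device__ float operator()(float x) const { return x * x; }
    };

    float sum_of_squares(const thrust::device_vector<float>& v) {
      // transform_reduce applies the functor on the fly inside the reduction,
      // so no intermediate array of squares is ever materialized.
      return thrust::transform_reduce(v.begin(), v.end(),
                                      square(), 0.0f, thrust::plus<float>());
    }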
There is definitely at least a performance hit in that wgpu (and I think WebGPU too) only supports a single queue. That means you can't asynchronously run compute tasks while running render tasks.
Additionally, wgpu (the library) will insert fences between all passes that have a read-write dependency on a binding, even if technically no fence is needed because the two passes might not access the same indices.
Finally, there is an algorithm called decoupled look-back that can speed up prefix sums, but it requires a forward-progress guarantee. All recent NVIDIA cards can run it, but I don't think AMD can, so WebGPU can't in general. Raph Levien has a blog post on the subject: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portab...
Humble self-promo here: may I also recommend the team at CentML, who have dedicated their academic lives (PhD and beyond) to GPU optimizations that lower the cost of high-performance ML/AI.