Cool, this definitely seems like a good enumeration of techniques, nice to see that they discuss stuff like kernel fission as well. Having a good understanding of loop nest optimization transformations (tiling, fission, fusion, strip mining/sinking, iteration order changes, etc) provides a good vocabulary for talking about this stuff too.
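To make that vocabulary concrete, here's a minimal sketch (mine, with hypothetical arrays a, b, c) of what fusion and strip mining look like on a simple loop:

    void transform_examples(float* a, const float* b, float* c, int n) {
      // Two separate loops over the same range (candidates for fusion):
      for (int i = 0; i < n; ++i) a[i] = b[i] * 2.0f;
      for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

      // After loop fusion: a single pass over the data, better locality for a and b.
      for (int i = 0; i < n; ++i) {
        a[i] = b[i] * 2.0f;
        c[i] = a[i] + b[i];
      }

      // Strip mining: split one loop into fixed-size chunks; this is also the
      // first step of tiling (the chunk becomes the unit mapped to a cache
      // tile or a thread block).
      const int kTile = 256;
      for (int ii = 0; ii < n; ii += kTile)
        for (int i = ii; i < ii + kTile && i < n; ++i)
          c[i] = a[i] + b[i];
    }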
As someone who has spent 80%+ of their time CUDA programming for the past 9 years (I wrote the original GPU PyTorch tensor library, the Faiss GPU library, and several things that Nvidia took and put into cuDNN), I found the most instructive, short yet "advanced" education on the subject to be Paulius Micikevicius' various slide decks on "Performance Optimization"; e.g.:
https://on-demand.gputechconf.com/gtc/2013/presentations/S34...
(there are some other outstanding ones; I think one was for the Volta architecture as well)
They're old but still very relevant to today's GPUs.
Do you have any career advice for someone deeply interested in breaking into high performance GPU programming? I find resources like these, and projects like OpenAI's Triton compiler or MIMD-on-GPU, so incredibly interesting.
But I have no idea who employs those skills! Beyond scientific HPC groups or ML research teams anyway - I doubt they’d accept someone without a PhD.
My current gameplan is getting through “Professional CUDA C programming” and various computer architecture textbooks, and seeing if that’s enough.
Given that CUDA's main focus has been C++ since CUDA 3.0 (ignoring the other languages that target PTX for now), I'm not sure that 2014 book is the right approach to learn CUDA.
Can you elaborate a bit on how C++ affects the programming model? Isn't CUDA just a variant of C? I presume the goal is not to run standard C++? Also, as I understand it, PTX is an IR, so I'm not sure how it can be compared to C/C++.
Not at all, unless we are speaking of CUDA versions before 3.0.
CUDA is a polyglot programming model for NVidia GPUs, with first-party support for C, C++, Fortran, and anything else that can target PTX bytecode.
PTX also lets many other languages target CUDA in some form via their own toolchains, with .NET, Java, Haskell, Julia, and Python all having some kind of NVidia-sponsored implementation.
While originally CUDA had its own hardware memory model, NVidia decided to make it follow C++11 memory semantics and went through a decade of hardware redesign to make it possible.
- CppCon 2017: Olivier Giroux "Designing (New) C++ Hardware"
This is why OpenCL kind of lost the race: it focused too much on its C dialect and only went polyglot when it was too late for the research community to care.
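To make the C++11-memory-semantics point concrete, here is a minimal sketch (mine, not from the talk) using libcu++'s <cuda/atomic>, which exposes C++-style atomics with explicit memory orderings and GPU thread scopes:

    // Toy release/acquire handshake between two blocks (assumes both blocks
    // are co-resident; a sketch, not production code).
    #include <cuda/atomic>
    #include <cstdio>

    __global__ void producer_consumer(
        cuda::atomic<int, cuda::thread_scope_device>* flag, int* data) {
      if (blockIdx.x == 0 && threadIdx.x == 0) {
        *data = 42;                                        // plain store
        flag->store(1, cuda::std::memory_order_release);   // publish it
      } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        // Acquire pairs with the release above, so the write to *data
        // is guaranteed to be visible once the flag reads as 1.
        while (flag->load(cuda::std::memory_order_acquire) != 1) { }
        printf("consumer saw %d\n", *data);
      }
    }

    int main() {
      cuda::atomic<int, cuda::thread_scope_device>* flag;
      int* data;
      cudaMalloc(&flag, sizeof(*flag));
      cudaMalloc(&data, sizeof(*data));
      cudaMemset(flag, 0, sizeof(*flag));   // zero-init; the atomic wraps a plain int
      producer_consumer<<<2, 32>>>(flag, data);
      cudaDeviceSynchronize();
      cudaFree(flag);
      cudaFree(data);
    }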
To those who are interested: "Programming Massively Parallel Processors: A Hands-on Approach" is a great book to learn CUDA programming, and it talks mostly about performance because, after all, GPUs are about speed.
Unlike typical programming books, it spends a lot of time on how GPUs work and how the techniques it introduces fit into that picture. It's interesting even if you are just curious how an (NVIDIA) GPU works at the code level. Strongly recommended.
I bought the first edition when it came out, and it was definitely a gold mine of information on the subject. I wonder, though, whether the fourth edition is worth buying another copy. Nvidia has been advancing CUDA, in particular moving more towards C++ in the kernel language, but none of that was present when this book came out in 2007. Now more and more is happening at the thread block level with the cooperative groups C++ API and at the warp level for tensor cores. It would be great if the authors revisited all the early chapters to modernize that content, but that's a lot of work, so I don't usually count on authors making such an effort for later editions.
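(As an aside, for anyone who hasn't seen the cooperative groups style mentioned above, here is a rough sketch of mine, not from the book: a warp-level sum using a 32-thread tile.)

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void block_sum(const float* in, float* out, int n) {
      cg::thread_block block = cg::this_thread_block();
      cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

      int i = blockIdx.x * blockDim.x + threadIdx.x;
      float v = (i < n) ? in[i] : 0.0f;

      // Shuffle-based reduction within the 32-thread tile.
      for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        v += warp.shfl_down(v, offset);

      // Lane 0 of each warp accumulates into the global result.
      if (warp.thread_rank() == 0)
        atomicAdd(out, v);
    }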
I also read the older edition and got the 4th for the second read recently.
I felt that the updated coverage is more on the GPU side than the language side.
It covers new GPU features and architectures well. I don't think it covers Tensor core things. But I might be wrong.
So it's worth the update if you're interested in general NVIDIA GPU evolution.
it's true - out of all of the "LEARN CUDA IN 24 HOURS" books, this is the best one. to be fair, it isn't really one of those books - it's a textbook - but at first glance it resembles them (at least the color scheme and the title led me astray when i first found it).
Does anybody have an idea on how to get in to Metal programming (as in Apple Metal)? I'd love to mess around a little with this on iOS and macOS while learning about tile-based rendering, but I have trouble locating educational written material.
There's a book (https://metalbyexample.com/the-book/), but the author has put up a note that it's quite out of date. It seems the most up-to-date information is available in the WWDC videos (regarding e.g. Metal 3), but I'd really prefer something written. And Apple's documentation reads more like a reference material and is quite confusing when starting out.
(+1) I'm a newb to Metal myself, and I wanted to use Swift as the driving language (which was a main selling point for me). Unfortunately, almost all the material is in Objective-C.
If people like GPU programming, I wrote a blog post this week about GPU-accelerated hashmaps, semi-provocatively titled "Can we 10x Rust hashmap throughput?".
I've been looking into getting into GPU programming, starting with CS344 (https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...) on Udacity. I'm curious to hear from some of the more seasoned GPU veterans out there: what other resources would be good to take a look at after finishing the videos and assignments?
If you want to go really in-depth, I can recommend GTC on demand. It's Nvidia's streaming platform with videos from past GTC conferences. Tony Scuderio had a couple of videos on there called "GPU Memory Bootcamp" that are among the best advanced GPU programming learning material out there.
100% this. You can find all kinds of detailed topics, like CUDA graphs, memory layout optimization, optimizing storage access, etc. https://www.nvidia.com/en-us/on-demand/. They have "playlists" for things like HPC or development tools that collect the most popular videos on those topics.
Partly related, I believe, so perhaps someone can help. Whole theses have been written on prefix sum algorithms, and I never got it. Perhaps someone kind can give some convincing examples of their advantages.
Not speaking to their implementation, but prefix sums/scans are simply a very useful primitive tool for parallelizing many otherwise sequential operations. For instance, appending a variable number of items per worker to a shared coalesced buffer uses an exclusive prefix sum. This is probably the most common use case for them in practical programming. They can also be used to partition work across parallel workers (segmented prefix scans).
In lieu of pointer chasing, hashing and the like, parallel operations on flat arrays are the way to maximize GPU utilization.
It's used in one of the fastest sorting approaches - counting sort / binning - to compute the location of where to store the sorted/binned items. First you count the number of items per bin, then you use prefix-sums to compute the memory location of each bin, then you insert the items into the respective bins. Some radix-sort implementations also utilize counting sort under the hood, and therefore prefix-sums. (Not sure if all radix-sort implementations need it)
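A sketch of mine (not the parent's code) of that middle step using CUB's device-wide scan: per-bin counts go in, the starting offset of each bin comes out.

    // E.g. counts  = [3, 1, 0, 4]
    //      offsets = [0, 3, 4, 4]   (bin 0 starts at index 0, bin 1 at 3, bin 3 at 4)
    #include <cub/cub.cuh>

    void bin_offsets(const int* d_counts, int* d_offsets, int num_bins) {
      void*  d_temp = nullptr;
      size_t temp_bytes = 0;
      // First call only queries the required temporary storage size.
      cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, num_bins);
      cudaMalloc(&d_temp, temp_bytes);
      cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, num_bins);
      cudaFree(d_temp);
    }
    // Each item is then scattered to d_offsets[bin] + its rank within the bin.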
It's incredibly useful if you have many threads that produce a variable number of outputs. Imagine you're implementing some filtering operation on the GPU: many threads each take on a fixed workload and then produce some number of results. Unless we take precautions, we have a huge synchronization problem when all threads try to append their results to the output. Note that GPUs didn't have atomics for the first couple of generations that supported CUDA, so you couldn't just getAndIncrement an index and append to an array. We could store those outputs in a fixed-size layout, allocating a fixed number of output slots per thread, but that would leave many blanks in between the results. Once we know the number of outputs per thread, we can use a prefix sum to let every thread know where it can write its results in the array.
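A hedged sketch of mine of that pattern (with a made-up keep-positive predicate): each thread counts how many results it will emit, a block-wide exclusive scan turns the counts into write offsets, and only then does everyone write, densely and without gaps.

    #include <cub/cub.cuh>

    template <int BLOCK_THREADS>
    __global__ void compact_positive(const float* in, float* out, int* out_count, int n) {
      using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
      __shared__ typename BlockScan::TempStorage temp;
      __shared__ int block_base;

      int i = blockIdx.x * BLOCK_THREADS + threadIdx.x;
      float v = (i < n) ? in[i] : -1.0f;
      int my_count = (v > 0.0f) ? 1 : 0;   // 0 or 1 outputs per thread in this toy

      int my_offset, block_total;
      BlockScan(temp).ExclusiveSum(my_count, my_offset, block_total);

      // One thread reserves a contiguous slice of the output for the whole block;
      // out_count must be zero-initialized by the host before the launch.
      if (threadIdx.x == 0)
        block_base = atomicAdd(out_count, block_total);
      __syncthreads();

      if (my_count)
        out[block_base + my_offset] = v;
    }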
The output of a prefix sum corresponds exactly to the "row starts" part of the CSR sparse matrix format, so they are also essential when creating sparse matrices.
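A tiny worked example (my numbers) of that correspondence:

    nonzeros per row     = [2, 0, 3, 1]
    exclusive prefix sum = [0, 2, 2, 5]   <- exactly the CSR row starts
    (usually stored with the total appended, [0, 2, 2, 5, 6], so the nonzeros
     of row i live in the index range [row_starts[i], row_starts[i+1]))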
Interesting timing on posting this to HN; I've recently been optimizing my WebGPU LSD radix sort. Today I measured it against the Thrust CUDA version, and mine is about 10x slower (15 ms vs 1.5 ms). My goal was to get to 10 million elements in 1 ms, but now that I know even Thrust can't do better than 3 million in 1.5 ms, I know I won't be able to beat that.
I haven't tried WebGPU yet, is there an overall performance hit compared to direct CUDA programming?
AFAIK Thrust is intended to simplify GPU programming. It could well be that for specific use cases, in particular when it is possible to fuse multiple operations into single kernels, you could outperform Thrust.
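(For what it's worth, Thrust itself can fuse some things through its fancy iterators and fused algorithms; a sketch of mine, computing a sum of squares in a single kernel instead of writing out the squared values and reducing them in a second pass:)

    #include <thrust/device_vector.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>

    struct square {
      __host__ __device__ float operator()(float x) const { return x * x; }
    };

    float sum_of_squares(const thrust::device_vector<float>& v) {
      // transform_reduce applies the functor on the fly inside the reduction,
      // so no intermediate array of squares is ever materialized.
      return thrust::transform_reduce(v.begin(), v.end(),
                                      square(), 0.0f, thrust::plus<float>());
    }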
There is definitely at least a performance hit in that wgpu (and I think WebGPU too) only supports a single queue. That means you can't asynchronously run compute tasks while running render tasks.
Additionally, wgpu (the library) will insert fences between all passes that have a read-write dependency on a binding, even if technically no fence is needed because the two passes might not access the same indices.
Finally, there is an algorithm called decoupled look-back that can speed up prefix sums, but it requires a forward-progress guarantee. All recent NVIDIA cards can run it, but I don't think AMD can, so WebGPU can't in general. Raph Levien has a blog post on the subject: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portab...
Humble self-promo here: may I also recommend the team at CentML, who have dedicated their academic lives (PhD and beyond) to GPU optimizations that lower the cost of high-performance ML/AI.