Author here, yeah I know... this fig is a trade-off between having a nice-looking scene and being scientifically clean by stating what we did and did not do with our system.
Also everything the other comment said is correct :)
I think the intent of that graphic is to show how you can take a relatively simple map or set of assets and augment them in real time (with relatively low compute cost) to substantially improve the visuals without all of the overhead, latency, and cost that comes with storing, loading, CPU side processing, and memory transfer of those fully baked (or CPU side generated) assets.
GPUs are Turing complete computers. Why is it necessary to add a GPU programming design pattern like this as an extension to Vulkan? Can I just implement this design pattern in an existing GPU programming language right now, without waiting for Vulkan extensions to be implemented for all the relevant GPU models? If this isn't possible right now, then what's missing from existing GPU languages that prevents this kind of flexibility?
> GPUs are Turing complete computers.
> what's missing from existing GPU languages that prevents this kind of flexibility?
GPUs contain many Turing-complete computers that are orchestrated by a non-Turing-complete, fixed-function logic block which interprets your command buffers (as opposed to shader code). That component is different in every architecture.
Work graphs are about creating a scheduling system for that fixed-function component.
You can't effectively do this in shaders only.
Unreal Engine's "Nanite" system uses something like a work graph in a shader. Making it work is full of pitfalls and gotchas, but you can read about it in the white paper.
The core issue here is that without work graphs, each compute shader needs to write out its results to DRAM and finish all the work it is doing. Then another compute shader is dispatched, and it reads all those results back from DRAM.
With work graphs you can make a producer/consumer type of setup where one compute shader's outputs are fed as input to another compute shader without going through DRAM. It also removes the need to know ahead of time how much output the producer will produce, so there's no need to preallocate a buffer for the intermediate results.
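To make the baseline concrete, here's a minimal CUDA-flavored sketch of the "two dispatches through DRAM" pattern (the kernel names and the toy arithmetic are placeholders, not anything from a real engine): the intermediate results live in a global-memory buffer, and the consumer only starts once the whole producer grid has drained.

```cuda
#include <cuda_runtime.h>

// Pass 1: every thread writes its result out to a buffer in DRAM.
// The whole grid must finish before the next pass can consume it.
__global__ void producer(const float* in, float* intermediate, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) intermediate[i] = in[i] * 2.0f;   // stand-in for real work
}

// Pass 2: reads the intermediate results back out of DRAM.
__global__ void consumer(const float* intermediate, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = intermediate[i] + 1.0f;  // stand-in for real work
}

int main() {
    const int n = 1 << 20;
    float *in, *mid, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&mid, n * sizeof(float));  // intermediate buffer lives in DRAM
    cudaMalloc(&out, n * sizeof(float));

    int block = 256, grid = (n + block - 1) / block;
    producer<<<grid, block>>>(in, mid, n);
    // Same stream => consumer starts only after producer has completely
    // finished and its results are sitting in DRAM. This round trip is
    // the overhead work graphs try to avoid.
    consumer<<<grid, block>>>(mid, out, n);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(mid); cudaFree(out);
    return 0;
}
```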
DRAM bandwidth is typically the bottleneck in modern graphics so this makes sense.
Because every GPU on the market has a very different internal organization and instruction set.
They are able to execute any shader using standardized programming constructs (Vulkan, DirectX, Metal) that both the OS and driver understand. While the API and OS manage device/app context, the driver manages the device itself and handles compiling shaders and standard calls into that GPU's specific instructions and layout.
They are Turing complete in the context of a single thread. But a single thread does not control the entire GPU. For example, in multi-vendor APIs, a thread cannot spawn another thread.
The execution model of GPUs is complex and only specified at a high level in order to give manufacturers freedom in their implementations. Work Graphs are an extension of that execution model.
I’m wondering the same. Maybe it’d be easier if we could upload our own ‘sequential program’ to a little CPU inside the GPU, i.e. code making calls to the real GPU. This way delays to native GPU code would be minimized and we wouldn’t need to mimic creating new features / GPU paradigms.
GPUs are set up as large numbers of SIMD blocks of threads, with some shared units like registers, cache, ALUs, etc. for every few blocks of threads.
The typical way you schedule GPU work is to dispatch a very large amount of work all running the same program. So you'd spawn 10 million units of work to run on 10 thousand threadblocks. Due to differing memory accesses per unit of work, each threadblock will complete its work at a different time. Whenever a threadblock is stuck waiting for memory accesses to complete, or has finished its work, the GPU scheduler gives it a new unit of work to work on.
If you graph "occupancy" as the percentage of threadblocks busy doing work, then you'd see a spin-up period as threadblocks are filled with work, a steady period where all threadblocks are busy, and then a spin-down period as there's gradually less work available than the number of threadblocks.
If you wanted to run two programs (e.g., check meshes to determine what's visible and cull invisible meshes, then draw the remaining visible meshes), then you'd have a hard gap in between. Threadblocks would spin up with culling work, work steadily, spin down until 0 work is left, spin up with mesh drawing work, work steadily, and then spin down again. The spin-down period in between the two passes is bad for performance.
Rather than having the GPU go completely idle, wouldn't it be better if, as the GPU is running out of culling work to execute (available work is less than the number of threadblocks available to perform work), you could fill in the gaps by immediately starting the mesh drawing work for meshes that have already passed culling? That way there would be no idle time between passes. Work graphs let you do this, by specifying execution not as monolithic passes, but as nodes that perform one unit of work and produce output for other nodes to consume.
Another benefit is the memory allocation required. With the two distinct passes model, you need to allocate the worst-case amount of memory to hold the first pass's output (the input to the second pass). If you have 10,000 meshes, then you need to allocate space to be able to draw 10,000 meshes for the worst case that they're all visible - there is no runtime memory allocation on the GPU. With work graphs, the GPU can allocate a reasonable estimate of how much memory it needs for node 1's output (input for node 2), and if the output buffer is full, the GPU can simply stop scheduling node 1 work and start scheduling node 2 work to pop from the buffer and free up space.
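As a hedged illustration of that two-pass model (the Mesh struct, isVisible test, and kernel names are all invented, and a real renderer would feed pass 2 through indirect draw arguments rather than a compute kernel), a CUDA-style sketch might look like this:

```cuda
#include <cuda_runtime.h>

struct Mesh { float3 center; float radius; };

// Hypothetical visibility test; a real renderer would do a frustum or
// occlusion check against the camera here.
__device__ bool isVisible(const Mesh& m) { return m.center.z > 0.0f; }

// Pass 1: cull. Surviving mesh indices are compacted into visibleIndices
// with an atomic counter. visibleIndices must be sized for the WORST case
// (all numMeshes visible), because the GPU can't grow it at runtime.
__global__ void cull_pass(const Mesh* meshes, int numMeshes,
                          int* visibleIndices, int* visibleCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numMeshes) return;
    if (isVisible(meshes[i])) {
        int slot = atomicAdd(visibleCount, 1);
        visibleIndices[slot] = i;
    }
}

// Pass 2: process only the survivors. Also launched with a worst-case grid,
// since the CPU doesn't know visibleCount without reading it back; threads
// beyond *visibleCount just exit early.
__global__ void draw_pass(const Mesh* meshes, const int* visibleIndices,
                          const int* visibleCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= *visibleCount) return;
    const Mesh& m = meshes[visibleIndices[i]];
    (void)m;  // stand-in for per-mesh drawing work
}

int main() {
    const int numMeshes = 10000;
    Mesh* meshes; int* visibleIndices; int* visibleCount;
    cudaMalloc(&meshes, numMeshes * sizeof(Mesh));
    cudaMalloc(&visibleIndices, numMeshes * sizeof(int));  // worst case: all 10,000
    cudaMalloc(&visibleCount, sizeof(int));
    cudaMemset(visibleCount, 0, sizeof(int));

    int block = 256, grid = (numMeshes + block - 1) / block;
    cull_pass<<<grid, block>>>(meshes, numMeshes, visibleIndices, visibleCount);
    draw_pass<<<grid, block>>>(meshes, visibleIndices, visibleCount);  // hard gap between the passes
    cudaDeviceSynchronize();

    cudaFree(meshes); cudaFree(visibleIndices); cudaFree(visibleCount);
    return 0;
}
```

Note how visibleIndices is sized for all 10,000 meshes no matter how few survive, and how draw_pass sits behind a hard dependency on cull_pass finishing; those are the two costs the pass-based model forces on you.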
As for whether you can do this with the existing GPU pass-based model, more or less yeah. You can build your own queue and use global atomics to control producer/consumer synchronization. It's called persistent threads. You might even do better than the GPU's built-in scheduling depending on the task, if you hyper-optimize and tune your code. However, it's a lot harder, and dependent on specific implicit behavior of the GPU's scheduler. If the GPU is not smart enough to sleep threadblocks waiting for the atomic "lock" and schedule threadblocks that are holding the lock, then you get a deadlock and your computer freezes until the GPU driver kills your program.
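For the curious, here's a heavily simplified persistent-threads sketch in CUDA: one grid-wide queue in global memory, atomic counters to claim and publish slots, and a pending counter for termination. All the names are made up, the queue is assumed big enough to never overflow, and it glosses over the scheduler/forward-progress pitfalls mentioned above, so treat it as a shape-of-the-technique illustration rather than production code.

```cuda
#include <cuda_runtime.h>

struct WorkQueue {
    int* items;    // payload per work item (here just an int)
    int* ready;    // per-slot flag: payload fully written and published
    int  head;     // next slot to claim for consumption
    int  tail;     // next slot to claim for production
    int  pending;  // items pushed but not yet fully processed
};

__device__ void push(WorkQueue* q, int value) {
    atomicAdd(&q->pending, 1);          // this item may spawn more work
    int slot = atomicAdd(&q->tail, 1);  // reserve a slot (assumed in bounds)
    q->items[slot] = value;             // write the payload...
    __threadfence();                    // ...make it visible grid-wide...
    atomicExch(&q->ready[slot], 1);     // ...then publish the slot
}

// Stand-in for real per-item work (e.g. a BVH node cull check that pushes
// the node's children when it passes).
__device__ void process_item(WorkQueue* q, int value) {
    if (value > 1) { push(q, value / 2); push(q, value / 2); }
}

__global__ void persistent_kernel(WorkQueue* q) {
    while (true) {
        int slot = atomicAdd(&q->head, 1);  // claim the next item to consume
        // Spin until that slot is published, or until no work can ever appear.
        while (atomicAdd(&q->ready[slot], 0) == 0) {
            if (atomicAdd(&q->pending, 0) == 0) return;  // queue drained: exit
        }
        process_item(q, q->items[slot]);
        atomicSub(&q->pending, 1);          // this item is fully processed
    }
}

int main() {
    const int capacity = 1 << 20;  // assumed large enough for all items + spinners
    WorkQueue h;
    cudaMalloc(&h.items, capacity * sizeof(int));
    cudaMalloc(&h.ready, capacity * sizeof(int));
    cudaMemset(h.ready, 0, capacity * sizeof(int));

    // Seed the queue with a single already-published root item.
    int root = 64, one = 1;
    cudaMemcpy(h.items, &root, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(h.ready, &one, sizeof(int), cudaMemcpyHostToDevice);
    h.head = 0; h.tail = 1; h.pending = 1;

    WorkQueue* d;
    cudaMalloc(&d, sizeof(WorkQueue));
    cudaMemcpy(d, &h, sizeof(WorkQueue), cudaMemcpyHostToDevice);

    persistent_kernel<<<64, 128>>>(d);  // the "persistent" worker threads
    cudaDeviceSynchronize();

    cudaFree(h.items); cudaFree(h.ready); cudaFree(d);
    return 0;
}
```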
The big reason work graphs are getting introduced is Unreal Engine 5's Nanite renderer, which came out a few years ago. It uses the persistent threads technique to traverse a BVH, doing cull checks on each node, with passing nodes getting their children pushed onto the queue. BVHs (trees) can be unbalanced, so Nanite uses persistent threads to dynamically load-balance nodes amongst the available threadblocks. With work graphs, Nanite could instead define this as a graph, with culling nodes outputting work for more culling nodes, and have the driver handle the load balancing.
However, there is no direct equivalent to work graphs. This DirectX blog goes over the differences between DirectX's equivalent of indirect calls and work graphs better than I can.
For those who don't want to click through:
(a) Without our system -- (empty courtyard) (b) With our system -- (same picture but with procedurally generated content added)