I wonder if someday GPUs will be so powerful that we will just program them with small parallelized software renderers. And I don't mean shaders or CUDA, because those are awkward and not easy to use.

Will that ever be the case, though? I don't really know how GPUs are made and wired, or what their constraints are.

GPUs have often had a lot of "fixed" computing capabilities, meaning you have to do things in a certain way to exploit their performance. But with Vulkan being too complex, and chips getting more and more transistors, maybe a day will come when GPUs finally become easier to program at the expense of top performance, a bit like interpreted languages being slower to run but faster to write. In the software industry, it has usually been better to use faster computers and write less-than-optimal programs, because chips are always cheaper than developers.

I guess most game developers would gladly use a GPU at a fraction of its performance if it were just easier to program. Modern GPUs are so fast that it would feel okay to get 5 or 10% of the speed if it meant not having to deal with Vulkan or shaders.




The complexity of Vulkan or shaders has very little to do with the fixed-function parts of the GPU. The complexities of Vulkan are around the realities of talking to a coprocessor over a relatively low-bandwidth, high-latency pipe. It's not unlike, say, gRPC or whatever other remote call API you want. So you end up building command queues so you can send a batch of work all at once instead of dozens or hundreds of small transmissions. And then you need to deal with synchronization between those. And then with memory management of all of that.
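
To make that concrete, here's a minimal Vulkan sketch of the pattern: record a copy into a command buffer, submit the whole batch to a queue in one call, then synchronize with a fence. It assumes the device, queue, command buffer, buffers, and (unsignaled) fence were created elsewhere, and all error handling is omitted:

    #include <vulkan/vulkan.h>
    #include <cstdint>

    void submit_copy(VkDevice device, VkQueue queue, VkCommandBuffer cmd,
                     VkBuffer src, VkBuffer dst, VkDeviceSize size,
                     VkFence fence)
    {
        // Record a batch of work on the CPU side...
        VkCommandBufferBeginInfo begin{};
        begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
        vkBeginCommandBuffer(cmd, &begin);

        VkBufferCopy region{};          // srcOffset = dstOffset = 0
        region.size = size;
        vkCmdCopyBuffer(cmd, src, dst, 1, &region);

        vkEndCommandBuffer(cmd);

        // ...then ship the whole batch across the "pipe" in one submission.
        VkSubmitInfo submit{};
        submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submit.commandBufferCount = 1;
        submit.pCommandBuffers = &cmd;
        vkQueueSubmit(queue, 1, &submit, fence);

        // Explicit synchronization: block until the GPU signals the fence.
        vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    }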

Very little of that goes away as the GPU becomes more programmable / powerful. Rather it gets ever more complicated, as suddenly a texture isn't just a texture anymore. It's now a buffer. And buffers can be used for lots of things. This complexity really could only go away if the GPU & CPU merged into a single unit, which isn't entirely unlike Intel's failed Larrabee as a sibling comment mentioned.

As for shaders, those are just arbitrary programs. The complexity there is entirely the complexity of whatever your renderer does, compounded by the extreme parallelism of a GPU. So this complexity really never goes away.

For your core question, the problem is that fixed-function hardware is just always faster & more efficient than programmable hardware. So as long as games commonly do the same set of easily ASIC-able work (like render triangles), then you really won't ever see that fixed-function unit go away. But something like Unreal Engine 5's Nanite is kinda the productization of the idea of doing triangle rasterization in a "software" renderer instead of the fixed-function parts of the GPU: https://www.unrealengine.com/en-US/blog/understanding-nanite...


Does unified memory as in Apple’s M chips help with reducing the complexity?


Only a little. The bulk of the complexity for large data like textures isn't around the DMA transfer. It's around things like ensuring data is properly aligned, that textures are swizzled (if that format is even documented at all), and that it's actually safe to read or write the buffer (that is, that the GPU isn't still using it). There's also the complexity of actually allocating memory: malloc/free isn't really provided, but rather something like mmap, so you want a (re)allocator on top of that.
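
As a rough illustration of that last point, here's a minimal sketch of a bump suballocator sitting on top of one large driver-provided mapping. The struct name and the fixed 256-byte alignment are assumptions for illustration, not any particular API's real requirements:

    #include <cstddef>
    #include <cstdint>

    struct GpuHeap {
        std::uint8_t* base;     // one large mapping obtained from the driver
        std::size_t   size;
        std::size_t   offset = 0;

        // Bump-allocate a sub-range, padding to the required alignment.
        void* suballoc(std::size_t bytes, std::size_t alignment = 256) {
            std::size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
            if (aligned + bytes > size) return nullptr;   // heap exhausted
            offset = aligned + bytes;
            return base + aligned;
        }

        // Freeing individual blocks isn't supported here; a real allocator
        // would track lifetimes (and whether the GPU is still reading them).
        void reset() { offset = 0; }
    };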

And there's also the complexity that comes from APIs like Vulkan wanting to work on both unified and non-unified systems.

Also unified doesn't necessarily mean coherent, so there's additional complexities there.


Yes, and Apple abstracts over it even on discrete GPUs for better or worse.

https://developer.apple.com/documentation/metal/setting_reso...


Wasn't this the idea behind Intel's Larrabee hardware?

Didn't succeed at the time. Maybe it'll happen one day. Or maybe not.

If Vulkan and co. are too difficult, I personally suspect it's more fruitful to build better abstractions on top of the underlying constraints dictated by the need for massive parallelism, rather than to try to make x86-style programming paradigms fast enough for graphics-type workloads.


I think your question is partly answered by the cudaraster work [1], which is well over a decade old at this point. They basically did write a software rasterizer (it ran on CUDA, but is adaptable to other GPU intermediate languages). The details are interesting, but the tl;dr is that it's approximately 2x slower than hardware. To me, that means you could build a GPU out of a fully general-purpose parallel computer, but in practice the space is extremely competitive and nobody will leave that kind of performance on the table.
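
For a sense of what "software rasterization" means here, below is a scalar C++ sketch of the edge-function coverage test at its core (assuming counter-clockwise winding, with no clipping, binning, or depth handling). A GPU rasterizer like cudaraster distributes the per-pixel loop across many threads:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Vec2 { float x, y; };

    // Signed parallelogram area of (a, b, c): >= 0 means c is on or to
    // the left of edge ab (for counter-clockwise triangles).
    static float edge(Vec2 a, Vec2 b, Vec2 c) {
        return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    }

    void rasterize(Vec2 v0, Vec2 v1, Vec2 v2,
                   std::vector<std::uint32_t>& fb, int width, int height,
                   std::uint32_t color)
    {
        // Bounding box of the triangle, clamped to the framebuffer.
        int x0 = std::max(0, (int)std::min({v0.x, v1.x, v2.x}));
        int y0 = std::max(0, (int)std::min({v0.y, v1.y, v2.y}));
        int x1 = std::min(width - 1,  (int)std::max({v0.x, v1.x, v2.x}));
        int y1 = std::min(height - 1, (int)std::max({v0.y, v1.y, v2.y}));

        // This doubly nested loop is what a GPU version spreads over threads.
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                Vec2 p{x + 0.5f, y + 0.5f};
                // Inside iff all three edge functions agree in sign.
                if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 &&
                    edge(v2, v0, p) >= 0)
                    fb[y * width + x] = color;
            }
    }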

I think this also informs recent academic work on RISC-V-based GPUs such as RV64X. These are mostly a lot of generic cores with just a little graphics hardware added. The results are not yet compelling compared with shipping GPUs, but I think it's a promising approach.

[1]: https://research.nvidia.com/publication/high-performance-sof...


Also there was the Cell processor used in the PlayStation 3. It was a kind of small CPU cluster on a chip, very different architecturally from GPUs.

Cell was originally supposed to act as both the CPU and GPU on the PS3, but that plan didn’t pan out in the actual hardware and Sony scrambled to include an Nvidia GPU.


CUDA seems like a fine language, although a standardized language would be better. The difficulties of CUDA come from the realities of the hardware, and other languages can't change that. I'd love to be wrong about this though, what specifically do you think could be improved about CUDA (not saying this as a challenge, just an invitation for more conversation)?


One of the reasons CUDA won over OpenCL is that it is a polyglot runtime; you are mixing concepts there.


I've heard Unreal Engine 5's Nanite tech described as a software rasterizer implemented in compute shaders. https://docs.unrealengine.com/5.0/en-US/RenderingFeatures/Na...


I don't see how shaders or CUDA are not what you're talking about. You do have higher-level languages like SAC or Futhark that can target GPUs, but they can essentially do what CUDA can, just with a different lick of paint.


You can already run C++ as a shader language, and the future is some sort of mesh shaders.



