
It's interesting because in some ways, GPUs are much less specialized than CPUs – GPGPU programming mostly involves treating the GPU like a huge, dumb collection of very simple processors. I expect effects in the opposite direction as well, i.e. that programming languages get better at using GPUs for a wider set of workloads. For example, I was recently able to port a prototype ETL pipeline (no machine learning or matrix multiplication involved) to a GPU and get a ~10x price/performance improvement, which would have been a lot harder even 5 years ago.
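To give a flavour, a toy sketch of the kind of step involved (assuming CuPy; load_prices() is a made-up loader and the column fits in GPU memory):

  import cupy as cp

  # Toy "ETL" step: clean and aggregate a numeric column entirely on the GPU.
  prices = cp.asarray(load_prices())                # host -> device copy; load_prices() is hypothetical
  valid = prices[(prices > 0) & ~cp.isnan(prices)]  # drop zero/negative and NaN rows
  daily_total = float(valid.sum())                  # reduce on the GPU, copy one scalar back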



Yeah, I disagree with the comparison of GPGPUs to specialized hardware. It's more a distinction between hardware optimized for throughput on general, highly parallel computations and hardware optimized for latency on highly serial computations.


In practice it's pretty tough to make this work. If your ETL can't be turned into just math on a <12GB dataset, then you will be contending with streaming data from io -> CPU -> GPU -> CPU efficiently. CPUs are really good at multi-threaded io given an evented runtime these days, but GPUs can only do 1 thing at a time. This means that your application will need to batch data to the GPU.

So the real race is between how good you can be at batching versus how good you can be at parallel dispatch when reading data off disk. As parallel dispatch is a more general problem, it tends to get more attention.
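Concretely, the batching loop ends up looking something like this (a sketch assuming CuPy; read_chunks() is a made-up chunked reader):

  import cupy as cp

  results = []
  for host_batch in read_chunks("data.bin", rows=1_000_000):  # disk -> CPU (hypothetical reader)
      dev_batch = cp.asarray(host_batch)                      # CPU -> GPU copy
      out = cp.log1p(dev_batch).sum(axis=0)                   # the actual math
      results.append(cp.asnumpy(out))                         # GPU -> CPU copy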


"Math on datasets that can be parallelized" is a pretty huge swathe of use cases in the Data Science/Data Engineering world, I've probably spent at least 30% of my time on similar problems.

At my first job we used some special SAS product to handle data larger than memory without getting much parallelism; then it was Hadoop, then Spark. Now I can write Julia code that is agnostic over CPU or GPU, vastly outperforms those tools on the same types of jobs, and runs unchanged on my laptop or a cluster. It's a huge advance for my domain! I agree it probably doesn't apply to most engineers though.
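That's Julia, but the same device-agnostic idea can be sketched in Python terms as well - write against an array module and swap NumPy for CuPy when a GPU is available (just an illustration of the pattern, assuming CuPy is installed; not my actual code):

  import numpy as np
  try:
      import cupy as cp
      xp = cp          # GPU backend if CuPy and a CUDA device are available
  except ImportError:
      xp = np          # otherwise the identical code runs on the CPU

  def zscore(col):
      return (col - xp.mean(col)) / xp.std(col)   # same source, either backend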


Even in NLP/ad-tech/CV spaces, a lot of the time is spent in CPU/IO-bound tasks such as featurization, reading datasets off of disk, and shuttling them to the GPU. On my most recent model training jobs in TF with 16GB of working memory, I sit at an average of ~40% GPU utilization.

Some of this overhead is language specific, and some is due to shitty code. Nevertheless, I'd bet that if I didn't need to shuttle memory to the GPU, or could do multiple things at a time, I'd crush my perf numbers (noting of course that 40% GPU utilization is still roughly 10x better than a CPU).


Interesting, that makes sense. This inspired me to run some checks. The CPU version of the pipeline I'm working on spends about 90% of its time with ~100% CPU utilization in a massively parallel process and 10% of its time on other stuff, mostly io. With the GPU making the massively parallel part ~10x faster, io is now the dominant portion of my code – Amdahl's law in action!
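For the curious, the back-of-the-envelope with those numbers:

  p, s = 0.90, 10                            # parallel fraction, GPU speedup on it
  speedup = 1 / ((1 - p) + p / s)            # ~5.3x overall (Amdahl's law)
  io_share = (1 - p) / ((1 - p) + p / s)     # ~53% of the new runtime is the serial/io part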


TensorFlow datasets (tf.data) are pretty great once you get them really rolling. They do a great job of scaling out worker threads for various parts of the featurization process, keeping an arbitrarily big cache of batches ready to go in RAM, etc.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset
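A minimal pipeline along those lines (standard tf.data API; filenames and parse_example here are just placeholders):

  import tensorflow as tf

  ds = (tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
          .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # parallel featurization
          .batch(256)
          .prefetch(tf.data.AUTOTUNE))   # keep a cache of ready batches while the GPU trains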


Aye - the above job was done with TF datasets. They are limited in that the Python GIL forces you into multiprocessing, multiprocessing involves serialization in Python, and serialization means dealing with Python's rather extreme object overhead (24 bytes for an int!).

Which all means there's a bunch of CPU-bound stuff between your job and the GPU/CUDA kernels. How fast your app can deal with the above will influence overall GPU utilization.
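You can see the per-object overhead directly (CPython, 64-bit build; exact numbers vary a bit by version):

  import sys
  import numpy as np

  sys.getsizeof(0)                             # ~24 bytes for a single int object
  sys.getsizeof([0] * 1_000_000)               # ~8 MB just for the list's pointers
  np.zeros(1_000_000, dtype=np.int64).nbytes   # 8 MB flat buffer, no per-element objects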


While I appreciate that Python is easy and flexible to write, it bugs me immensely every time I run into situations like this.

We go to all this effort to build and write stuff in this optimising, fancy framework only for the whole process to be bottlenecked by some silly performance limitation in Python.


It's usually possible to sidestep the Python limitations with a bit of elbow grease. The usual performance killer is tf.py_function, which does indeed have to respect the GIL. If you can work out a nice way to handle your data without it, the pipeline can stay in C++ in the backend and avoid the GIL. (So: data in a format that tf has a parser for, and transformations written with tf ops where you can.)
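For example, parsing CSV with tf ops instead of a py_function keeps the whole map inside the C++ runtime (the schema below is made up):

  import tensorflow as tf

  def parse_line(line):
      # tf.io.decode_csv runs as a graph op, so no Python (and no GIL) per record
      f1, f2, label = tf.io.decode_csv(line, record_defaults=[0.0, 0.0, 0])
      return tf.stack([f1, f2]), label

  ds = (tf.data.TextLineDataset("data.csv")
          .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)
          .batch(256)
          .prefetch(tf.data.AUTOTUNE))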


It's only going to be IO-bound because the GPU speeds up the computation so much.


> GPUs can only do 1 thing at a time

Wut? Modern GPUs (ahem, Nvidia) can pipeline very effectively. You can copy to and from the GPU while the GPU is busy working on something already loaded into memory. True, host copies suck, but you can hide them behind compute delays in many workloads.

With GPUDirect, you can even skip the CPU and DMA straight between the GPU and the I/O device (storage or network controller).
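For instance, in PyTorch (dataset and model are stand-ins here), pinned host memory plus non_blocking copies let the input pipeline keep feeding while the GPU is busy:

  import torch

  loader = torch.utils.data.DataLoader(dataset, batch_size=256,
                                       num_workers=4, pin_memory=True)
  for batch in loader:
      batch = batch.to("cuda", non_blocking=True)  # async H2D copy from pinned memory
      out = model(batch)                           # queued behind the copy; the host moves on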


Or avoid shuttling the data back and forth and go with a unified architecture like the M1. Memcpy is terrible and yet we keep trying to build bigger and better pipes to make it faster with diminishing returns.


> Or avoid shuttling the data back and forth and go with a unified architecture like the M1.

AMD has been shipping unified architectures since August 2011. Nobody cares.

The problem is that while a small low power CPU and a small low power GPU living on the same die and sharing the same memory controller will happily share memory, a big honkin' CPU and a big honkin' GPU don't like to share a memory controller. Some of the big advancements in absolute performance in recent years stem from moving the memory controller closer and closer to the relevant processor. In recent AMD systems, the pins on the CPU are wired directly to the pins on the RAM DIMMs; relatively old motherboards that shipped when DDRx memory was the standard were still compatible with DDR(x+1) CPUs and RAM when those became the new standard, because the motherboard has no RAM logic. If you wanted a unified architecture, you're either going to make the CPU talk to RAM through the GPU, or the GPU talk to RAM through the CPU, or go back to the early 2000s and put a north bridge on the motherboard.

In addition to all that, the bottlenecks are different. The bottleneck on a GPU is always, 100% of the time, memory bandwidth. Latency is mostly irrelevant. The bottleneck on a CPU is usually (95%? 99%?) the memory latency. There's no such thing as RAM that will perform well for both a CPU and a GPU.


I strongly suspect that this is Nvidia's next big move: a giant CPU/GPU hybrid with unified memory. Who doesn't want Threadripper-scale Nvidia/ARM cores?


Yes, unified memory being the key here. AFAIK GPUs still use a hub-and-spoke model, where all processing is done in the hub but data is stored on the rim. They achieve their speedup by increasing the number of spokes (data channels). You could do better by moving the data processing to the spokes as well, and using the hub only for orchestration and synchronization.

At least, it's computationally better. But it won't allow scaling the processing power and memory size independently, so it might require a different commercial model.


I would suggest a different metaphor, since these days CPUs also have a bunch of cores:

* GPUs invest more chip real-estate in integer and floating point crunching (ALU, FP).

* CPUs invest more chip real-estate in control logic, prediction and speculation.

These aren't the only hardware differences, of course; there's also how caching works, larger vs. higher-bandwidth memory, etc.


CPUs will have more and more cores to reduce the gap.

Maybe future CPUs, GPUs and TPUs will merge into a new kind of compute unit.



