This is the initial SLIDE paper; the new work is supposed to be an improvement on top of SLIDE. Having looked at and run the original implementation, I'm a lot less skeptical than the average post here. It was quite fast already, and it should be considered a toy implementation.
I think the reason you're getting so many questions about the 'toy' statement is that toy programs are often simple in large part because they omit handling complex or difficult cases. I don't think I've ever heard the term applied to a complete solution to a problem that is merely implemented naively.
As a result, saying it's a toy implementation makes people think you mean the speedup comes from handling only an easy portion of the problem, rather than that even a naive implementation is quite fast.
> this original was a toy/PoC and this new paper is much better
This, one hundred percent. The original has very few optimizations over a completely naive implementation; it uses MPI and huge pages, and that's essentially it.
After a quick read of the paper, no. You could adapt this to the GPU (which would require that the hashes work on groups of neurons instead of individual ones) and might get a similar speedup. Locality-sensitive hashing in fact seems like a primitive attention mechanism; with a proper attention implementation you might get even better results.
The title is a bit misleading as this algorithm is for feedforward networks and doesn't yet support convolutional layers or any of the SOTA techniques for image classification... which is why GPUs reign supreme for training deep neural nets.
NeRF is a good example of a network that doesn't have convolutions yet requires a ton of iterations to train. This paper is particularly relevant to wide networks which are important because CPU memory is currently much cheaper than GPU memory (even for FANG researchers!).
Interesting, I didn't know that NeRF was simply a feedforward network.
I hope that this research group can make more headway into training on CPUs, but I also would like to (naively) see less hyperbolic titles. This paper is not just particularly relevant to wide networks - it's only relevant to wide networks.
This is why I come to HN: to find out why it doesn't work in the general case. I can always count on you guys to point out why something is an evolutionary change rather than a revolutionary one.
The original paper includes convolutional layer support in their future work & next steps. But it's not a foregone conclusion that the same speedup will occur.
What? No. Fully connected networks are deep learning, and actually the most important deep learning workload. See: Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective:
Transformers, which are currently waging a successful campaign to conquer all Deep Learning, are largely stacked feed-forward networks, matrix multiplies and maps. Some ideas to make attention more scalable, such as LSH or large sparse attention matrices seem like they'd be well suited to this approach.
Their approach should also be readily adaptable to RNNs, including LSTMs.
Certainly worth investigating as an alternative for efficiently running and training giant networks on less expensive hardware.
>"Deep neural networks (DNN) are a powerful form of artificial intelligence that can outperform humans at some tasks. DNN training is typically a series of matrix multiplication operations, an ideal workload for graphics processing units (GPUs), which cost about three times more than general purpose central processing units (CPUs).
"The whole industry is fixated on one kind of improvement—faster matrix multiplications," Shrivastava said. "Everyone is looking at specialized hardware and architectures to push matrix multiplication. People are now even talking about having specialized hardware-software stacks for specific kinds of deep learning. Instead of taking an expensive algorithm and throwing the whole world of system optimization at it, I'm saying, 'Let's revisit the algorithm.'"
Shrivastava's lab did that in 2019, recasting DNN training as a search problem that could be solved with hash tables. Their "sub-linear deep learning engine" (SLIDE) is specifically designed to run on commodity CPUs, and Shrivastava and collaborators from Intel showed it could outperform GPU-based training when they unveiled it at MLSys 2020."
PDS: Quote: "All programming is an exercise in caching." - Terje Mathisen
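To make the quoted "search problem solved with hash tables" idea concrete, here's a minimal sketch of my reading of it, with made-up sizes and a plain random-projection LSH standing in for the paper's actual hashing scheme: hash each neuron's weight vector once into buckets, then for a given input only evaluate the neurons whose bucket the input also hashes into, instead of multiplying against every neuron.

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    D, N, K = 256, 4096, 8                   # input dim, hidden neurons, hash bits (all made up)
    planes = rng.standard_normal((K, D))     # random hyperplanes for the LSH
    W = rng.standard_normal((N, D))          # hidden-layer weight vectors

    def lsh_code(v):
        return tuple((planes @ v) > 0)       # sign pattern of K random projections

    buckets = defaultdict(list)              # hash table: code -> neuron ids
    for n in range(N):
        buckets[lsh_code(W[n])].append(n)

    def forward_sparse(x):
        # Only touch neurons that collide with the input; real implementations
        # use several tables and union the candidates to avoid missing neurons.
        active = buckets.get(lsh_code(x), [])
        return {n: max(0.0, float(W[n] @ x)) for n in active}

This is only a sketch of the flavor of the approach, not the authors' algorithm; SLIDE's own hashing, bucket maintenance during training, and gradient updates are considerably more involved.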
This is currently #1 on the front page, but I think HN is celebrating the defeat of GPUs prematurely. The architectures used in this paper have just one hidden activation layer. It remains to be seen whether the ideas in SLIDE are applicable to general deep learning tasks and architectures.
From the previous paper:
> We choose the standard fully connected neural network with one hidden layer of size 128.
Yep, I feel like anyone who starts playing around with NNs in a Jupyter notebook realizes that you get faster training on the CPU until you level up to bigger networks.
That's normally due to the cost of copying the data to GPU space. Here, they've reformulated the training algorithm to be a less-computationally-complex problem. It's a huge difference.
It's not the first such claim; there was an earlier one that large CPUs can beat GPUs by creating a very sparse matrix using spatial hashing, preserving the gradient but finding it with vastly less computation by simply not multiplying elements that don't have much or any impact.
This is literally the same research group, and the paper you link to is referenced in the article as the predecessor 2020 publication to this announcement of a forthcoming 2021 paper expanding on it.
So yes and no: if you are correct and that paper is indeed the first claim (I'm not up on the literature), this one is also that same first claim, or at least an extension of it.
There are basically no details in the article or the linked one (inside the article). I'll keep an open mind but this is the sort of article that should make you roll your eyes. (◔_◔)
It's legit. For the network types this works on, it's much less computationally complex. A lot of the networks we use now are based on the assumptions of fast matrix multiplication, so it's not universally applicable -- but there's a ton of networks this does work on, and there's a good indication that we can rephrase a model to solve the same problems in the hashtable space.
If I understand the paper correctly, it's because GPUs are good at doing a lot of parallel matrix multiplications, at the cost of copying to and from GPU memory. The CPU has the advantage that it doesn't need to copy, and it's more flexible but not as parallel, so if you use caching and exploit sparsity to limit how much computation actually needs to be done in parallel, the CPU wins, as it does here.
So I don't see the clickbaitiness here, it's the central claim of the paper.
If you are training on the GPU, the bulk of the weight data never leaves the GPU in the first place.
It's a fundamentally different programming style, which makes comparisons extremely difficult.
Furthermore: you hide latency on the GPU by running more and more parallel instances. Yeah, there is a 5 to 10 microsecond delay in all CPU-to-GPU comms and back, but just double, triple, or 100x the parallelism and batch up more tasks (aka Gustafson's law).
Don't run one SLIDE. Run 100 of them in parallel to hide the latency. AMD Vega 64 has 16384 hardware threads (one instruction every 4 clock ticks across 4096 hardware SIMD cores) with up to 10x occupancy. That's 163840 hardware SIMD threads of conceptual execution at maximum. (Occupancy is similar to hyperthreads/SMT.)
In practice, you typically run out of VRAM before running out of threads: 1000 instances use 1000x the RAM.
---------
Anyway, I've seen an entire generation of programmers try to make CPU-only algorithms and fail year after year. The cryptocoin community would generally prefer CPU mining to GPU or ASIC mining.
But time and time again, the algorithms adapt and someone is brilliant enough to make the new alleged CPU algorithm work really well on a GPU.
It's hard to compare in general but it's not hard to compare when you can take the same problem and run it in the "blackbox" of each system and compare performance. That's actually really easy. Now whether you can say "in general" would require much more varied comparisons than just a couple in an academic paper. Most people don't care what's in the black box.
I mean, I can invent an algorithm that runs poorly on x86 CPUs but well on GPUs rather easily.
Ex: GPUs have a bit-reverse instruction, but x86 is missing that instruction. I've invented hash algorithms that only work well on GPUs (or really, architectures with single-cycle bit-reverse) and will run poorly on CPUs.
----------
But when I run the idea on a CPU, I don't do bitreverse. I use bswap64 instead, because x86 has single-cycle bswap. It's technically a different algorithm with a different result, but bswap shuffles the bits around enough that the hash algorithm is still really fast and really good (passes a lot of randomization tests I've done).
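Purely for illustration, here are the two bit-shuffling primitives being contrasted, written out in Python (on the actual hardware each is a single instruction, on GPUs and x86 respectively; the hash mixing around them is omitted):

    def bit_reverse64(x):
        # Reverse the order of all 64 bits (the GPU-friendly primitive).
        return int(format(x & (2**64 - 1), "064b")[::-1], 2)

    def bswap64(x):
        # Reverse the order of the 8 bytes (the x86-friendly primitive).
        return int.from_bytes((x & (2**64 - 1)).to_bytes(8, "little"), "big")

    assert hex(bswap64(0x0123456789ABCDEF)) == "0xefcdab8967452301"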
----------
We're looking at instruction-level differences between two grossly different platforms. Anyone who is an expert at both systems will tell you that optimization and tuning for the two are so grossly different that it's pretty much nonsensical to try to compare them in a "black-box" manner. You need to put some level of effort into processor-level tuning if you want the results to stand.
My understanding is that this works better because it is faster to hash their data (and compare with prehashed weights) rather than multiply the data and calculate activation functions.
I imagine this could easily work for more layers, and layers other than fully connected ones could be handled by partitioning the data (for each set of inputs).
The next obvious question is, can their hashing function run even faster on a GPU?
They say:
>In particular, SLIDE uses recently proposed DTWA hashing which works nicely on sparse data. If we represent the data vector as a list of indices and values. DTWA operation computes a random hash map of every non-zero index in the list. Then the map is partitioned into a fixed number of bins. Finally, the index with the maximum coordinate value in each bin is the hash value.
What is the "coordinate value"?
And how does the above map to finding instances where data and weight values would result in a value above the activation threshold?
They themselves mention it "works well on sparse data".
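For what it's worth, here's a rough sketch of my reading of the quoted hashing step, assuming "coordinate value" just means the data vector's value at that index, and that the "random hash map" is a fixed random permutation of the indices:

    import numpy as np

    def dtwa_hash(indices, values, num_bins, perm):
        # Scatter each non-zero index through a fixed random permutation into
        # one of num_bins bins; in each bin keep the index whose value is largest.
        best_val = np.full(num_bins, -np.inf)
        best_idx = np.full(num_bins, -1)
        for i, v in zip(indices, values):
            b = perm[i] % num_bins       # the "random hash map" of the index
            if v > best_val[b]:          # winner-take-all within the bin
                best_val[b] = v
                best_idx[b] = i
        return best_idx                  # the per-bin winning indices form the hash code

Under that reading, two sparse vectors get similar hash codes when their large coordinates sit at similar indices, which is what would let the engine look up "probably relevant" neurons instead of multiplying against all of them.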
If they detect irrelevant data by simple value comparison (wrapped in that hashing function), couldn't a GPU algorithm be sped up by skipping those multiplications too? How about, instead of running matrix multiplication on the entire data and all weights, we "preprocess" the data and weights, find the specific indices where both are above a certain value, and multiply just those (see the sketch below)? It would be interesting to try that. How expensive would running such a comparison, then a data copy (or building a map of indices to process), then processing only the chosen indices be in comparison with multiplying everything every time?
I'm guessing that it really depends on what percentage of data is actually relevant.
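A toy version of that preprocessing idea, just to make the bookkeeping concrete (dense NumPy and an arbitrary cutoff; as you say, whether it pays off depends entirely on how many entries survive the threshold):

    import numpy as np

    def thresholded_matmul(x, W, eps=1e-3):
        # Only multiply where both the input entry and the weight entry are
        # "large enough"; everything below the cutoff is skipped entirely.
        active_in = np.abs(x) > eps
        out = np.zeros(W.shape[1])
        for j in range(W.shape[1]):
            mask = active_in & (np.abs(W[:, j]) > eps)
            out[j] = x[mask] @ W[mask, j]
        return out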
Another idea to make GPU AI faster is improving the scheduling of those matrix multiplications. Multiplication by zero, by another small number, or by a power of 2 should be able to execute faster than multiplication of large values (assuming multiplication by a power of 2 is implemented in hardware by addition and bit shifting). I don't know whether current GPU algorithms make any use of that, or whether they simply divide the data by the number of cores and run all chunks in parallel for however long the longest chunk takes to complete.
I think GPUs are pretty far from being dead in AI as a result of this.
> Study co-author Nicholas Meisburger, a Rice undergraduate, said "CPUs are still the most prevalent hardware in computing. The benefits of making them more appealing for AI workloads cannot be understated."
I don’t think this statement means what he thinks it means. I think “cannot be overstated” is what is desired.
It all depends on economics; what you have now doesn't really matter a couple of years down the line, when the simple progress of time will show whether this research has been used and built upon or not. 44+ cores is already available for workstations, and with the 4th generation of Threadripper (which is supposed to be shipping this year), that number of cores is probably going to be even cheaper than it is now. On the other hand, GPUs are absurdly expensive right now as a result of multiple factors (people being home and gaming due to covid, crypto miners, lower supply of parts due to covid), but we'll see in a year.
Also, I'm pretty sure the goal of this research was not that you can train a DNN on your laptop, but that there is more democratization in the distribution of training power. Maybe some day a mid-sized organization will be able to train a competitive model on a bunch of off-the-shelf (or rented) servers...
If an algorithm runs faster on CPU than GPU, then it's only a matter of time before it gets optimized for GPU and at the end of the day will still run faster on GPU.
At the very least, generic types of parallel training are applicable to these models (e.g. an ensemble of models that see batches in a different order or are initialized differently).
Researcher time: Data munging, apples-to-apples model evaluation, identifying and fixing broken assumptions only detectable due to subtle model behavior.
Sure, but I assumed he was referring to the biggest computational bottleneck. All of those others are just tooling problems or data quality problems. For DNNs, the hours-to-weeks-long training times make it hard to iterate manually or do any kind of rational optimization of architecture and hyperparameters.
Sure, if we just had stationary, perfect data and perfect objective functions it's only computational, operational, monitoring, and maintenance complexity that holds us back.
Interesting. As far as I know, the Stockfish [1] chess engine uses NNUE nets optimized for the CPU and regularly beats Leela [2], which uses traditional GPU-based networks built on the AlphaZero architecture.
Reminds me of https://en.wikipedia.org/wiki/Cell_(microprocessor) and how that was considered a bad direction for game AI and general game depth; at least some were proclaiming that it would lead to prettier but dumber games. That didn't make much sense to me at the time, and in general I think number-cruncher cores were a boon for AI.
Game AI, being simple symbolic AI, has really branchy, unpredictable instruction streams. The SPEs were really more optimized for GPU-like tasks with clean loops, where you can gang together a few SIMD lanes and process them all at the same time.
All of that being said, I've heard from a few PS3 devs that the simple PPEs didn't really do a great job with that kind of AI code either, due to their extremely simple core designs, including their branch predictors. So it didn't end up being a concern, and a lot of AI ended up running on the SPEs just because they would otherwise be lying around with nothing better to process.
At the end of the day, game AI wasn't really held back by a particular console's design.
Turning back to the nineties, it's easy to see that almost all ventures into specialty hardware ended up with mainstream consumer CPUs catching up and swallowing the niche within a few years as computer science and logic design advanced.
Very few computing tasks turned out to be truly brute-force demanding once people learned how things really work.
Multi-channel audio and sound effects on CPUs, once thought to be impossible, are now everywhere, even on smartphone chips.
Getting there was very tough, though. Even today, very few people can write a software DSP and acoustically correct mixers with real-time performance.
Reading comments in this thread, it feels like people still can't believe that in some cases neural networks can be faster on a CPU than on a GPU.
In fact, there is already a real-world use for neural networks optimized only for the CPU: chess engines. More specifically, chess engines that use NNUE (Efficiently Updatable Neural Networks) [1], like Stockfish 12 [2]. They run much faster while consuming fewer watts compared to a GPU, can run on an average CPU, and managed to beat GPU-based neural networks! [3]
This model (NNUE) existed far earlier than the model discussed in this thread, yet there is almost no discussion about it on HN or Reddit's r/MachineLearning.
NNUE is weird. No one outside the chess/shogi community talks about it because they seem to have a very strong case of not-invented-here syndrome. It's hand-optimised CPU code that doesn't (yet) run on the GPU (which is why it is "faster").
To be fair, they do want to embed it into a consumer-friendly application, and the integration for embedding TF, or something that can run PyTorch models on a GPU without Python, is non-trivial.
There is a PyTorch port available, but no benchmarks unfortunately. It does seem to be fairly widely used for training though, which is indicative of the speed gains available.
GPUs have high latency. For a chess engine like Stockfish, which is designed to search as many positions as possible, the latency of a GPU is a big problem.
Engines like LC0 that do use the GPU work by searching fewer positions but with a heavier eval function. This makes the latency less relevant because it is a smaller percentage of the GPU time.
Board game AI works by searching through the state space, and evaluating each state with the neural network, then picking the move with best expectation.
So it needs to load the comparatively tiny game state (chess board) into the GPU for each evaluation. The more game states it can evaluate per move, the better it is. It can be on the order of millions.
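A back-of-the-envelope number, using the 5-10 microsecond round-trip figure mentioned elsewhere in the thread and the "millions of positions per move" above:

    # Rough arithmetic only: per-position GPU round trips, done one at a time.
    round_trip_s = 7.5e-6                      # midpoint of the 5-10 microsecond figure
    positions_per_move = 1_000_000
    print(positions_per_move * round_trip_s)   # ~7.5 s of pure latency per move

Hence the two strategies we actually see: stay on the CPU entirely (NNUE), or batch positions and search far fewer of them with a heavier evaluation (LC0-style).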
Was the NNUE trained on the CPU, though? It's intuitive that, with the small size of a chessboard and the ability to use incremental updates of evaluations, it will be faster to do neural network evaluation on the CPU. Training, on the other hand, is typically highly parallelizable, so it would be a big development if we were able to do it faster on the CPU.
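On the incremental-update point, the trick is roughly the following (a loose sketch with a made-up feature encoding, not Stockfish's real feature set): the first layer's pre-activation is a sum of weight columns for the board features that are currently "on", so a move only needs to subtract and add a handful of columns instead of recomputing the whole layer.

    import numpy as np

    class Accumulator:
        def __init__(self, W, active_features):
            self.W = W                                        # (hidden_size, num_features)
            self.acc = W[:, list(active_features)].sum(axis=1)

        def apply_move(self, removed, added):
            # A move toggles only a few features, so the update costs a few
            # column additions rather than a full matrix-vector product.
            for f in removed:
                self.acc -= self.W[:, f]
            for f in added:
                self.acc += self.W[:, f]

        def hidden(self):
            return np.clip(self.acc, 0, 1)                    # clipped-ReLU-style activation

That per-move update is tiny and latency-sensitive, which suits a CPU; training, as you say, is a different workload.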
You will find most developers don't actually understand how fast a single x86 core can be when used appropriately. Most are too busy hiding from real hardware concerns to think about such things.
> almost all ventures into specialty hardware ended up with mainstream consumer CPUs catching up
All except for... 3D rendering, image compositing, video encoding, and video decoding?
I'm really struggling to see what you're trying to argue, because we do have multiple meaningful brute-force tasks that most computers today ship with dedicated silicon for.
Yes we can do more real-time audio processing I guess, but that's mainly because CPU speeds increased, not because "people learned how things really worked", or am I missing something?
And the opposite too: "mainframes are too complicated; we'll write an OS that does all IO in the kernel (e.g. Unix) and run it on minis that don't have channel controllers". Come the 2000s, and every disk drive and even keyboard has a processor in it.
And edge-vs-core computing in the network...
The best part is that hardly anyone reads the literature, much less history, so if you've lived through a couple of cycles you can see the pendulum starting to swing back and, to mix metaphors, "skate to where the puck will be".
Amen to that. I do welcome the return of "spread out" computing. I agree ease of use is important, but putting everything on the CPU, microcomputer style, makes for a very imperative notion of ease of use. Ergonomics without that would be much more elegant.
Random thought: computing may very well be spread out one day if computational machinery is cheap to build but limited by heat. I could see computers becoming much, much larger, but with a smaller duty cycle for a particular compute element to assist with thermal management. Imagine building structures, like tables, walls, or skyscrapers, out of nothing but doped silicon.
It's interesting because in some ways, GPUs are much less specialized than CPUs – GPGPU programming mostly involves treating the GPU like a huge, dumb collection of very simple processors. I expect effects in the opposite direction as well, i.e. that programming languages get better at using GPUs for a wider set of workloads. For example, I was recently able to prototype an ETL pipeline with no machine learning or matrix multiplication on a GPU and get a ~10x price/performance improvement, which would have been a lot harder even 5 years ago.
Yeah, I disagree with the comparison of GPGPUs to specialized hardware. It's more a distinction between hardware optimized for throughput of general highly parallel computations and hardware optimized for latency of highly serial computations.
In practice it's pretty tough to make this work. If your ETL can't be turned into just math on a <12GB dataset, then you will be contending with streaming data from io -> CPU -> GPU -> CPU efficiently. CPUs are really good at multi-threaded io given an evented runtime these days, but GPUs can only do one thing at a time. This means that your application will need to batch data to the GPU.
So the real race is between how good you can be at batching vs. parallel dispatch when reading disk from I/O. As parallel dispatch is a more general problem it tends to get more attention.
"Math on datasets that can be parallelized" is a pretty huge swathe of use cases in the Data Science/Data Engineering world, I've probably spent at least 30% of my time on similar problems.
At my first job we used some special SAS product to handle data larger than memory without getting much parallelism, then it was Hadoop, then Spark. Now I can write Julia code that is agnostic over the CPU or GPU and vastly outperforms for the same types of jobs, and where I can run the same code on my laptop or a cluster. It's a huge advance for my domain! I agree it probably doesn't apply to most engineers though.
Even in NLP/Ad-tech/CV spaces a lot of the time is spent in cpu/io-bound tasks such as featurization, reading datasets off of disk, and shuttling them to the GPU. On my most recent model training jobs in TF with 16GB of working memory I sit at an average of ~40% GPU utilization.
Some of this overhead is language specific, and some is due to shitty code. Nevertheless, I'd bet that if I didn't need to shuttle memory to the GPU, or could do multiple things at a time, I'd crush my perf number (noting of course that 40% GPU utilization is still roughly 10x better than a CPU).
Interesting, that makes sense. This inspired me to run some checks. The CPU version of the pipeline I'm working on spends about 90% of its time with ~100% CPU utilization in a massively parallel process and 10% of its time on other stuff, mostly io. With the GPU making the massively parallel part ~10x faster, io is now the dominant portion of my code – Amdahl's law in action!
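Plugging the numbers above into Amdahl's law (parallel fraction p = 0.9, speedup s = 10 on that part, the io-heavy 10% untouched):

    p, s = 0.9, 10
    overall = 1 / ((1 - p) + p / s)
    print(round(overall, 2))   # ~5.26x overall; the untouched io slice now dominates the runtime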
Tensorflow datasets are pretty great, once you get them really rolling. They do a great job of scaling out worker threads for various parts of the featurization process, keeping an arbitrarily big cache of batches ready to go in ram, etc.
Aye - the above job was done with TF datasets. They are limited in that the Python GIL requires multiprocessing, multiprocessing involves serialization in Python, and serialization involves dealing with the rather extreme object overhead in Python (24 bytes for an int!).
Which all means there's a bunch of CPU-bound stuff between your job and the GPU/CUDA kernels. How fast your app can deal with the above will influence overall GPU utilization.
While I appreciate that Python is easy and flexible enough to write, it bugs me immensely every time I run into situations like this.
We go to all this effort to build and write stuff in this optimising, fancy framework only for the whole process to be bottlenecked by some silly performance limitation in Python.
It's usually possible to sidestep the Python limitations with a bit of elbow grease. The usual killer for performance is tf.py_function, which does indeed have to respect the GIL. If you can work out a nice way to handle your data without it, it should be able to stick to the C++ backend and avoid the GIL. (So: data in a format that tf has a parser for, and transformations written with tf methods where you can.)
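As a rough illustration of that advice (the feature names and shapes here are made up; the point is that everything stays in tf ops, so there's no tf.py_function and no GIL involvement):

    import tensorflow as tf

    feature_spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(example):
        # Parsing and transforms all use tf ops, so they run in the C++ backend.
        parsed = tf.io.parse_single_example(example, feature_spec)
        img = tf.io.decode_jpeg(parsed["image"], channels=3)
        img = tf.image.resize(img, [224, 224]) / 255.0
        return img, parsed["label"]

    ds = (tf.data.TFRecordDataset("train.tfrecord")
            .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))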
Wut? Modern GPUs (ahem, Nvidia) can pipeline very effectively. You can copy to and from the GPU while the GPU is busy working on something already loaded into memory. True, host copies suck, but you can hide them behind compute delays in many workloads.
With GPUDirect, you can even skip the CPU and DMA straight between the GPU and I/O (storage or network controller).
Or avoid shuttling the data back and forth and go with a unified architecture like the M1. Memcpy is terrible and yet we keep trying to build bigger and better pipes to make it faster with diminishing returns.
> Or avoid shuttling the data back and forth and go with a unified architecture like the M1.
AMD has been shipping unified architectures since August 2011. Nobody cares.
The problem is that while a small low-power CPU and a small low-power GPU living on the same die and sharing the same memory controller will happily share memory, a big honkin' CPU and a big honkin' GPU don't like to share a memory controller. Some of the big advancements in absolute performance in recent years stem from moving the memory controller closer and closer to the relevant processor. In recent AMD systems, the pins on the CPU are directly wired to the pins on the RAM DIMMs. Relatively old motherboards shipped when DDRx memory was the standard, and were still compatible with DDR(x+1) CPUs and RAM when those became the new standard; the motherboard has no RAM logic. If you wanted a unified architecture, you're either going to make the CPU talk to RAM through the GPU, or the GPU talk to RAM through the CPU, or go back to the early 2000s and put a north bridge on the motherboard.
In addition to all that, the bottlenecks are different. The bottleneck on a GPU is always, 100% of the time, memory bandwidth. Latency is mostly irrelevant. The bottleneck on a CPU is usually (95%? 99%?) the memory latency. There's no such thing as RAM that will perform well for both a CPU and a GPU.
I strongly suspect that this is Nvidia's next big move. A giant CPU/GPU hybrid with unified memory. Who doesn't want threadripper scale Nvidia/ARM cores?
Yes, unified memory being the key here. AFAIK GPUs still use a hub-and-spoke model, where all processing is done in the hub but data is stored on the rim. They achieve their speedup by increasing the number of spokes (data channels). You could do better by moving the data processing to the spokes as well, and using the hub only for orchestration and synchronization.
At least, it's computationally better. But it won't allow scaling the processing power and memory size independently, so it might require a different commercial model.
We only got here because Microsoft was afraid of one vendor, Creative, monopolizing the positional audio market. This led to killing off audio hardware acceleration in Vista and leveling the playing field.
Only it's not purely specialty hardware in this case. It's a massively parallel computation subsystem that happens to be also able to generate a video signal.
This really makes sense in a way. I once read in a paper that the best thing to have when faced with the task of performing an arbitrary calculation with maximal speed is 1) one core that's as fast as possible (for stuff that must be calculated sequentially) and 2) as many slower cores as possible (for stuff that can be calculated in parallel).
That's not how this works. Dedicated hardware for video games won't be on a general GPU anytime soon. Neither will those problems that FPGAs are solving, where the matrices and properties of the calculation are massively parallel and static (or pseudo-static, requiring little change). Same with lidar processing or numerous other end uses. Dedicated hardware (in the form of FPGAs and ASICs) is everywhere.
Both CPU and GPU are in large part limited by power delivery and heat. Putting them both in one chip tends to make both of them worse. I wouldn't count on that reversing any time soon.
It is however curious how CPUs and GPUs become more similar over time, with CPUs getting ever wider SIMD instructions and GPUs getting better at branching code, integer performance, etc.
> Putting them both in one chip tends to make both of them worse.
You don’t need to do that for them to share a single memory address space. Putting them together would be useful if sharing caches between them made sense, but it doesn’t.
Interestingly the later separate i487 chip was in fact a full 486 CPU and 487 FPU on one die, and disabled the 486 CPU in the main CPU socket. Integration really means something, even when the product SKUs aren't there yet.
Oversimplifying a bit, CPUs are smart and can do a lot of different things, and a lot of their chip area is devoted to that intelligence. GPUs are "dumb" but their chip area is heavily devoted to absolutely enormous throughput on really big batches. Looking at AWS prices right now, I can pay ~$4/hour for 50 4 GHz CPU cores or 5,000 1.75 GHz GPU cores. That means the best case scenario for GPUs is ~2,000 times faster. In practice, you'll rarely achieve that, but GPUs can be tremendously faster for extremely parallelizable tasks (which graphics are).
Yes.. if you're doing something that is just doing the same calculation on a whole lot of different inputs, then GPUs absolutely eat that for breakfast.
In my game, I have a complex procedural generation process that occurs while loading a game. It's not a graphics process, so I originally did it on the CPU. It originally took about three seconds to build the data in parallel across seven background CPU threads on my quad-core processor. But testers who were using low-end dual-core i5s only had one background thread to do that same calculation, and typically reported that the procgen took multiple minutes to complete.
After spending a week refactoring the algorithm to do the same calculation on the GPU instead of the CPU (basically by pretending it was a rendering calculation and writing results out into a "texture" that we could read the results from), calculation times dropped from seconds or even minutes to just fractions of a millisecond, even on low-spec machines.
The calculation that I previously had to hide behind a loading screen was now quick enough that I could freely do it at runtime without even causing a blip to the frame rate. If you've got a problem that they can handle, GPUs are kind of astonishingly fast; even the (by modern standards) low-end ones.
If you look at what most graphics pipelines do for 1 pixel, the amount of calculation for that 1 pixel is not very much, maybe a few dozen instructions. But at 1080p that is a lot more (about 2 million times more). GPUs are exceedingly good at doing semi-small programs over and over across 2000+ compute units. At best on a CPU you may get 64 if you have a super nice top-of-the-line CPU (the reality is 2 or 4). The graph of where that changeover happens is going to vary considerably across workloads and instructions used. In most cases currently it heavily favors the GPU. Throw in branching or something like that and the CPU might become more favorable. But you still have to try it out.
In the case of this article, they are using hashing/caching, which, yeah, should produce a fairly nice speedup. Basically the old speedup trick of doing the work once and keeping the result. But that might not translate very nicely to the GPU. Oh, you could get it to run, but it may not be as performant. In the game world it would be like what we used to do with sin/cos: just have a lookup table instead of calling the instruction. We just precalculated it and kept a copy lying around in an array for the most common cases, so it was just a memory lookup and very little compute, reusing the cached result. But that does come at a cost if you have to branch on a miss.
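The sin/cos lookup-table trick, in miniature (table size picked arbitrarily; accuracy trades off against memory, and the branch-on-miss cost doesn't show up here because the table covers the whole period):

    import math

    TABLE_SIZE = 4096
    SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

    def fast_sin(theta):
        # Replace the trig evaluation with an array index; the error is bounded
        # by the table's step size.
        i = int(theta / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
        return SIN_TABLE[i]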
Now if you could combine the two ideas, maybe with some sort of mask telling the GPU 'do not do any work here, it is done already; work on something else' and pre-filling that stuff in, this could be an even more interesting idea.
True. Of course the need for all that parallel processing can sometimes be replaced with intelligence, such as the new video codecs that leverage machine learning. Maybe CPUs will be better for video if those codecs start to dominate.
The geometry calculations across matrices for 3-dimensional euclidean objects are both relatively easy to do directly in hardware and repeated in only slightly varying ways for the various rendering stages.
Given the increasing sizes of framebuffers and textures it makes sense to localize that, although many CPUs now include a GPU equivalent on the same die and many of the same sort of instructions.
A lot of it comes down to how big and independent the working sets are as opposed to strict advantage in computation as well.
Yup. Controversial opinion incoming, but I actually can't wait for the same thing to happen to ARM. The RISC crowd so quickly forgot the purpose of the RISC architecture. Every time I hear someone tell me how the M1 is so fast compared to x86 I just get this picture in my head of a guy in a 1992 Civic DX with a 2 foot wing revving his engine at the Lambo parked next to him. [1]
They argue that we should be benchmarking saturated single-core performance, but then show in their chosen benchmarks that the M1 is very competitive, benchmarking just below an i9 9880 and above a Ryzen 7 4750.
That's a super-fast chip! No excuses like "oh it's Apple's first CPU" are needed - it's right up there.
And in real-world usage it's kind of a useless thing to benchmark. See their note on how they had to wrestle with the Cinebench R23 program to get it to accept the load (threads locked to 2, affinity set after initiating the run but before the benchmark actually started, and reapplied after the first pass), and their remark that "a cleaner execution would almost certainly be welcome. It would also allow us to test cores working at their full potential."
It kind of shows how these micro-benchmarks aren't very reflective of real-world use.
RISC = Reduced Instruction Set Computer. Reduced instructions result in (supposedly) reduced complexity per instruction (which has failed over time). While each instruction is "simpler" it takes more instructions to accomplish a task. It would stand to reason that it would take more clock cycles to accomplish the same goal on ARM than x86. [1]
ARM processors regularly outperform x86 in terms of instructions per cycle. They have to in order to do the same amount of work. That is literally the whole idea of RISC. Simpler instructions executed quickly. [2]
But this is not why they are fundamentally not as capable. The reason for that is the pipeline, or the relative lack of pipelining. [3]
But again, you are comparing a Civic to a Lambo. You have a 3.5 W CPU which is, per watt, the more efficient processor. But that's not good enough. ARM folks love to stand on the efficiency soapbox and preach about performance. You want to have the flexibility of out-of-order execution but don't want to admit the shortcomings that come with it. You have the best performance per watt, but you want best overall performance. Which, if you skew every metric in favor of sliced-up processors and non-real-world benchmarks, then sure... Your phone is almost as fast as an entry-level laptop when running native code on a single core. If x86 appears slower than ARM it's only because ARM fans have thrown everything but the kitchen sink into their test bench and found the absolute most favorable conditions possible. Anything but admit that their $1200 cell phone (which was never designed to be faster than your laptop) is "faster per watt per core per thread" than a $150 x86 processor.
All current Intel/AMD processors (since Intel Core/AMD K6) are RISC cores with a complex instruction decoder in front of them. There's more to a processor core than the instruction set, and both pipelining and out-of-order execution have nothing to do with the instruction set architecture. There have been in-order x86 processors (Transmeta and Via, Intel before the Pentium Pro), and there are plenty out-of-order RISC processors: POWER4 and up, SPARC T4, RISC-V, many ARM Cortex models.
That article is incredibly stupid. From a user perspective, “single core” performance is irrelevant. What matters is single thread performance, and multi thread performance of the entire processor.
They presented an approach that was not cache-optimized, which also showed significant gains. And of course they used a lot of cores; they're comparing against a V100, which is a $7k GPU (maybe $5k if you're lucky).
It’s 44 threads (22 cores). They also compare with tensorflow cpu compiled with SIMD instruction sets on the same hardware.
What other optimizations would you like to see? I would expect the TensorFlow team to already pay pretty close attention to performance in CPU and GPU implementations, not to mention cuDNN and such...
Is there anything preventing the same or a similar algorithmic optimization from being implemented on the GPU though? IIUC, the new algorithm (on the CPU) was compared to an existing algorithm (both on the CPU and GPU).
Code: https://github.com/keroro824/HashingDeepLearning
Edit: Here is the new paper referred to https://arxiv.org/abs/2103.10891
Here is the new code: https://github.com/RUSH-LAB/SLIDE
The paper I linked above is the original paper, of which this new one is supposed to be an improvement.