This is the initial SLIDE paper; the new work is supposed to be an improvement on top of SLIDE. Having looked at and run the original implementation, I'm a lot less skeptical than the average post here. It was quite fast already, and it should be considered a toy implementation.
I think the reason you're getting so many questions about the 'toy' statement is that toy programs are often simple in large part because they omit handling complex or difficult cases. I don't think I've ever heard the term applied to a complete solution to a problem that is merely implemented naively.
As a result, saying it's a toy implementation makes people think you mean the speedup comes from handling only an easy portion of the problem, rather than that even a naive implementation is quite fast.
> this original was a toy/PoC and this new paper is much better
This, one hundred percent. The original has very few optimizations over a completely naive implementation; it uses MPI and huge pages, and that's essentially it.
After a quick read of the paper, no. You could adapt this to the GPU (which would require that the hashes work on groups of neurons instead of individual ones) and might get a similar speedup. Locality-sensitive hashing in fact seems like a primitive attention mechanism; with a proper attention implementation you might get even better results.
The title is a bit misleading as this algorithm is for feedforward networks and doesn't yet support convolutional layers or any of the SOTA techniques for image classification... which is why GPUs reign supreme for training deep neural nets.
NeRF is a good example of a network that doesn't have convolutions yet requires a ton of iterations to train. This paper is particularly relevant to wide networks which are important because CPU memory is currently much cheaper than GPU memory (even for FANG researchers!).
Interesting, I didn't know that NeRF was simply a feedforward network.
I hope that this research group can make more headway into training on CPUs, but I also would like to (naively) see less hyperbolic titles. This paper is not just particularly relevant to wide networks - it's only relevant to wide networks.
This is why I come to HN: to find out why it doesn't work in the general case. I can always count on you guys to point out why something is an evolutionary change rather than a revolutionary one.
The original paper includes convolutional layer support in their future work & next steps. But it's not a foregone conclusion that the same speedup will occur.
What? No. Fully connected networks are deep learning, and actually the most important deep learning workload. See: Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective:
Transformers, which are currently waging a successful campaign to conquer all Deep Learning, are largely stacked feed-forward networks, matrix multiplies and maps. Some ideas to make attention more scalable, such as LSH or large sparse attention matrices seem like they'd be well suited to this approach.
Their approach should also be readily adaptable to RNNs, including LSTMs.
Certainly worth investigating as an alternative for efficiently running and training giant networks on less expensive hardware.
>"Deep neural networks (DNN) are a powerful form of artificial intelligence that can outperform humans at some tasks. DNN training is typically a series of matrix multiplication operations, an ideal workload for graphics processing units (GPUs), which cost about three times more than general purpose central processing units (CPUs).
"The whole industry is fixated on one kind of improvement—faster matrix multiplications," Shrivastava said. "Everyone is looking at specialized hardware and architectures to push matrix multiplication. People are now even talking about having specialized hardware-software stacks for specific kinds of deep learning. Instead of taking an expensive algorithm and throwing the whole world of system optimization at it, I'm saying, 'Let's revisit the algorithm.'"
Shrivastava's lab did that in 2019, recasting DNN training as a search problem that could be solved with hash tables. Their "sub-linear deep learning engine" (SLIDE) is specifically designed to run on commodity CPUs, and Shrivastava and collaborators from Intel showed it could outperform GPU-based training when they unveiled it at MLSys 2020."
PDS: Quote: "All programming is an exercise in caching." - Terje Mathisen
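To make the quoted "search problem solved with hash tables" idea concrete, here's a minimal sketch of my reading of it, with made-up sizes and a plain random-projection LSH standing in for the paper's actual hashing scheme: hash each neuron's weight vector once into buckets, then for a given input only evaluate the neurons whose bucket the input also hashes into, instead of multiplying against every neuron.

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    D, N, K = 256, 4096, 8                   # input dim, hidden neurons, hash bits (all made up)
    planes = rng.standard_normal((K, D))     # random hyperplanes for the LSH
    W = rng.standard_normal((N, D))          # hidden-layer weight vectors

    def lsh_code(v):
        return tuple((planes @ v) > 0)       # sign pattern of K random projections

    buckets = defaultdict(list)              # hash table: code -> neuron ids
    for n in range(N):
        buckets[lsh_code(W[n])].append(n)

    def forward_sparse(x):
        # Only touch neurons that collide with the input; real implementations
        # use several tables and union the candidates to avoid missing neurons.
        active = buckets.get(lsh_code(x), [])
        return {n: max(0.0, float(W[n] @ x)) for n in active}

This is only a sketch of the flavor of the approach, not the authors' algorithm; SLIDE's own hashing, bucket maintenance during training, and gradient updates are considerably more involved.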
This is currently #1 on the front page, but I think HN is celebrating the defeat of GPUs prematurely. The architectures used in this paper have just one hidden activation layer. It remains to be seen whether the ideas in SLIDE are applicable to general deep learning tasks and architectures.
From the previous paper:
> We choose the standard fully connected neural network with one hidden layer of size 128.
Yep, I feel like anyone who starts playing around with NNs in a Jupyter notebook realizes that you get faster training on the CPU until you level up to bigger networks.
That's normally due to the cost of copying the data to GPU space. Here, they've reformulated the training algorithm to be a less-computationally-complex problem. It's a huge difference.
It's not the first such claim; there was an earlier one that large CPUs can beat GPUs by creating a very sparse matrix using spatial hashing, preserving the gradient but finding it with vastly less computation by simply not multiplying elements that don't have much or any impact.
This is literally the same research group, and the paper you link to is referenced in the article as the predecessor 2020 publication to this announcement of a forthcoming 2021 paper expanding on it.
So yes and no: if you are correct and that paper is indeed the first claim (I'm not up on the literature), this one is also that same first claim, or at least an extension of it.
There are basically no details in the article or the linked one (inside the article). I'll keep an open mind but this is the sort of article that should make you roll your eyes. (◔_◔)
It's legit. For the network types this works on, it's much less computationally complex. A lot of the networks we use now are based on the assumptions of fast matrix multiplication, so it's not universally applicable -- but there's a ton of networks this does work on, and there's a good indication that we can rephrase a model to solve the same problems in the hashtable space.
If I understand the paper correctly, it's because GPUs are good at doing a lot of parallel matrix multiplications, at the cost of copying to and from GPU memory. The CPU has the advantage that it doesn't need to copy, and it's more flexible but not as parallel, so if you use caching and exploit sparsity to limit how much computation actually needs to be done in parallel, the CPU wins, as it does here.
So I don't see the clickbaitiness here, it's the central claim of the paper.
If you are training on the GPU, the bulk of the weight data never leaves the GPU in the first place.
It's a fundamentally different programming style, which makes comparisons extremely difficult.
Furthermore: you hide latency on the GPU by running more and more parallel instances. Yeah, there is a 5 to 10 microsecond delay in all CPU-to-GPU comms and back, but just double, triple, or 100x the parallelism and batch up more tasks (aka Gustafson's law).
Don't run one SLIDE. Run 100 of them in parallel to hide the latency. AMD Vega 64 has 16384 hardware threads (one instruction every 4 clock ticks across 4096 hardware SIMD cores) with up to 10x occupancy. That's 163840 hardware SIMD threads of conceptual execution at maximum. (Occupancy is similar to hyperthreads/SMT.)
In practice, you typically run out of VRAM before running out of threads: 1000 instances use 1000x the RAM.
---------
Anyway, I've seen an entire generation of programmers try to make CPU-only algorithms and fail year after year. The cryptocoin community would generally prefer CPU mining to GPU or ASIC mining.
But time and time again, the algorithms adapt and someone is brilliant enough to make the new alleged CPU algorithm work really well on a GPU.
It's hard to compare in general but it's not hard to compare when you can take the same problem and run it in the "blackbox" of each system and compare performance. That's actually really easy. Now whether you can say "in general" would require much more varied comparisons than just a couple in an academic paper. Most people don't care what's in the black box.
I mean, I can invent an algorithm that runs poorly on x86 CPUs but well on GPUs rather easily.
Ex: GPUs have a bit-reverse instruction, but x86 is missing that instruction. I've invented hash algorithms that only work well on GPUs (or really, architectures with single-cycle bit-reverse) and will run poorly on CPUs.
----------
But when I run the idea on a CPU, I don't do bitreverse. I use bswap64 instead, because x86 has single-cycle bswap. It's technically a different algorithm with a different result, but bswap shuffles the bits around enough that the hash algorithm is still really fast and really good (passes a lot of randomization tests I've done).
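Purely for illustration, here are the two bit-shuffling primitives being contrasted, written out in Python (on the actual hardware each is a single instruction, on GPUs and x86 respectively; the hash mixing around them is omitted):

    def bit_reverse64(x):
        # Reverse the order of all 64 bits (the GPU-friendly primitive).
        return int(format(x & (2**64 - 1), "064b")[::-1], 2)

    def bswap64(x):
        # Reverse the order of the 8 bytes (the x86-friendly primitive).
        return int.from_bytes((x & (2**64 - 1)).to_bytes(8, "little"), "big")

    assert hex(bswap64(0x0123456789ABCDEF)) == "0xefcdab8967452301"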
----------
We're looking at instruction-level differences between two grossly different platforms. Anyone who is an expert at both systems will tell you that optimization and tuning for the two are so grossly different that it's pretty much nonsensical to try to compare them in a "black-box" manner. You need to put some level of effort into processor-level tuning if you want the results to stand.
My understanding is that this works better because it is faster to hash their data (and compare with prehashed weights) rather than multiply the data and calculate activation functions.
I imagine this could easily work for more layers, and layers other than fully connected ones could be handled by partitioning the data (for each set of inputs).
The next obvious question is, can their hashing function run even faster on a GPU?
They say:
>In particular, SLIDE uses recently proposed DTWA hashing which works nicely on sparse data. If we represent the data vector as a list of indices and values. DTWA operation computes a random hash map of every non-zero index in the list. Then the map is partitioned into a fixed number of bins. Finally, the index with the maximum coordinate value in each bin is the hash value.
What is the "coordinate value"?
And how does the above map to finding instances where data and weight values would result in a value above the activation threshold?
They themselves mention it "works well on sparse data".
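For what it's worth, here's a rough sketch of my reading of the quoted hashing step, assuming "coordinate value" just means the data vector's value at that index, and that the "random hash map" is a fixed random permutation of the indices:

    import numpy as np

    def dtwa_hash(indices, values, num_bins, perm):
        # Scatter each non-zero index through a fixed random permutation into
        # one of num_bins bins; in each bin keep the index whose value is largest.
        best_val = np.full(num_bins, -np.inf)
        best_idx = np.full(num_bins, -1)
        for i, v in zip(indices, values):
            b = perm[i] % num_bins       # the "random hash map" of the index
            if v > best_val[b]:          # winner-take-all within the bin
                best_val[b] = v
                best_idx[b] = i
        return best_idx                  # the per-bin winning indices form the hash code

Under that reading, two sparse vectors get similar hash codes when their large coordinates sit at similar indices, which is what would let the engine look up "probably relevant" neurons instead of multiplying against all of them.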
If they detect irrelevant data by simple value comparison (wrapped in that hashing function), couldn't a GPU algorithm be sped up by skipping those multiplications too? How about, instead of running matrix multiplication on the entire data and all weights, we "preprocess" the data and weights, find the specific indices where both are above a certain value, and multiply just those (see the sketch below)? It would be interesting to try that. How expensive would running such a comparison, then a data copy (or building a map of indices to process), then processing only the chosen indices be in comparison with multiplying everything every time?
I'm guessing that it really depends on what percentage of data is actually relevant.
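A toy version of that preprocessing idea, just to make the bookkeeping concrete (dense NumPy and an arbitrary cutoff; as you say, whether it pays off depends entirely on how many entries survive the threshold):

    import numpy as np

    def thresholded_matmul(x, W, eps=1e-3):
        # Only multiply where both the input entry and the weight entry are
        # "large enough"; everything below the cutoff is skipped entirely.
        active_in = np.abs(x) > eps
        out = np.zeros(W.shape[1])
        for j in range(W.shape[1]):
            mask = active_in & (np.abs(W[:, j]) > eps)
            out[j] = x[mask] @ W[mask, j]
        return out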
Another idea to make GPU AI faster is improving the scheduling of those matrix multiplications. Multiplication by zero, by another small number, or by a power of 2 should be able to execute faster than multiplication of large values (assuming multiplication by a power of 2 is implemented in hardware by addition and bit shifting). I don't know whether current GPU algorithms make any use of that, or whether they simply divide the data by the number of cores and run all chunks in parallel for however long the longest chunk takes to complete.
I think GPUs are pretty far from being dead in AI as a result of this.
> Study co-author Nicholas Meisburger, a Rice undergraduate, said "CPUs are still the most prevalent hardware in computing. The benefits of making them more appealing for AI workloads cannot be understated."
I don’t think this statement means what he thinks it means. I think “cannot be overstated” is what is desired.
It all depends on economics; what you have now doesn't really matter a couple of years down the line, when the simple progress of time will show whether this research has been used and built upon or not. 44+ cores is already available for workstations, and with the 4th generation of Threadripper (which is supposed to be shipping this year), that number of cores is probably going to be even cheaper than it is now. On the other hand, GPUs are absurdly expensive right now as a result of multiple factors (people being home and gaming due to covid, crypto miners, lower supply of parts due to covid), but we'll see in a year.
Also, I'm pretty sure the goal of this research was not that you can train a DNN on your laptop, but that there is more democratization in the distribution of training power. Maybe some day a mid-sized organization will be able to train a competitive model on a bunch of off-the-shelf (or rented) servers...
If an algorithm runs faster on CPU than GPU, then it's only a matter of time before it gets optimized for GPU and at the end of the day will still run faster on GPU.
At the very least, generic types of parallel training are applicable to these models (e.g. an ensemble of models that see batches in a different order or are initialized differently).
Researcher time: Data munging, apples-to-apples model evaluation, identifying and fixing broken assumptions only detectable due to subtle model behavior.
Sure, but I assumed he was referring to the biggest computational bottleneck. All of those others are just tooling problems or data quality problems. For DNNs, the hours-to-weeks-long training times make it hard to iterate manually or do any kind of rational optimization of architecture and hyperparameters.
Sure, if we just had stationary, perfect data and perfect objective functions it's only computational, operational, monitoring, and maintenance complexity that holds us back.
Interesting. As far as I know, the Stockfish [1] chess engine uses NNUE nets optimized for the CPU and regularly beats Leela [2], which uses traditional GPU-based networks built on the AlphaZero architecture.
Reminds me of https://en.wikipedia.org/wiki/Cell_(microprocessor) and how that was considered a bad direction for game AI and general game depth; at least some were proclaiming that it would lead to prettier but dumber games. That didn't make much sense to me at the time, and in general I think number-cruncher cores were a boon for AI.
Game AI, being simple symbolic AI, has really branchy, unpredictable instruction streams. The SPEs were really more optimized for GPU-like tasks with clean loops, where you can gang together a few SIMD lanes and process them all at the same time.
All of that being said, I've heard from a few PS3 devs that the simple PPEs didn't really do a great job with that kind of AI code either, due to their extremely simple core designs, including their branch predictors. So it didn't end up being a concern, and a lot of AI ended up running on the SPEs just because they would otherwise be lying around with nothing better to process.
At the end of the day, game AI wasn't really held back by a particular console's design.
Turning back to the nineties, it's easy to see that almost all ventures into specialty hardware ended up with mainstream consumer CPUs catching up and swallowing the niche within a few years as computer science and logic design advanced.
Very few computing tasks turned out to be truly brute-force demanding once people learned how things really work.
Multi-channel audio and sound effects on CPUs, once thought to be impossible, are now everywhere, even on smartphone chips.
Getting there was very tough, though. Even today, very few people can write a software DSP and acoustically correct mixers with real-time performance.
Reading comments in this thread, it feels like people still can't believe that in some cases neural networks can be faster on a CPU than on a GPU.
In fact, there is already a real-world use for neural networks optimized only for the CPU: chess engines. More specifically, chess engines that use NNUE (Efficiently Updatable Neural Networks) [1], like Stockfish 12 [2]. They run much faster while consuming fewer watts compared to a GPU, can run on an average CPU, and managed to beat GPU-based neural networks! [3]
This model (NNUE) existed far earlier than the model discussed in this thread, yet there is almost no discussion about it on HN or Reddit's r/MachineLearning.
NNUE is weird. No one outside the chess/shogi community talks about it because they seem to have a very strong case of not-invented-here syndrome. It's hand-optimised CPU code that doesn't (yet) run on the GPU (which is why it is "faster").
To be fair, they do want to embed it into a consumer-friendly application, and the integration for embedding TF, or something that can run PyTorch models on a GPU without Python, is non-trivial.
There is a PyTorch port available, but no benchmarks unfortunately. It does seem to be fairly widely used for training though, which is indicative of the speed gains available.
GPUs have high latency. For a chess engine like Stockfish, which is designed to search as many positions as possible, the latency of a GPU is a big problem.
Engines like LC0 that do use the GPU work by searching fewer positions but with a heavier eval function. This makes the latency less relevant because it is a smaller percentage of the GPU time.
Board game AI works by searching through the state space, and evaluating each state with the neural network, then picking the move with best expectation.
So it needs to load the comparatively tiny game state (chess board) into the GPU for each evaluation. The more game states it can evaluate per move, the better it is. It can be on the order of millions.
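A back-of-the-envelope number, using the 5-10 microsecond round-trip figure mentioned elsewhere in the thread and the "millions of positions per move" above:

    # Rough arithmetic only: per-position GPU round trips, done one at a time.
    round_trip_s = 7.5e-6                      # midpoint of the 5-10 microsecond figure
    positions_per_move = 1_000_000
    print(positions_per_move * round_trip_s)   # ~7.5 s of pure latency per move

Hence the two strategies we actually see: stay on the CPU entirely (NNUE), or batch positions and search far fewer of them with a heavier evaluation (LC0-style).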
Was the NNUE trained on the CPU, though? It's intuitive that, with the small size of a chessboard and the ability to use incremental updates of evaluations, it will be faster to do neural network evaluation on the CPU. Training, on the other hand, is typically highly parallelizable, so it would be a big development if we were able to do it faster on the CPU.
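On the incremental-update point, the trick is roughly the following (a loose sketch with a made-up feature encoding, not Stockfish's real feature set): the first layer's pre-activation is a sum of weight columns for the board features that are currently "on", so a move only needs to subtract and add a handful of columns instead of recomputing the whole layer.

    import numpy as np

    class Accumulator:
        def __init__(self, W, active_features):
            self.W = W                                        # (hidden_size, num_features)
            self.acc = W[:, list(active_features)].sum(axis=1)

        def apply_move(self, removed, added):
            # A move toggles only a few features, so the update costs a few
            # column additions rather than a full matrix-vector product.
            for f in removed:
                self.acc -= self.W[:, f]
            for f in added:
                self.acc += self.W[:, f]

        def hidden(self):
            return np.clip(self.acc, 0, 1)                    # clipped-ReLU-style activation

That per-move update is tiny and latency-sensitive, which suits a CPU; training, as you say, is a different workload.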
You will find most developers don't actually understand how fast a single x86 core can be when used appropriately. Most are too busy hiding from real hardware concerns to think about such things.
> almost all ventures into specialty hardware ended up with mainstream consumer CPUs catching up
All except for... 3D rendering, image compositing, video encoding, and video decoding?
I'm really struggling to see what you're trying to argue, because we do have multiple meaningful brute-force tasks that most computers today ship with dedicated silicon for.
Yes we can do more real-time audio processing I guess, but that's mainly because CPU speeds increased, not because "people learned how things really worked", or am I missing something?
And the opposite too: "mainframes are too complicated; we'll write an OS that does all IO in the kernel (e.g. Unix) and run it on minis that don't have channel controllers". Come the 2000s, and every disk drive and even keyboard has a processor in it.
And edge-vs-core computing in the network...
The best part is that hardly anyone reads the literature, much less history, so if you've lived through a couple of cycles you can see the pendulum starting to swing back and, to mix metaphors, "skate to where the puck will be".
Amen to that. I do welcome the return of "spread out" computing. I agree ease of use is important, but putting everything on the CPU, microcomputer style, makes for a very imperative notion of ease of use. Ergonomics without that would be much more elegant.
Random thought: computing may very well be spread out one day if computational machinery is cheap to build but limited by heat. I could see computers becoming much, much larger, but with a smaller duty cycle for a particular compute element to assist with thermal management. Imagine building structures, like tables, walls, or skyscrapers, out of nothing but doped silicon.
It's interesting because in some ways, GPUs are much less specialized than CPUs – GPGPU programming mostly involves treating the GPU like a huge, dumb collection of very simple processors. I expect effects in the opposite direction as well, i.e. that programming languages get better at using GPUs for a wider set of workloads. For example, I was recently able to prototype an ETL pipeline with no machine learning or matrix multiplication on a GPU and get a ~10x price/performance improvement, which would have been a lot harder even 5 years ago.
Yeah, I disagree with the comparison of GPGPUs to specialized hardware. It's more a distinction between hardware optimized for throughput of general highly parallel computations and hardware optimized for latency of highly serial computations.
In practice it's pretty tough to make this work. If your ETL can't be turned into just math on a <12GB dataset, then you will be contending with streaming data from io -> CPU -> GPU -> CPU efficiently. CPUs are really good at multi-threaded io given an evented runtime these days, but GPUs can only do one thing at a time. This means that your application will need to batch data to the GPU.
So the real race is between how good you can be at batching vs. parallel dispatch when reading disk from I/O. As parallel dispatch is a more general problem it tends to get more attention.
"Math on datasets that can be parallelized" is a pretty huge swathe of use cases in the Data Science/Data Engineering world, I've probably spent at least 30% of my time on similar problems.
At my first job we used some special SAS product to handle data larger than memory without getting much parallelism, then it was Hadoop, then Spark. Now I can write Julia code that is agnostic over the CPU or GPU and vastly outperforms for the same types of jobs, and where I can run the same code on my laptop or a cluster. It's a huge advance for my domain! I agree it probably doesn't apply to most engineers though.
Even in NLP/Ad-tech/CV spaces a lot of the time is spent in cpu/io-bound tasks such as featurization, reading datasets off of disk, and shuttling them to the GPU. On my most recent model training jobs in TF with 16GB of working memory I sit at an average of ~40% GPU utilization.
Some of this overhead is language specific, and some is due to shitty code. Nevertheless, I'd bet that if I didn't need to shuttle memory to the GPU, or could do multiple things at a time, I'd crush my perf number (noting of course that 40% GPU utilization is still roughly 10x better than a CPU).
Interesting, that makes sense. This inspired me to run some checks. The CPU version of the pipeline I'm working on spends about 90% of its time with ~100% CPU utilization in a massively parallel process and 10% of its time on other stuff, mostly io. With the GPU making the massively parallel part ~10x faster, io is now the dominant portion of my code – Amdahl's law in action!
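Plugging the numbers above into Amdahl's law (parallel fraction p = 0.9, speedup s = 10 on that part, the io-heavy 10% untouched):

    p, s = 0.9, 10
    overall = 1 / ((1 - p) + p / s)
    print(round(overall, 2))   # ~5.26x overall; the untouched io slice now dominates the runtime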
Tensorflow datasets are pretty great, once you get them really rolling. They do a great job of scaling out worker threads for various parts of the featurization process, keeping an arbitrarily big cache of batches ready to go in ram, etc.
Aye - the above job was done with TF datasets. They are limited in that the Python GIL requires multiprocessing, multiprocessing involves serialization in Python, and serialization involves dealing with the rather extreme object overhead in Python (24 bytes for an int!).
Which all means there's a bunch of CPU-bound stuff between your job and the GPU/CUDA kernels. How fast your app can deal with the above will influence overall GPU utilization.
While I appreciate that Python is easy and flexible enough to write, it bugs me immensely every time I run into situations like this.
We go to all this effort to build and write stuff in this optimising, fancy framework only for the whole process to be bottlenecked by some silly performance limitation in Python.
It's usually possible to sidestep the Python limitations with a bit of elbow grease. The usual killer for performance is tf.py_function, which does indeed have to respect the GIL. If you can work out a nice way to handle your data without it, it should be able to stick to the C++ backend and avoid the GIL. (So: data in a format that tf has a parser for, and transformations written with tf methods where you can.)
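As a rough illustration of that advice (the feature names and shapes here are made up; the point is that everything stays in tf ops, so there's no tf.py_function and no GIL involvement):

    import tensorflow as tf

    feature_spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def parse(example):
        # Parsing and transforms all use tf ops, so they run in the C++ backend.
        parsed = tf.io.parse_single_example(example, feature_spec)
        img = tf.io.decode_jpeg(parsed["image"], channels=3)
        img = tf.image.resize(img, [224, 224]) / 255.0
        return img, parsed["label"]

    ds = (tf.data.TFRecordDataset("train.tfrecord")
            .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))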
Wut? Modern GPUs (ahem, Nvidia) can pipeline very effectively. You can copy to and from the GPU while the GPU is busy working on something already loaded into memory. True, host copies suck, but you can hide them behind compute delays in many workloads.
With GPUDirect, you can even skip the CPU and DMA straight between the GPU and I/O (storage or network controller).
Or avoid shuttling the data back and forth and go with a unified architecture like the M1. Memcpy is terrible and yet we keep trying to build bigger and better pipes to make it faster with diminishing returns.
> Or avoid shuttling the data back and forth and go with a unified architecture like the M1.
AMD has been shipping unified architectures since August 2011. Nobody cares.
The problem is that while a small low-power CPU and a small low-power GPU living on the same die and sharing the same memory controller will happily share memory, a big honkin' CPU and a big honkin' GPU don't like to share a memory controller. Some of the big advancements in absolute performance in recent years stem from moving the memory controller closer and closer to the relevant processor. In recent AMD systems, the pins on the CPU are directly wired to the pins on the RAM DIMMs. Relatively old motherboards shipped when DDRx memory was the standard, and were still compatible with DDR(x+1) CPUs and RAM when those became the new standard; the motherboard has no RAM logic. If you wanted a unified architecture, you're either going to make the CPU talk to RAM through the GPU, or the GPU talk to RAM through the CPU, or go back to the early 2000s and put a north bridge on the motherboard.
In addition to all that, the bottlenecks are different. The bottleneck on a GPU is always, 100% of the time, memory bandwidth. Latency is mostly irrelevant. The bottleneck on a CPU is usually (95%? 99%?) the memory latency. There's no such thing as RAM that will perform well for both a CPU and a GPU.
I strongly suspect that this is Nvidia's next big move. A giant CPU/GPU hybrid with unified memory. Who doesn't want threadripper scale Nvidia/ARM cores?
Yes, unified memory being the key here. AFAIK GPUs still use a hub-and-spoke model, where all processing is done in the hub but data is stored on the rim. They achieve their speedup by increasing the number of spokes (data channels). You could do better by moving the data processing to the spokes as well, and using the hub only for orchestration and synchronization.
At least, it's computationally better. But it won't allow scaling the processing power and memory size independently, so it might require a different commercial model.
We only got here because Microsoft was afraid of one vendor, Creative, monopolizing the positional audio market. This led to killing off audio hardware acceleration in Vista and leveling the playing field.
Only it's not purely specialty hardware in this case. It's a massively parallel computation subsystem that happens to be also able to generate a video signal.
This really makes sense in a way. I once read in a paper that the best thing to have when faced with the task of performing an arbitrary calculation with maximal speed is 1) one core that's as fast as possible (for stuff that must be calculated sequentially) and 2) as many slower cores as possible (for stuff that can be calculated in parallel).
That's not how this works. Dedicated hardware for video games won't be on a general GPU anytime soon. Neither will those problems that FPGAs are solving, where the matrices and properties of the calculation are massively parallel and static (or pseudo-static, requiring little change). Same with lidar processing or numerous other end uses. Dedicated hardware (in the form of FPGAs and ASICs) is everywhere.
Both CPU and GPU are in large part limited by power delivery and heat. Putting them both in one chip tends to make both of them worse. I wouldn't count on that reversing any time soon.
It is however curious how CPUs and GPUs become more similar over time, with CPUs getting ever wider SIMD instructions and GPUs getting better at branching code, integer performance, etc.
> Putting them both in one chip tends to make both of them worse.
You don’t need to do that for them to share a single memory address space. Putting them together would be useful if sharing caches between them made sense, but it doesn’t.
Interestingly the later separate i487 chip was in fact a full 486 CPU and 487 FPU on one die, and disabled the 486 CPU in the main CPU socket. Integration really means something, even when the product SKUs aren't there yet.
Oversimplifying a bit, CPUs are smart and can do a lot of different things, and a lot of their chip area is devoted to that intelligence. GPUs are "dumb" but their chip area is heavily devoted to absolutely enormous throughput on really big batches. Looking at AWS prices right now, I can pay ~$4/hour for 50 4 GHz CPU cores or 5,000 1.75 GHz GPU cores. That means the best case scenario for GPUs is ~2,000 times faster. In practice, you'll rarely achieve that, but GPUs can be tremendously faster for extremely parallelizable tasks (which graphics are).
Yes.. if you're doing something that is just doing the same calculation on a whole lot of different inputs, then GPUs absolutely eat that for breakfast.
In my game, I have a complex procedural generation process that occurs while loading a game. It's not a graphics process, so I originally did it on the CPU. It originally took about three seconds to build the data in parallel across seven background CPU threads on my quad-core processor. But testers who were using low-end dual-core i5s only had one background thread to do that same calculation, and typically reported that the procgen took multiple minutes to complete.
After spending a week refactoring the algorithm to do the same calculation on the GPU instead of the CPU (basically by pretending it was a rendering calculation and writing results out into a "texture" that we could read the results from), calculation times dropped from seconds or even minutes to just fractions of a millisecond, even on low-spec machines.
The calculation that I previously had to hide behind a loading screen was now quick enough that I could freely do it at runtime without even causing a blip to the frame rate. If you've got a problem that they can handle, GPUs are kind of astonishingly fast; even the (by modern standards) low-end ones.
If you look at what most graphics pipelines do for 1 pixel, the amount of calculation for that 1 pixel is not very much, maybe a few dozen instructions. But at 1080p that is a lot more (about 2 million times more). GPUs are exceedingly good at doing semi-small programs over and over across 2000+ compute units. At best on a CPU you may get 64 if you have a super nice top-of-the-line CPU (the reality is 2 or 4). The graph of where that changeover happens is going to vary considerably across workloads and instructions used. In most cases currently it heavily favors the GPU. Throw in branching or something like that and the CPU might become more favorable. But you still have to try it out.
In the case of this article, they are using hashing/caching, which, yeah, should produce a fairly nice speedup. Basically the old speedup trick of doing the work once and keeping the result. But that might not translate very nicely to the GPU. Oh, you could get it to run, but it may not be as performant. In the game world it would be like what we used to do with sin/cos: just have a lookup table instead of calling the instruction. We just precalculated it and kept a copy lying around in an array for the most common cases, so it was just a memory lookup and very little compute, reusing the cached result. But that does come at a cost if you have to branch on a miss.
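The sin/cos lookup-table trick, in miniature (table size picked arbitrarily; accuracy trades off against memory, and the branch-on-miss cost doesn't show up here because the table covers the whole period):

    import math

    TABLE_SIZE = 4096
    SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

    def fast_sin(theta):
        # Replace the trig evaluation with an array index; the error is bounded
        # by the table's step size.
        i = int(theta / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
        return SIN_TABLE[i]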
Now if you could combine the two ideas, maybe with some sort of mask telling the GPU 'do not do any work here, it is done already; work on something else' and pre-filling that stuff in, this could be an even more interesting idea.
True. Of course the need for all that parallel processing can sometimes be replaced with intelligence, such as the new video codecs that leverage machine learning. Maybe CPUs will be better for video if those codecs start to dominate.
The geometry calculations across matrices for 3-dimensional euclidean objects are both relatively easy to do directly in hardware and repeated in only slightly varying ways for the various rendering stages.
Given the increasing sizes of framebuffers and textures it makes sense to localize that, although many CPUs now include a GPU equivalent on the same die and many of the same sort of instructions.
A lot of it comes down to how big and independent the working sets are as opposed to strict advantage in computation as well.
Yup. Controversial opinion incoming, but I actually can't wait for the same thing to happen to ARM. The RISC crowd so quickly forgot the purpose of the RISC architecture. Every time I hear someone tell me how the M1 is so fast compared to x86 I just get this picture in my head of a guy in a 1992 Civic DX with a 2 foot wing revving his engine at the Lambo parked next to him. [1]
They argue that we should be benchmarking saturated single-core performance, but then show in their chosen benchmarks that the M1 is very competitive, benchmarking just below an i9 9880 and above a Ryzen 7 4750.
That's a super-fast chip! No excuses like "oh it's Apple's first CPU" are needed - it's right up there.
And in real-world usage it's kind of a useless thing to benchmark. See their note on how they had to wrestle with the Cinebench R23 program to get it to accept the load (threads locked to 2, affinity set after initiating the run but before the benchmark actually started, and reapplied after the first pass), and their remark that "a cleaner execution would almost certainly be welcome. It would also allow us to test cores working at their full potential."
It kind of shows how these micro-benchmarks aren't very reflective of real-world use.
RISC = Reduced Instruction Set Computer. Reduced instructions result in (supposedly) reduced complexity per instruction (which has failed over time). While each instruction is "simpler" it takes more instructions to accomplish a task. It would stand to reason that it would take more clock cycles to accomplish the same goal on ARM than x86. [1]
ARM processors regularly outperform x86 in terms of instructions per cycle. They have to in order to do the same amount of work. That is literally the whole idea of RISC. Simpler instructions executed quickly. [2]
But this is not why they are fundamentally not as capable. The reason for that is the pipeline, or the relative lack of pipelining. [3]
But again, you are comparing a Civic to a Lambo. You have a 3.5 W CPU which is, per watt, the more efficient processor. But that's not good enough. ARM folks love to stand on the efficiency soapbox and preach about performance. You want to have the flexibility of out-of-order execution but don't want to admit the shortcomings that come with it. You have the best performance per watt, but you want best overall performance. Which, if you skew every metric in favor of sliced-up processors and non-real-world benchmarks, then sure... Your phone is almost as fast as an entry-level laptop when running native code on a single core. If x86 appears slower than ARM it's only because ARM fans have thrown everything but the kitchen sink into their test bench and found the absolute most favorable conditions possible. Anything but admit that their $1200 cell phone (which was never designed to be faster than your laptop) is "faster per watt per core per thread" than a $150 x86 processor.
All current Intel/AMD processors (since Intel Core/AMD K6) are RISC cores with a complex instruction decoder in front of them. There's more to a processor core than the instruction set, and both pipelining and out-of-order execution have nothing to do with the instruction set architecture. There have been in-order x86 processors (Transmeta and Via, Intel before the Pentium Pro), and there are plenty out-of-order RISC processors: POWER4 and up, SPARC T4, RISC-V, many ARM Cortex models.
That article is incredibly stupid. From a user perspective, “single core” performance is irrelevant. What matters is single thread performance, and multi thread performance of the entire processor.
They presented an approach that was not cache-optimized, which also showed significant gains. And of course they used a lot of cores; they're comparing against a V100, which is a $7k GPU (maybe $5k if you're lucky).
It’s 44 threads (22 cores). They also compare with tensorflow cpu compiled with SIMD instruction sets on the same hardware.
What other optimizations would you like to see? I would expect the TensorFlow team to already pay pretty close attention to performance in CPU and GPU implementations, not to mention cuDNN and such...
Is there anything preventing the same or a similar algorithmic optimization from being implemented on the GPU though? IIUC, the new algorithm (on the CPU) was compared to an existing algorithm (both on the CPU and GPU).
Code: https://github.com/keroro824/HashingDeepLearning
Edit: Here is the new paper referred to https://arxiv.org/abs/2103.10891
Here is the new code: https://github.com/RUSH-LAB/SLIDE
The paper I linked above is the original paper, of which this new one is supposed to be an improvement.