I'm a little confused here, are you saying that ML ASICs can't beat compute per $ of GPUs? That seems, on its face, to be a ridiculous assertion, so I'm confused where I'm misunderstanding you.
No, they can't when you can get tens of TFLOPS per GPU for <$1000 that comes with a solid software ecosystem for all the major AI frameworks out of the box. That's the power of the gaming/industrial complex: NVDA can fund the development of AI chips and software from the nearly bottomless pockets of millions of gamers and cryptocurrency miners. ASIC startups figuratively have one bullet in their gun and they have to hit the bullseye to live.
Now when a Tesla GPU costs $9-10K instead of <$1000, that's a somewhat different story, but even then, NVDA tapes out a brand new architecture almost annually. Good luck keeping up with that pace, ASIC startups. And that's exactly what happened to Nervana. Their ASIC was better than a Pascal GP100, but it's clobbered by the Volta V100. So at best you get a 6-12 month lead on them.
In contrast, if you can right-size the transistor count for expensive and very specific workloads across 100K+ machines, as companies with big datacenters and big data sets can, I see an opportunity for all of the BigCos to build custom ASICs for their siloed data sets and graphs. That's what GOOG is doing and it seems to be working for them so far. FB is now hiring to do the same thing, I suspect.
Yes but over the lifetime of a GPU, you'll spend more on power draw than the physical hardware. That's where the savings come from, or at least that's what I've been told.
A V100 costs ~$300/yr in electricity. If you are buying at the scale of 100k units and can cut the price per operation, even by just 10% (for example, by dropping features you don't care about), that's millions of dollars of electricity over the lifetime of your hardware.
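A back-of-the-envelope version of that arithmetic, using the figures from the comment (~$300/yr per card, 100k cards, a 10% saving). The 3-year hardware lifetime and the function name are my own assumptions, not something stated in the thread.

```python
def fleet_electricity_savings(units, dollars_per_unit_year, savings_fraction, years):
    """Dollars of electricity saved across a fleet over a number of years."""
    return units * dollars_per_unit_year * savings_fraction * years


UNITS = 100_000            # fleet size from the comment
DOLLARS_PER_YEAR = 300.0   # rough per-card electricity cost from the comment
SAVINGS = 0.10             # 10% cheaper per operation, from the comment
LIFETIME_YEARS = 3         # assumption: not given in the thread

per_year = fleet_electricity_savings(UNITS, DOLLARS_PER_YEAR, SAVINGS, 1)
lifetime = fleet_electricity_savings(UNITS, DOLLARS_PER_YEAR, SAVINGS, LIFETIME_YEARS)

print(f"saved per year:      ${per_year:,.0f}")
print(f"saved over lifetime: ${lifetime:,.0f}")
```

Under those numbers the saving is on the order of a few million dollars a year, so the result is sensitive mostly to the per-card electricity estimate.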
At a high level there is a design tradeoff where you put your transistors for a given chip. For a dense linear algebra/tensor processor, it basically comes down to using your transistors for memory or compute.
GPUs (and DSPs) historically are way on the compute side. You get kilobytes of on chip memory, and really fat parallel buses to off chip RAM.
On the other end, you have some chips that put more memory near the compute. It means you have less compute, but way better power efficiency. Each hop from a register to cache to off chip to between boards to between nodes is roughly a 2-10x power hit, so you get orders of magnitude here.
In terms of training, it's really hard to fit your training data on chip. In which case you end up with an architecture very similar to a GPU or DSP/TPU. NVIDIA is no slouch here. A couple years ago the big trick was reducing precision - you don't need single- or double-precision floats during training, so an architecture more specialized for fp16 or int8 would get some big savings for power and throughput. NVIDIA is playing this game too. At Google/FB scale tweaks to improve cost may make sense but it doesn't seem like architecturally speaking there is some major design decision being left on the table anymore (I'd love to hear a specific counterpoint though!).
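To see how "2-10x per hop" compounds into orders of magnitude, here is a small sketch. It takes the four hops listed above and simply multiplies the per-hop factor; treating every hop as the same factor is a simplifying assumption of mine, and the names are illustrative.

```python
# The four hops named in the comment, from closest to farthest from the ALU.
HOPS = [
    "register -> cache",
    "cache -> off-chip RAM",
    "off-chip RAM -> another board",
    "another board -> another node",
]


def cumulative_power_factor(per_hop_factor, n_hops):
    """Relative cost after n_hops if each hop multiplies cost by per_hop_factor."""
    return per_hop_factor ** n_hops


low = cumulative_power_factor(2, len(HOPS))    # optimistic end: 2x per hop
high = cumulative_power_factor(10, len(HOPS))  # pessimistic end: 10x per hop

for i, hop in enumerate(HOPS, start=1):
    print(f"{hop:32s} {cumulative_power_factor(2, i):>6}x .. {cumulative_power_factor(10, i):>6}x")
print(f"register vs. remote node: {low}x to {high}x")
```

Even at the optimistic end, data fetched from another node costs more than ten times what a register access does, which is the argument for keeping weights next to the compute.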
In terms of inference, there can be some big power savings putting the weights next to compute. This isn't rocket science and folks have been doing this for a while. In a strange way it can sometimes be more efficient to use a CPU, with fat caches.
>At Google/FB scale tweaks to improve cost may make sense but it doesn't seem like architecturally speaking there is some major design decision being left on the table anymore (I'd love to hear a specific counterpoint though!).
I'd expect if I knew any, they'd be under NDA. I'll just point out that a GPU, even one with specific "ML cores" as NVIDIA calls them, is going to have a bunch of silicon that is being used inefficiently (for more "conventional" GPU uses). There's room for cost saving there. Perhaps NVIDIA eventually moves into that space and produces ML-only chips, but they don't appear to be heading in that direction yet.
There are many researchers not bound by NDAs making novel chip architectures, some of which are on HN.
Secondly, GPUs don't really have very much hardware specialized for graphics anymore. If we called it a TPU I'm not sure you'd be making the same point :P At a company like MS/FB/Google, where you don't need to sell the same chip for gaming/VR/ML/mining to get economy of scale, you can, like you said, reduce your transistor count and have the same compute fabric. This would reduce your idle power consumption through leakage, but you wouldn't expect a huge drop in power during active compute. Because the smaller precision compute ends up increasing the compute per byte, you either need to find more parallelism to RAM, get faster RAM, reduce the clock rate of compute, or reduce the number of compute elements to find a balanced architecture with lower precision. If you just shrink the number of compute elements - voila! - you're close to what NVIDIA is doing with ML cores.
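A rough way to picture the balancing act described above is the ratio of peak compute to memory bandwidth (ops per byte). The sketch below uses made-up round numbers, not figures for any real chip, and a hypothetical 4x throughput gain from lower precision; it only shows which knobs restore the original ratio.

```python
def machine_balance(peak_ops_per_s, mem_bytes_per_s):
    """Ops the chip can issue per byte it can pull from RAM."""
    return peak_ops_per_s / mem_bytes_per_s


# Hypothetical baseline chip (illustrative numbers only).
BASE_PEAK = 10e12        # ops/s
BASE_BANDWIDTH = 1e12    # bytes/s
LOW_PRECISION_GAIN = 4   # assumed throughput multiplier from lower precision

base = machine_balance(BASE_PEAK, BASE_BANDWIDTH)
unbalanced = machine_balance(BASE_PEAK * LOW_PRECISION_GAIN, BASE_BANDWIDTH)

# Lever 1: more or faster RAM. Bandwidth needed to get back to the base ratio.
bandwidth_needed = (BASE_PEAK * LOW_PRECISION_GAIN) / base

# Lever 2: fewer compute elements or a lower clock. Fraction of compute to keep.
compute_fraction = base / unbalanced

print(f"baseline balance:       {base:.0f} ops/byte")
print(f"after precision change: {unbalanced:.0f} ops/byte")
print(f"bandwidth to rebalance: {bandwidth_needed:.1e} bytes/s")
print(f"or keep this fraction of compute: {compute_fraction:.2f}")
```

With these assumed numbers, you either quadruple the memory bandwidth or keep a quarter of the compute, which is the "shrink the number of compute elements" option from the comment.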
> Secondly, GPUs don't really have very much hardware specialized for graphics anymore. If we called it a TPU I'm not sure you'd be making the same point :P
I think this is semantics. Modern GPUs do have a lot of hardware that isn't specialized for machine learning. My limited knowledge says that very recent NVIDIA GPUs have some things that vaguely resemble TPU cores ("Tensor cores"), but they also have a lot of silicon for classic CUDA cores. Which I was calling "graphics" hardware, but might better be described as "silicon that isn't optimized for the necessary memory bandwidth for DL". So it's still used non-optimally.
To be clear, you can still use the CUDA cores for DL. We did that just fine for a long time; they're just decidedly less efficient than Tensor cores.