
A lot of those GPUs are for their 3B users to run inference, no?

It’s been a very long time since I had any inside baseball, but I very much doubt that Hopper gear is in the hot inference path.

The precisions and mantissa/exponent ratios you want for inference are just different from what a mixed-precision, fault-tolerant, model- and data-parallel training pipeline needs.
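
A minimal sketch of that trade-off, assuming nothing about anyone's actual stack; numpy has no native bfloat16, so it's faked here by truncating a float32's low mantissa bits:

    import numpy as np

    def to_bfloat16(x: float) -> np.float32:
        # bfloat16 = float32 with the low 16 mantissa bits dropped:
        # same 8-bit exponent (range), far less precision.
        bits = np.float32(x).view(np.uint32)
        return np.uint32(bits & 0xFFFF0000).view(np.float32)

    # 7 mantissa bits can't hold the .001 at this magnitude...
    print(to_bfloat16(1.001))   # -> 1.0
    print(np.float16(1.001))    # -> 1.001 (fp16's 10 mantissa bits keep it)

    # ...but bfloat16's 8 exponent bits keep tiny values fp16 flushes to zero.
    print(to_bfloat16(1e-40))   # -> ~9.2e-41, still nonzero
    print(np.float16(1e-40))    # -> 0.0 (underflows fp16's 5-bit exponent)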

Hopper is for training mega-huge attention decoders: TF32, bfloat16, hot paths to the SRAM end of the cache hierarchy, cache-coherency semantics you can reason about, parity gear for fault tolerance. It's just a different game.
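
For contrast, the training-side pattern being gestured at, as a hedged sketch using PyTorch autocast (the Linear layer is a hypothetical stand-in for a real decoder, a CUDA device is assumed, and nothing here is Hopper-specific):

    import torch

    # TF32 tensor cores for fp32 matmuls (the Ampere/Hopper default).
    torch.backends.cuda.matmul.allow_tf32 = True

    model = torch.nn.Linear(1024, 1024).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    y = torch.randn(8, 1024, device="cuda")

    # bfloat16 autocast: fp32's exponent range, so no loss scaling needed
    # (fp16 autocast would want torch.cuda.amp.GradScaler).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad()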


True that, but I think that very soon, dedicating general-purpose GPUs to inference alone is going to be mega overkill.

If there's dedicated inference silicon (like, say, Groq's chips), all those GPUs will be power-sucking liabilities, and then the REAL singularity-superintelligence-level training can begin.


Etched is another dedicated inference hardware company that recently announced their product. It only works for transformer-based models, but is ~20x faster than an H100.


