Ok so they claim in the article, 50000 TPU’s is equivalent to 10 exaflop floatin...

rwitten · on Nov 10, 2023

Hey! I'm an contributor on this (Rafi Witten), all opinions my own.

You're asking the right question but I think the math is off by a bit. The equivalent number on the H100's is 989 TFLOP/s/chip so the equivalent job is ~10K H100's = (10 * 10^18) / (989 * 10^12). (Both chips also have 8-bit acceleration!)

I believe this is the largest ML job both by exaflops and number of chips every demonstrated. Other companies own more chips or exaflops than we show in this job but getting all the hardware working at once on a single job is a different matter! :-)

sashank_1509 · on Nov 10, 2023

I think your math is also slightly off, in the Google article, it claims “that is capable of achieving 10 exa-FLOPs (16-bit).” , so you should be comparing with 16 bit operations from a H100.

989 is TF32 core, for 16 bit it is 1979, so I guess around 5000 H100’s in a single training job would be equivalent to the training job mentioned in this article.

Either way I actually would not be surprised if OpenAI has launched a single job on more than 10k GPU’s, but I also am not very knowledgeable on practical scaling. Congrats on the feat!

aschleck · on Nov 10, 2023

1979 16 bit flops on an H100 is with sparsity. See footnote 2 on https://www.nvidia.com/en-us/data-center/h100/. You should be halving it for non-sparse flops.

YetAnotherNick · on Nov 11, 2023

GP is correct. With sparsity it is 3958. 1979 Tflop/s is without sparsity.

emu · on Nov 11, 2023

No, it is not. That's the sparse fp8 flop number, but you need to ignore sparsity and compare bf16 flops not fp8 flops for the comparison the ancestor post is making.

latchkey · on Nov 10, 2023

I'd love to hear more about the challenges of getting the hardware working.

aschleck · on Nov 10, 2023

It's worth noting that just because an H100 has a higher flops number doesn't mean your program is actually hitting that number of flops. Modern TPUs are surprisingly competitive with Nvidia on a perf/$ metric, if you're doing cloud ML they are absolutely worth a look. We have been keeping costs down by racking our own GPUs but TPUs are so cost effective that we need to do some thinking about changing our approach.

I'm not certain but I think part of this is that XLA (for example) is a mountain of chip-specific optimizations between your code and the actual operations. So comparing your throughput between GPU and TPU is not just flops-to-flops.

lern_too_spel · on Nov 10, 2023

This is a blog post from Google Cloud marketing. It's saying that you, too, could train an LLM on Google Cloud if you hand them enough money. You can't do that on Inflection's or Tesla's clusters. Similar marketing blog post from last year: https://cloud.google.com/blog/products/compute/calculating-1...

The PaLM paper linked in the blog post is about how to get something actually useful out of that compute.

latchkey · on Nov 10, 2023

It sounds like they partnered with CoreWeave to use their equipment and it is "only" 3500 gpus so far.

https://inflection.ai/inflection-ai-announces-1-3-billion-of...

https://inflection.ai/nvidia-coreweave-mlperf

abatilo · on Nov 11, 2023

https://youtu.be/z3hmfSVmyqg?si=eLPZ0D6ug3D6PreI

As of 2 months ago, they had at least 7000 up and running, fwiw

latchkey · on Nov 11, 2023

The meat starts here:

https://youtu.be/z3hmfSVmyqg?feature=shared&t=3328

I'm curious why it is so hard for them to deploy the compute. They seem to be fairly behind schedule.

marmaduke · on Nov 10, 2023

> That is equivalent

On what number or op for the h100?