Hacker News

I’m surprised that it’s only 30% cheaper than Nvidia. How come? This seems to indicate that the Nvidia premium isn’t as high as everyone makes it out to be.



30% is a conservative estimate (to be precise, we went with this benchmark: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blo...). However, the actual difference we observe ranges from 30% to 70%.

Also, calculating GPU costs is getting quite nuanced, with a wide range of prices (https://cloud-gpus.com/) and other variables that make it harder to do an apples-to-apples comparison.


Did you try running this task (finetuning Llama) on Nvidia GPUs? If so, can you provide details (which cloud instance, and how long it took)?

I’m curious about your reported 30-70% speedup.


I think you slightly misunderstood, and I wasn't clear enough—sorry! It's not a 30-70% speedup; it's 30-70% more cost-efficient. This is mainly due to non-NVIDIA chipsets (e.g., Google TPU) being cheaper, with some additional efficiency gains from JAX being more tightly integrated with XLA.

No, we haven't run our JAX + XLA on NVIDIA chipsets yet. I'm not sure if NVIDIA has good XLA backend support.
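For context, here's a minimal sketch (not their training code; the function and values are purely illustrative) of the JAX/XLA coupling being discussed: any `jax.jit`-decorated function is traced once and lowered to a single fused XLA program for whatever backend JAX finds at runtime.

```python
import jax
import jax.numpy as jnp

# jax.jit traces this Python function once and compiles it into one
# fused XLA computation for the available backend (TPU, GPU, or CPU).
@jax.jit
def scaled_sum(x, w):
    return jnp.sum(x * w)

x = jnp.arange(4.0)      # [0., 1., 2., 3.]
w = jnp.full((4,), 2.0)  # [2., 2., 2., 2.]

print(scaled_sum(x, w))          # 2*(0+1+2+3) = 12.0
print(jax.default_backend())     # which backend XLA compiled for
```

The same source runs unchanged on TPU or GPU; only the XLA backend differs, which is what makes the per-backend cost comparison meaningful.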


Then how did you compute the 30-70% cost efficiency numbers compared to Nvidia if you haven’t run this Llama finetuning task on Nvidia GPUs?


Check out this benchmark where they did an analysis: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blo....

At the bottom, it shows the calculations around the 30% cost efficiency of TPU vs GPU.

Our 30-70% range is based on numbers we collected from our own fine-tuning runs on TPU, compared against similar runs on NVIDIA (using other OSS libraries rather than our code).


It would be a lot more convincing if you actually ran it yourself and did a proper apples-to-apples comparison, especially since that's the whole idea behind your project.


It's also comparing prices on Google Cloud, which carries its own markup and is a lot more expensive than, say, RunPod. RunPod charges $1.64/hr for an A100 on secure cloud, while an A100 on Google is $4.44/hr. So in that context, a 30% price beat is actually a huge loss overall.
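To spell out that arithmetic (using only the hourly rates quoted above; rates are on-demand and change often):

```python
# Hourly rates quoted in the comment above
gcp_a100 = 4.44     # Google Cloud A100, $/hr
runpod_a100 = 1.64  # RunPod secure-cloud A100, $/hr

# A TPU that is "30% cheaper" than the GCP A100 baseline
tpu_effective = gcp_a100 * (1 - 0.30)

print(f"TPU effective rate: ${tpu_effective:.2f}/hr")   # $3.11/hr
print(f"RunPod A100 rate:   ${runpod_a100:.2f}/hr")     # $1.64/hr
print(f"TPU is still {tpu_effective / runpod_a100:.1f}x the RunPod price")  # 1.9x
```

In other words, beating Google's own A100 price by 30% still leaves you paying roughly double the cheapest A100 rental on the market.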


who trains on a100 at this point lol


It's the point of comparison chosen in the linked benchmark.


Totally agree, thanks for the feedback! This is one of the TODOs on our radar.


Nvidia's margin is something like 70%. Using Google TPUs is certainly going to erase some of that.


They sell cards and they are selling out



