HuggingFace Training Cluster as a Service (huggingface.co)
101 points by kashifr on Sept 5, 2023 | 45 comments


At the time of writing, the cost estimate for a 70B multimodal model trained on 7T tokens on 1,000 H100 GPUs is $18,461,354, with 184 days of training time.
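A quick back-of-envelope check of what that quote implies (the per-hour rate below is derived from the quoted figures, not a price HuggingFace states anywhere):

```python
# Implied per-GPU-hour rate from the quoted estimate.
gpus = 1_000
days = 184
total_cost = 18_461_354  # USD, from the quote above

gpu_hours = gpus * days * 24       # total H100-hours consumed
rate = total_cost / gpu_hours      # implied $/H100/hr
print(f"{gpu_hours:,} GPU-hours at ~${rate:.2f}/H100/hr")
```

That works out to roughly $4.18 per H100-hour, which is useful context for the cheaper-provider comparisons below.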

Is anyone willing to share an estimate of how the cost will come down each year as hardware keeps improving and new methodologies are found?

Personally, I would not be surprised if it were possible to train on the same dataset for half the cost 12 months from now.


It will not get cheaper until Nvidia is disrupted on the software side. There is already plenty of hardware that can do this more cheaply, starting, but not ending, with Google’s TPUs.


Correct.

It requires a breakthrough in software: new, efficient methods for training and fine-tuning these AI models. Currently there is no way around training the whole thing and burning millions in the process.

Until then, unless you are a big tech company that can eat the cost, it doesn't seem wise to burn your entire VC funding on expensive fine-tuning and inference costs as your AI model scales to millions of users.


I think this has a lot of potential https://www.modular.com/engine


At this point, I think there is sufficient motivation (dramatically high training costs) that we could see major algorithmic, architectural, and/or training methodology improvements at the code level that make these sorts of things possible on commodity hardware within a few years.

We're already starting to see that with a few projects and I think once the scale tips such that it becomes practical to train something of GPT 4 quality with < $10k, the main focus of current research will shift to generating new models trained on commodity hardware.

My true hope is that the entire problem domain eventually ends up falling within the range of commodity hardware and FANG finds it can't really add any value (other than perhaps convenience) regardless of their superior compute resources, resulting in massive democratization of this technology.

That will of course open things up and make LLMs more accessible to bad actors, but that is ultimately much better than the likes of FANG / OpenAI / etc. being the sole gatekeepers of this tech. Just like Google has very little real motivation to fight click-fraud (there have been rumors for years that it is responsible for a double-digit percentage of their revenue), these mega corporations will have very little real motivation to stop "bad actors" from paying to use their APIs. So the democratized situation is ultimately the less Orwellian one, since bad actors are going to use it either way.


You can train it at half the cost today if you use a LambdaLabs cluster at $1.89/H100/hr.

https://lambdalabs.com/service/gpu-cloud/reserved
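Taking that advertised rate at face value, the same 1,000-GPU, 184-day run prices out as follows (a rough sketch; it assumes identical throughput on their hardware):

```python
# Same GPU-hour budget as the HuggingFace quote, repriced at
# LambdaLabs' advertised reserved rate of $1.89 per H100-hour.
gpu_hours = 1_000 * 184 * 24       # from the estimate above
lambda_cost = gpu_hours * 1.89
print(f"${lambda_cost:,.0f}")      # vs. the $18,461,354 quote
```

That comes to about $8.35M, i.e. a bit under half the quoted figure.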


Well if you select "trainium nodes", it's already "only" $11,085,287.


Are there big reasons the training can’t be done SETI@home style? You could even pay people for the use of their graphics cards and run the training multiple times on different machines to make sure results weren’t being gamed.


There is that, I think it's https://vast.ai/ and pretty sure there is also a "community" one I've seen for gen AI but I can't remember the name.


AI Horde


Yes that's what I was thinking of, thanks! https://aihorde.net


Is this inference, not training?


Yes it's inference only (and usually pretty slow at that).


Training still relies on very low-latency connections between all the devices. When distributing training across multiple machines, most people use machines in close proximity connected via InfiniBand to get the lowest possible latency.

Going from that to the dozens to hundreds of milliseconds of latency on the internet, or the hours if you do classical SETI@Home, is a big step. There are people working on it though.


GPU memory bandwidth is a limiting factor for how fast training can happen, so it’s much more efficient to train models on locally connected high memory GPUs.

Also gradient updates from all nodes would need to get combined at least every few training steps, and it would take a while to sync all gradient updates across the network.
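That sync step can be sketched as a naive all-reduce mean: each worker computes a local gradient on its shard, and the averaged gradient must reach every node before the next optimizer step can proceed. A toy illustration in plain Python (the gradient values are made up):

```python
# Toy data-parallel sync: each worker holds its local gradient for the
# same parameters; training can only continue once every worker has the
# averaged gradient. Over the internet, this exchange is the latency-
# and bandwidth-bound step.
def all_reduce_mean(worker_grads):
    n = len(worker_grads)
    dim = len(worker_grads[0])
    # Sum corresponding components across workers, then average.
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

grads = [
    [0.2, -0.4],   # worker 0
    [0.4, -0.2],   # worker 1
    [0.6,  0.0],   # worker 2
]
avg = all_reduce_mean(grads)  # every worker applies this same update
```

In real systems this exchange is done with ring or tree all-reduce over NVLink/InfiniBand rather than a central gather, but the synchronization requirement is the same.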


There's Petals[0], but the problem seems to be that the entire training data needs to be loaded into VRAM and can't be split up across devices.

[0] https://github.com/bigscience-workshop/petals


In 10 years you will be able to do it at home on a machine that costs less than $5k


The fact that the GPU quantity dropdown cannot go over 1,000 drives home the "GPU poor" point from the SemiAnalysis post. Meta alone has 16,000 GPUs. OpenAI's cluster from 2020 had 10,000 GPUs. If you're serious about foundation model development and research, you have to go work at one of these "GPU rich" companies.


Or you can invent better models or discover more efficient ways to train existing ones. You know - do something other than dumb scaling up - like what Hinton (backprop, 1987), LeCun (convnets, 1989), or Vaswani, et al. (transformers, 2017) did.


I love this comment. Very HN. You’re absolutely right, everyone should just try to make paradigm shifts in the field.


The key word here is “try”. And we are not talking about “everyone”, just those who complain they don’t have access to $5k/hr GPU clusters.


"Or better, you can do the same thing that three people managed to do in the entire industry in the last 36 years."

I mean, don't get me wrong, I'm all for improvements in AI efficiency, but maybe there isn't that much low-hanging fruit to pick? Tons of papers get published on transformers optimization techniques and barely any of them seem to stick.


> do something other than dumb scaling up

This is exactly what people told OpenAI 8 years ago and look where we are now.


8 years ago the dumb scaling up was exactly what we needed. 8 years we’ve been riding that train. Don’t you think it’s time to try something new?


Given how expensive training is, my impression is that the world in 2023 generally cannot afford to experiment with custom-trained models; only well-funded organizations can do so within an acceptable margin of risk. The risk of spending $20MM on training a large model that doesn't produce the desired outcome is going to blow back far worse than engineering failing to deliver features on time. How are teams/orgs approaching model training risk management, as in managing the risk that a model fails to deliver after spending 20 Million on training?

Next thoughts are how to "SETI model training", distributing compute to idle resources around the world.


> The risk of spending $20MM on training a large model that doesn't produce the desired outcome is going to blow back far worse than engineering failing to deliver features on time. How are teams/orgs approaching model training risk management, as in managing the risk that a model fails to deliver after spending 20 Million on training?

This. Most startups claiming to be AI companies (90%) won't dare to train or fine-tune AI models due to the massive costs involved and will just take an off-the-shelf model from HuggingFace anyway.

But what the AI bros won't tell you is that there is an incredible amount of risk when it all goes wrong after training, as you pointed out. That is $20M down the drain if the results are sub-optimal, and it is even worse when the 'researchers' cannot explain the reasoning behind the 'AI' underperforming other than that it is just 'hallucinating' or flat out buggy.

This training route is only available to those who can afford to foot the cost, but it is still a giant waste of electricity and effort in the end, thanks to the decade-long inefficiencies and lack of better alternatives to these operations (training, fine-tuning, inference, etc.) in deep learning.


I think what I really want is turn-key fine-tuning for existing foundation models. But honestly, even that is probably 2 years away from being a really viable business. We lack sufficiently vetted, commercially licensed foundation models. We lack sufficiently available and moderated diverse datasets for fine-tuning. We probably lack sufficient businesses willing to take the early-adopter risk.

I'm planning an all-in strategy with AI but I believe the next 2 years will be lean. Hopefully by then the price for fine-tuning will have come down enough for medium sized businesses outside of the early adopter niche to give it a try. We'll have a couple of rounds of failures and successes so most people will have a decent roadmap to building successful products (and avoiding complete failures). We should also have a significant ecosystem of options in both OSS and commercial variations.

I feel like this is equivalent to the Internet in 1998. We're looking at the Yahoos, the AOLs, and the Pets.com crop of businesses. But things won't really heat up for a while. Still plenty of time to grow into this space.


The other day when they announced more funding, there was some speculation here about how they would make money, with someone suggesting it's by driving users to cloud GPU platforms (AWS, Azure). This supports that, and it suggests where they will end up, i.e. as a front end for Azure.

https://news.ycombinator.com/item?id=37250647


They should focus more on fine-tuning, I think. Fine-tuning is almost always better than pretraining, even if the pretraining dataset is very different from the fine-tuning dataset. If I could fine-tune a 30B model for $10 on a few tens of millions of tokens (basically proportional to the current rate), I would definitely use it.


You can already do that AFAIK. HuggingFace even provides some nice notebook examples of how to achieve it with AWS SageMaker and the HuggingFace libraries. You don't need anywhere near 100-1,000 GPUs to fine-tune, which makes it a much easier problem to just run on existing clouds.


I know, and I already use instances to train, but it would be a big improvement if all I needed to do was select HuggingFace datasets, click train, and get a model I could test in a playground.


> Train your LLM at scale on our infrastructure

Is it really their infrastructure or are they using a cloud provider and this wraps it up and provides convenience for a price?


Azure and such get such massive cost benefits from scale that HF's own GPUs would probably be more expensive anyway, even if they went AMD/Intel.

It does seem like they should run their own storage nodes, with the sheer quantity of models they host...


Everyone claims that, yet I have never seen it happen.

Typically, small companies get rebates on NVIDIA GPUs, but big established ones do not. So I would expect a startup with 100 GPUs to pay less per GPU than Azure.


I'd think "infrastructure" includes the nice front end and Python API that they have already proven themselves capable of pulling off.


What’s the difference?


You end up paying more in the latter instance.


Not counting the cost of learning how to cluster together 500 GPUs, the cost of learning how to train models efficiently on 500 GPUs, the cost of convincing a cloud provider to let you get 500 GPUs, the cost of trying to find a cloud provider that actually has 500 GPUs you can book, etc, etc.


lowest price from the dropdowns...$43k


So you can buy one H100 for your own server (if you can find one), or this.


They are not the only game in town. LambdaLabs will rent you an H100 for $1.99/h. Well, they won't give you 500 on those terms; they have their own reserved-cluster pricing. But between the big cloud providers and the specialized GPU providers, there's a lot of middle ground available for people who want smaller training runs (though availability is an issue).


What models would you train if you had the money for various price points?


I wonder what the multimodal model is. Flamingo?


Flamingo-style, see for instance the recently released IDEFICS: https://huggingface.co/blog/idefics


The lock-in attempts begin



