HuggingFace Training Cluster as a Service (huggingface.co)
101 points by kashifr on Sept 5, 2023 | 45 comments


At the time of writing, the cost estimate for a 70B multimodal model trained on 7T tokens on 1,000 H100 GPUs is $18,461,354, with 184 days of training time.
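A quick back-of-envelope check of what that quote implies (the per-hour rate below is derived from the quoted figures, not a price HuggingFace states anywhere):

```python
# Implied per-GPU-hour rate from the quoted estimate.
gpus = 1_000
days = 184
total_cost = 18_461_354  # USD, from the quote above

gpu_hours = gpus * days * 24       # total H100-hours consumed
rate = total_cost / gpu_hours      # implied $/H100/hr
print(f"{gpu_hours:,} GPU-hours at ~${rate:.2f}/H100/hr")
```

That works out to roughly $4.18 per H100-hour, which is useful context for the cheaper-provider comparisons below.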

Is anyone willing to share an estimate of how the cost will come down each year as hardware keeps improving and new methodologies are found?

Personally, I would not be surprised if it were possible to train on the same dataset for half the cost 12 months from now.


It will not get cheaper until Nvidia is disrupted on the software side. There is already plenty of hardware that can do this more cheaply, starting, but not ending, with Google’s TPUs.


Correct.

It requires a breakthrough in software: new, efficient methods for training and fine-tuning these AI models. Currently there is no way around training the whole thing and burning millions in the process.

Until then, unless you are a big tech company that can eat the cost, it doesn't seem wise to burn your entire VC funding on expensive fine-tuning and inference costs as your AI model scales to millions of users.


I think this has a lot of potential https://www.modular.com/engine


At this point, I think there is sufficient motivation (dramatically high training costs) that we could see major algorithmic, architectural, and/or training methodology improvements at the code level that make these sorts of things possible on commodity hardware within a few years.

We're already starting to see that with a few projects and I think once the scale tips such that it becomes practical to train something of GPT 4 quality with < $10k, the main focus of current research will shift to generating new models trained on commodity hardware.

My true hope is that the entire problem domain eventually ends up falling within the range of commodity hardware and FANG finds it can't really add any value (other than perhaps convenience) regardless of their superior compute resources, resulting in massive democratization of this technology.

That will of course open things up and make LLMs more accessible to bad actors, but that is ultimately much better than the likes of FANG / OpenAI / etc. being the sole gatekeepers of this tech. Just like Google has very little real motivation to fight click-fraud (there have been rumors for years that it is responsible for a double-digit percentage of their revenue), these mega corporations will have very little real motivation to stop "bad actors" from paying to use their APIs. So the democratized situation is ultimately the less Orwellian one, since bad actors are going to use it either way.


You can train it at half the cost today if you use a LambdaLabs cluster at $1.89/H100/hr.

https://lambdalabs.com/service/gpu-cloud/reserved
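Taking that advertised rate at face value, the same 1,000-GPU, 184-day run prices out as follows (a rough sketch; it assumes identical throughput on their hardware):

```python
# Same GPU-hour budget as the HuggingFace quote, repriced at
# LambdaLabs' advertised reserved rate of $1.89 per H100-hour.
gpu_hours = 1_000 * 184 * 24       # from the estimate above
lambda_cost = gpu_hours * 1.89
print(f"${lambda_cost:,.0f}")      # vs. the $18,461,354 quote
```

That comes to about $8.35M, i.e. a bit under half the quoted figure.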


Well if you select "trainium nodes", it's already "only" $11,085,287.


Are there big reasons the training can’t be done SETI@home style? You could even pay people for the use of their graphics cards and run the training multiple times on different machines to make sure results weren’t being gamed.


There is that, I think it's https://vast.ai/ and pretty sure there is also a "community" one I've seen for gen AI but I can't remember the name.


AI Horde


Yes that's what I was thinking of, thanks! https://aihorde.net


Is this inference, not training?


Yes it's inference only (and usually pretty slow at that).


Training still relies on very low-latency connections between all the devices. When distributing training across multiple machines, most people use machines in close proximity connected via InfiniBand to get the lowest possible latency.

Going from that to the dozens to hundreds of milliseconds of latency on the internet, or the hours if you do classical SETI@Home, is a big step. There are people working on it though.


GPU memory bandwidth is a limiting factor for how fast training can happen, so it’s much more efficient to train models on locally connected high memory GPUs.

Also gradient updates from all nodes would need to get combined at least every few training steps, and it would take a while to sync all gradient updates across the network.
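That sync step can be sketched as a naive all-reduce mean: each worker computes a local gradient on its shard, and the averaged gradient must reach every node before the next optimizer step can proceed. A toy illustration in plain Python (the gradient values are made up):

```python
# Toy data-parallel sync: each worker holds its local gradient for the
# same parameters; training can only continue once every worker has the
# averaged gradient. Over the internet, this exchange is the latency-
# and bandwidth-bound step.
def all_reduce_mean(worker_grads):
    n = len(worker_grads)
    dim = len(worker_grads[0])
    # Sum corresponding components across workers, then average.
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

grads = [
    [0.2, -0.4],   # worker 0
    [0.4, -0.2],   # worker 1
    [0.6,  0.0],   # worker 2
]
avg = all_reduce_mean(grads)  # every worker applies this same update
```

In real systems this exchange is done with ring or tree all-reduce over NVLink/InfiniBand rather than a central gather, but the synchronization requirement is the same.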


There's Petals[0], but the problem seems to be that the entire training data needs to be loaded into VRAM and can't be split up across devices.

[0] https://github.com/bigscience-workshop/petals


In 10 years you will be able to do it at home on a machine that costs less than $5k


The fact that the GPU quantity dropdown cannot go over 1,000 drives home the "GPU poor" point from the SemiAnalysis post. Meta alone has 16,000 GPUs. OpenAI's cluster from 2020 had 10,000 GPUs. If you're serious about foundation model development and research, you have to go work at one of these "GPU rich" companies.


Or you can invent better models or discover more efficient ways to train existing ones. You know - do something other than dumb scaling up - like what Hinton (backprop, 1987), LeCun (convnets, 1989), or Vaswani, et al. (transformers, 2017) did.


I love this comment. Very HN. You’re absolutely right, everyone should just try to make paradigm shifts in the field.


The key word here is “try”. And we are not talking about “everyone”, just those who complain they don’t have access to $5k/hr GPU clusters.


"Or better, you can do the same thing that three people managed to do in the entire industry in the last 36 years."

I mean, don't get me wrong, I'm all for improvements in AI efficiency, but maybe there isn't that much low-hanging fruit to pick? Tons of papers get published on transformers optimization techniques and barely any of them seem to stick.


> do something other than dumb scaling up

This is exactly what people told OpenAI 8 years ago and look where we are now.


8 years ago the dumb scaling up was exactly what we needed. 8 years we’ve been riding that train. Don’t you think it’s time to try something new?


Given how expensive training is, my impression is that the world in 2023 generally cannot afford to experiment with custom-trained models; only well-funded organizations can do so within an acceptable margin of risk. The risk of spending $20MM on training a large model that doesn't produce the desired outcome is going to blow back far worse than engineering failing to deliver features on time. How are teams/orgs approaching model training risk management, as in managing the risk that a model fails to deliver after spending 20 Million on training?

Next thoughts are how to "SETI model training", distributing compute to idle resources around the world.


> The risk of spending $20MM on training a large model that doesn't produce the desired outcome is going to blow back far worse than engineering failing to deliver features on time. How are teams/orgs approaching model training risk management, as in managing the risk that a model fails to deliver after spending 20 Million on training?

This. Most startups claiming to be AI companies (90%) won't dare to train or fine-tune AI models due to the massive costs involved and will just take an off-the-shelf model from HuggingFace anyway.

But what the AI bros won't tell you is that there is an incredible amount of risk when it all goes wrong after training, as you pointed out. That is $20M down the drain if the results are sub-optimal, and it is even worse when the 'researchers' cannot explain the reasoning behind the 'AI' underperforming other than that it is just 'hallucinating' or flat out buggy.

This training route is only available to those who can afford to foot the cost, but it is still a giant waste of electricity and effort in the end, thanks to the decade-long inefficiencies and lack of better alternatives to these operations (training, fine-tuning, inference, etc.) in deep learning.


I think what I really want is turn-key fine-tuning for existing foundation models. But honestly, even that is probably 2 years away from being a really viable business. We lack sufficiently vetted, commercially licensed foundation models. We lack sufficiently available and moderated diverse datasets for fine-tuning. We probably lack sufficient businesses willing to take the early-adopter risk.

I'm planning an all-in strategy with AI but I believe the next 2 years will be lean. Hopefully by then the price for fine-tuning will have come down enough for medium sized businesses outside of the early adopter niche to give it a try. We'll have a couple of rounds of failures and successes so most people will have a decent roadmap to building successful products (and avoiding complete failures). We should also have a significant ecosystem of options in both OSS and commercial variations.

I feel like this is equivalent to the Internet in 1998. We're looking at the Yahoos, the AOLs, and the Pets.com crop of businesses. But things won't really heat up for a while. Still plenty of time to grow into this space.


The other day when they announced more funding, there was some speculation here about how they would make money, with someone suggesting it's by driving users to cloud GPU platforms (AWS, Azure). This supports that, and it suggests where they will end up, i.e. as a front end for Azure.

https://news.ycombinator.com/item?id=37250647


They should focus more on fine-tuning, I think. Fine-tuning is almost always better than pretraining, even if the pretraining dataset is very different from the fine-tuning dataset. If I could fine-tune a 30B model for $10 on a few tens of millions of tokens (basically proportional to the current rate), I would definitely use it.


You can already do that AFAIK. HuggingFace even provides some nice notebook examples of how to achieve it with AWS SageMaker and the HuggingFace libraries. You don't need anywhere near 100-1,000 GPUs to fine-tune, which makes it a much easier problem to just run on existing clouds.


I know, and I already use instances to train, but it would be a big improvement if all I needed to do was select HuggingFace datasets, click train, and get a model I could test in a playground.


> Train your LLM at scale on our infrastructure

Is it really their infrastructure or are they using a cloud provider and this wraps it up and provides convenience for a price?


Azure and such get such massive cost benefits from scale that HF's own GPUs would probably be more expensive anyway, even if they went AMD/Intel.

It does seem like they should run their own storage nodes, with the sheer quantity of models they host...


Everyone claims that, yet I have never seen it happen.

Typically, small companies get rebates on NVIDIA GPUs, but big established ones do not. So I would expect a startup with 100 GPUs to pay less per GPU than Azure.


I'd think "infrastructure" includes the nice front end and Python API that they have already proven themselves capable of pulling off.


What’s the difference?


You end up paying more in the latter instance.


Not counting the cost of learning how to cluster together 500 GPUs, the cost of learning how to train models efficiently on 500 GPUs, the cost of convincing a cloud provider to let you get 500 GPUs, the cost of trying to find a cloud provider that actually has 500 GPUs you can book, etc, etc.


lowest price from the dropdowns...$43k


So you can buy one H100 for your own server (if you can find one), or this.


They are not the only game in town. LambdaLabs will rent you an H100 for $1.99/h. Well, they won't give you 500 on those terms; they have their own reserved-cluster pricing. But between the big cloud providers and the specialized GPU providers, there's a lot of middle ground available for people who want smaller training runs (though availability is an issue).


What models would you train if you had the money for various price points?


I wonder what the multimodal model is. Flamingo?


Flamingo-style, see for instance the recently released IDEFICS: https://huggingface.co/blog/idefics


The lock-in attempts begin



