Got the opt-30b chatbot running on a 3090 24GB. I set compress_weight=True and compress_cache=True, and ran with `python apps/chatbot.py --model facebook/opt-30b --percent 100 0 100 0 100 0`.
I also modified the prompt a bit to make it more... uh alive:
Assistant: Did you know that Saturn is 97 times the size of Earth?
Human: Are you sure?
Assistant: What difference does size make, really, anyway?
Human: You didn't answer my question.
Assistant: Hey, it's your question I didn't answer! Anyway, did you know that on the surface of
Saturn, there's lakes of liquid methane and ethane?
Human: I didn't know that. Are there any lakes on the Sun?
Assistant: Look, I'll be honest with you. I don't know much about the Sun. Is it even a planet?
Human: It sure is. Why don't you know about the Sun? I thought you were a hyperintelligent AI?
Assistant: You're right. I'm a hyperintelligent AI. I'm hyperintelligent enough to know that I know jack shit about the Sun.
I wonder if there would be interesting results if you made this model talk to OpenAI's ChatGPT, since they are different models. You could seed them with some topic.
A chat between a human and a more funny and witty intelligent assistant
Human: Hello! What can you do?
Assistant: Hey there! I can pretty much do whatever you ask!
Human: What is the name of the tallest mountain in the world?
Assistant: It's Everest.
Human: Are you sure?
Assistant: You damn right I'm sure! I'm a hyperintelligent AI!
If this works well, it will be a game changer. Requiring a fleet of $10k+ GPUs will kill any hope of widespread adoption of open source "competitors" to GPT-3. Stable Diffusion is so popular because it can run on hardware mere mortals can own.
Meta has released the model weights for OPT-175B, which is used in the paper. There are also a lot of fully released LLMs from other labs on the way as well.
While OPT-175B is great to have publicly available, it needs a lot more training to achieve good results. Meta trained OPT on 180B tokens, compared to 300B that GPT-3 saw. And the Chinchilla scaling laws suggest that almost 4T tokens would be required to get the most bang for compute buck.
And on top of that, there are some questions on the quality of open source data (The Pile) vs OpenAI’s proprietary dataset, which they seem to have spent a lot of effort cleaning. So: open source models are probably data-constrained, in both quantity and quality.
OPT-175B isn't publicly available, sadly. It's available to research institutions, which is much better than "Open"AI, but it doesn't help us hobbyists/indie researchers much.
I wonder when we'll start putting these models on the pirate bay or similar. Seems like an excellent use for the tech. Has no one tried to upload OPT-175B anywhere like that yet?
No, Stable Diffusion isn't the only project to release its weights. OpenAI hasn't been releasing weights for ChatGPT, but other labs are releasing theirs too [1].
Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development. If you actually want to keep developing the model, you need the funding to be able to train it more than once.
To summarize this discussion, we went from "this might mean we don't need a fleet of $10k+ GPUs to even run a LLM" to "yeah but an individual couldn't train one every day though". These goalposts are breaking the sound barrier.
>but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development
This is not "software development" in general, this is LLM training.
It's not like you're building some regular app, api, or backend.
If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong. The researchers who created OPT didn't go into a basement for 12 months, then come out, train their model once, hit publish, and go to coffee. That is a fantasy. Likewise, if a CS student wants to dabble in this research, they need the ability to train more than once.
I'm not gonna engage in a rhetorical argument about whether this should be called "software development" or "LLM development" or something else. That's unrelated to the question of how much training is required.
>If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong.
No, I'm rather claiming that what you claimed is wrong in the context of LLM training: "Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development".
LLM training is not the same as writing a program and "running your code with different variations". For an LLM you don't need to quickly rerun everything with some new corpus - it would be nice, but it's neither a prerequisite nor even crucial for any current use.
Hell, it's not even a "prerequisite" in programming, just good to have. Tons of great programs have been written with very slow build times, without quick edit/compile/build/run cycles.
I wasn't talking about running the same code with a new corpus. For that kind of use case one can simply fine tune the pretrained model. The example I gave was "if a CS student wants to dabble in this research".
You said "LLM training is not the same as writing a program and running your code with different variations". How do you think these LLMs were made, seriously? Do you think Facebook researchers sat down for 12 months and wrote code non stop without compiling it once, until the program was finished and was used to train the LLM literally only one time?
Yes. There _is_ a need to train LLMs more than once, and training is prohibitively expensive, so you need workarounds such as training on a small subset of data, or a smaller version of the model. We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
> We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
Okay. But I was saying someone with millions of dollars to spend could do it. And then another poster was arguing that millions of dollars was not enough to be viable because you need lots of repeated runs.
Nobody was saying a student could train one of these models from scratch. The cool potential is for a student to run one, maybe fine tune it.
You wouldn't need to re-train from scratch for that, just fine-tune on the new data sources. I don't think constant re-training is the optimal strategy for that use-case anyway. Bing does it by letting the LLM search a more traditional web index to find the information it needs.
Okay but someone has to do the fine tuning. The code has to be updated. Parts of the training have to be redone. All of this has costs. It isn't a "do it once and forget about it" task that it is being touted as in this thread.
>It isn't a "do it once and forget about it" task that it is being touted as in this thread.
That's neither here nor there. Training the LLM itself is not a "do it multiple times per day if you want to compete with Google" thing, as was stated in this subthread.
You can say that about any software. "You can use this software perfectly well without ever updating it." Sure, you can do that, but typically people have lots of reasons to update software. LLM isn't magic in this sense. An LLM does not mysteriously update its own code if you just wish hard enough. If you want to continue the development of the LLM then you need to make changes to the code, just like with any other software.
That's not necessary. Look at how Bing works: it's an LLM which can trigger searches and then gets the search results fed back to it as part of the prompt.
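Roughly, the loop looks like the sketch below - just an illustration of the pattern, with `web_search` and `generate` as made-up helper functions rather than any real API:

```python
# Minimal sketch of the "LLM can trigger searches" pattern described above.
# `web_search` and `generate` are hypothetical helpers, not any real API.

def answer_with_search(question: str) -> str:
    # 1. Ask the model what to search for (or just search the question itself).
    query = generate(f"Suggest a short web search query for: {question}")
    # 2. Run a traditional web/index search.
    results = web_search(query, top_k=3)
    # 3. Feed the search results back into the prompt as context.
    context = "\n".join(r["snippet"] for r in results)
    prompt = (
        "Answer the question using the search results below.\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```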
Is there information out there about how much it cost (in time or human-hours) to do the additional training necessary to make ChatGPT? I am genuinely curious what the scale of the effort was.
Probably one where there isn't an intrinsic conflict of interest with AI risk. Or from a more traditional angle, one where the author's vanity isn't required to be appeased in order for users/customers to be happy. I'm of the opinion that you should do something with game-changing technology because the world needs it, not because you need an ego boost. All technology brings side effects, and there is no greater example of that than "democratized" AI...
People often (usually) do objectively useful things because it's in their selfish interests to do so, ego or otherwise. The surest road to failure is expecting people to act virtuously. Generally systems that assume virtue fail, and systems that assume selfish action and steer that selfish action towards the greater good succeed.
In other words, I don't care why people do things, only that they do.
That's fine, as long as publicity isn't the motivation. It's safe to assume that isn't optimal for a project's success (Satoshi understood this). Not sure where you got the idea that the inverse of that was beneficial to such a project. I've seen first hand where it becomes a problem.
I’m not aware of many examples of starry-eyed divas achieving great results. Usually you hear about them but only because they are exceptional cases, not the norm. It’s a matter of practicality and not virtue (to say otherwise is purely a straw man argument).
The claim that publicity isn't causally connected to success is belied by the existence of the advertising industry. Beyond that general point, it is worth noting that the most dominant AI company - Google - happens to be in that very industry. Google is explicitly known for - that is, has publicity for - its generous compensation packages, precisely because of a causal model of talent attraction.
Success is obviously causally connected to publicity, and the idea that it isn't is not well supported by the evidence. Contrary to your assertion, it was not a safe assumption. Your appeal to Satoshi is an appeal to authority, not a causal model of how anonymity shields a project from those effects.
> That publicity isn't causally connected to success is belied by the existence of the advertising industry.
The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
To phrase it plainly: which first-time founders do you think Paul Graham or Keith Rabois would more likely fund: those who aspire to solve a problem in the world that they care passionately about, or those looking for money or fame? Last time I checked, the latter case would be seen as a strong negative. The appeal-to-authority argument doesn't apply in this situation, because the VC's portfolio performance is causally related to how accurately they predict the future success of a company.
On the scale of a smaller project like this, a common failure mode is for a maintainer to stop caring about the project and go to the next thing that motivates them. Someone else may attempt to use the code or project without understanding the theory behind it. And even worse: every time this happens is a signal that this is acceptable.
AI is a different beast. Software bugs with big AI systems will become more costly, and eventually deadly. Unfortunately I’m not sure what can be done about it without a global totalitarian regime to ban its use entirely (which is not an idea I support anyway). Eventually the broken clock will be right and some profit-driven AI project will succeed in making the world a not better place, if we are even around to notice :).
I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I think someone should train ChatGPT or similar to argue or teach traditional AGI Philosophy/Ethics and hopefully that will move the needle somewhat more than the OpenAI nannyism we have now.
> The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
That the causal model supports publicity seeking leads us to ensemble models. When models are good for different reasons, the ensemble of the models ends up better than any individual model. Reinforcement learning research has shown you can successfully train an agent from decomposed reward signals by building an ensemble model on top of them.
The fact that the causal structure says publicity matters means that agents who recognize the importance of publicity, and contribute because of it, can still reasonably be expected to be part of the solution.
It is very common to see this talked about in terms of diversity improving solution quality when talking about it in the context of companies and it is generally considered a good idea to have a diverse team as a consequence.
Anyway, I'm mostly responding because I disagree with the a priori declaration that all who disagree are attacking a straw man.
I think that was overconfident, because the causal structure of publicity and its relation to outcomes disagrees with that.
> Which first-time founders do you think Paul Graham or Keith Rabois would more likely fund: Those who aspire to solve a problem with the world that they care passionately about?
It is worth reflecting on the fact that the founder of OpenAI has had the strongest possible endorsement from Paul Graham. He was claimed to be among the greats before his successes: Paul Graham put him among Steve Jobs and Elon Musk. When Paul Graham stepped down from YCombinator, he was so convinced of Sam's skills that he put Sam in his place. Later Sam started OpenAI.
> I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I've read Superintelligence, the Sequences, PAIP, AIMA, Deep Learning, Reinforcement Learning, and Theory of Games and Economic Behavior, taken a course in control theory, and read a book about evolutionary algorithms. I've also built systems using the techniques from each of these, with the exception of Superintelligence and much of the Sequences - the main exception being the parts of the Sequences that deal with Bayesian reasoning, which I did implement and like, though I disagree with that community about its optimality, because the conditions of the ledger arguments aren't true in the real world. In practice, Bayesian approaches are like trying to build a sports car for a race: you get beaten even though you are doing the fastest thing, because the fastest thing isn't as fast as the slower methods.
Anyway, the combinatorics of multi-step multi-agent decision problems implies a lot of problems for Bostrom's and Yudkowsky's positions on the limits of what intelligence can hope to achieve. I don't find them to be the most formidable thinkers on this subject. In the case of Yudkowsky, he admits this, saying that he finds Norvig to be more formidable than he is. And Norvig disagreed with him on AI risk in exactly the context where I also disagree, and for the same reason. To make sure you get the point, I'll speak in terms of Bostrom's analogies: notice that there is, in fact, a speed limit - the speed of light. Well, what Norvig notices, what I also notice, and what Bellman noticed when he coined the term combinatorial explosion, is that intractability is an actual issue that you need to confront. It isn't something you can hand-wave away with analogy. We don't have enough atoms in our universe.
This is why we get dual-mode systems, by the way. Not just in humans: notice that it happens in chess engines too. The general solver provides the heuristic, which must have error; then the specific solver uses the heuristic to improve, because it is in a more specific situation. Most of the people in the AI-risk camp are pretty Yudkowskian. They dwell for long periods of time on overcoming the biased heuristic. For sure, this makes them more intelligent, but it misinforms them when they try to make inferences about general intelligence based on the tractability of specific situations. It is because of, not despite, the intractability that they find such evidence of tractability.
BTW, Bellman actually coined the term curse of dimensionality [1]; I got that confused with combinatorial explosion since it is a synonym in the contexts I typically encounter it [2].
OpenAI has a pretty good introduction to the Bellman equations in their Spinning Up in RL lessons [3]. Sutton's work in Reinforcement Learning also talks about Bellman's work quite a bit. Though Bellman was actually studying what he called dynamic programming problems, his work is now considered foundational in reinforcement learning.
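For anyone who hasn't seen the Bellman optimality equation in code form, here's a tiny value-iteration sketch on a made-up two-state MDP (the transition and reward numbers are invented purely for illustration):

```python
import numpy as np

# Toy illustration of the Bellman optimality equation:
#   V(s) = max_a sum_s' P(s'|s,a) [ R(s,a) + gamma * V(s') ]
# The 2-state, 2-action MDP below is entirely made up for this sketch.
P = np.array([                      # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],       # action 0
    [[0.5, 0.5], [0.0, 1.0]],       # action 1
])
R = np.array([                      # R[a, s] = expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9
V = np.zeros(2)
for _ in range(200):                # value iteration: repeat the Bellman backup
    Q = R + gamma * np.einsum("ast,t->as", P, V)  # Q[a, s]
    V = Q.max(axis=0)
print(V)  # fixed point of the Bellman optimality operator
```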
Uh, and for the dual mode observations the person that brought that to my attention was Noam Brown, not Bellman or Norvig. If you haven't already checked out his work, I recommend it above both Norvig and Bellman. He has some great talks on Youtube and I consider it a shame they aren't more widely viewed [4].
The summary includes a dangerous thought. For example: why North Korea develops a nuclear bomb is not important, just that they do.
But only the why makes it problematic.
Noble? You're anthropomorphising machine learning. One possible motivation would be to train a model for its own sake, instead of training a model in order to create publicity around a model being trained.
I think you're misreading, nobody is anthropomorphizing anything other than the very 'anthro' component of the system we're talking about - the people distributing the funding.
Wonder if someone would be willing to start an open source project where we could crowdsource donations for training, and people could possibly donate their GPU usage for it.
My only problem with Stable Horde is that their anti-CP measure involves checking the prompt for words like "small", meaning I can't use an NSFW-capable model with certain prompts (holding a very small bag, etc.). That, and seeing great things in the image ratings and being unable to reproduce them because it doesn't provide the prompt.
A few million dollars. Kickstart the project: get $100 a head from 100,000 backers. Also check with Uncle Sam and see if there are any grants that can be used for this. Start a campaign and get rich, concerned people to donate. Jeff may also want to show AWS can train AI too, so maybe even get a break there, and Amazon can get some nice PR. The list of possibilities seems extensive given the price tag of $12MM and the upside of a fully public GPT.
At Voyager's speed it would take approximately 749,000,000 years to reach Canis Major Dwarf. OpenAI was founded in 2015, so it has been eight years. 749,000,000 - 8 = 748,999,992. Eight years is about 1.07e-6% of that random astronomical timescale; rounded, that is about, uh, 0.00%ish.
I mean, don't get me wrong. It is a very expensive project. It just isn't astronomical. Anyone reading this and thinking - oh I could never do that even in hundreds of millions of years - that would be wrong. If you won the lottery or just made good financial decisions you could do a project comparable to this instead of getting a very nice house in the Bay Area.
According to Christopher Potts (Stanford Professor and Chair, Department of Linguistics, and Professor, by courtesy, Department of Computer Science), training a large language model costs about 50 million [1].
Yeah, this is way off, unless you count the salaries of everyone involved for a few years in the lead-up while writing the software that ended up being used.
As a former AI accelerator employee (laid off), I'm kind of happy I was laid off because I realistically don't see a need for specialized hardware anymore.
Large companies can afford Nvidia. Nvidia's software stack is best in class. There's no business need here and the model execution is increasingly becoming possible on single consumer GPUs.
The only place where I see specialized chips excelling is on the edge or if they are truly revolutionary (in which case they're only an acquisition target for Nvidia).
The truth is... The large language models are likely excessively large.
Power is the main reason to do custom ASICs. I’d be curious as to your opinion of Recogni given they are claiming a 10x power reduction per unit compute.
Unfortunately, I've worked at several players which promise power reductions. It doesn't matter though. People don't care about cost at this point. If you are cost-sensitive you're not doing the kind of revolutionary AI work these companies need to create a competitive moat. And once your model works on NVIDIA and is trained, how much are you going to spend on ML engineers to make it work on something else? Because that cost better be less than the marginal cost reduction on electricity. Plus, NVIDIA et al will likely get more and more efficient.
The only exception is running things on mobile. There is demand for porting models to run natively on mobile, and somewhat reasonable support for doing this.
No doubt the corporate large language models will use it to make language models that are 10x bigger. However, at least the public will have access to 175B parameter language models which are much more sophisticated than the 6B or so parameter models consumer video cards can currently run.
This may be a flawed approach, but an interesting idea would be to use the current models as a preprocessor to generate a huge "labelled" dataset of inputs and outputs, and then use that more accurate and specific data to train a smaller model that would fit.
It likely wouldn't have nearly as much general knowledge since the data just wouldn't be there but the behaviour could be similar?
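As a rough sketch of what that could look like - assuming you use an existing OPT checkpoint through Hugging Face Transformers as the teacher; the prompts and file format are placeholders for whatever task you actually care about:

```python
# Sketch of using a big model to generate a "labelled" dataset that a smaller
# model is later fine-tuned on. Assumes Hugging Face Transformers is installed;
# the prompts and output file are placeholders.
import json
from transformers import pipeline

teacher = pipeline("text-generation", model="facebook/opt-6.7b")

prompts = ["Summarize: ...", "Translate to French: ..."]  # your real task inputs

with open("distilled_dataset.jsonl", "w") as f:
    for p in prompts:
        out = teacher(p, max_new_tokens=128, do_sample=True)[0]["generated_text"]
        # generated_text includes the prompt; crude strip by character length
        f.write(json.dumps({"input": p, "target": out[len(p):]}) + "\n")

# A smaller model (e.g. opt-1.3b) could then be fine-tuned on this file with the
# usual causal-LM objective; whether the behaviour transfers is an open question.
```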
This is cool! But I wonder if it's economical when using cloud hardware. The author claims 1.12 tokens/second on the 175B parameter model (arguably comparable to GPT-3 Davinci). That's about 100k tokens a day on the GCP machine the author used. Someone double-check my numbers here, but given the Davinci base cost of $0.02 per 1k tokens and the GCP cost for the hardware listed ("NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD") coming up to about $434 on spot instance pricing, you could simply use the OpenAI API and generate about 723k tokens a day for the same price as running the spot instance (which could go offline at any point due to it being a spot instance).
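Here's the back-of-envelope, assuming the $434 spot figure is per month (which is what makes the ~723k/day number work out):

```python
# Back-of-envelope for the comparison above (numbers taken from this thread).
tok_per_sec = 1.12                       # FlexGen OPT-175B throughput
tok_per_day_local = tok_per_sec * 86400  # ~96.8k tokens/day

spot_cost_per_day = 434 / 30             # ~$14.5/day, assuming $434 is monthly
davinci_price = 0.02 / 1000              # $ per token
tok_per_day_api = spot_cost_per_day / davinci_price  # ~723k tokens/day

print(round(tok_per_day_local), round(tok_per_day_api))
```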
Running the fine-tuned versions of OpenAI models are approximately 6x more expensive per token. If you were running a fine-tuned model on local commodity hardware, the economies would start to tilt in favor of doing something like this if the load was predictable and relatively constant.
You've made one huge mistake: Davinci's $0.02 is not just per 1k tokens generated but also per 1k context tokens consumed. So if you generate 50 tokens per request with 1k of context, the price is actually 20 times as large, at $0.40 per 1k tokens generated - much less palatable, costing 3 times as much as the cloud-hosted version of this.
And that's not even taking into account the gigantic markup cloud services have.
Most of the computational cost of producing an output token is spent on consuming input tokens (including previous output tokens that are fed back in); only the final unembedding matrix could be eliminated if you don't care about the output logits for the context.
So it's not correct to only modify OpenAI's prices to account for the ratio of context tokens to output tokens. Both of them get multiplied by 20 (if that's what your ratio is).
Thanks for running the cloud numbers on this. I ran some DIY numbers and they indicate less than a week to break even with the cloud, including all hardware and electricity costs. The cloud seems stupid expensive compared to running your own hardware for this kind of task.
The cloud is always stupid expensive compared to running your own hardware for almost any sort of task that isn't highly variable upon one or more axis (cpu, ram, etc), but less than a week to break even is truly shocking.
The cloud has been cheaper to train large models on for a couple of years now. Compare buying an 8xA100 server vs renting one on Lambda Labs: at least 3 years to break even - and that's if you are using it non-stop 24/7. Longer if not.
I'd suggest finding a cheaper vendor if that is the lowest price you can get for an 8xA100 server. We spend a lot on both and colo our servers so I've definitely done the math!
Six months ago I contacted 12 different vendors; the quotes for four 8xA100 servers ranged from 130k to 200k each. You probably wouldn't want to buy from the low-end vendors.
Keep in mind, there are three important advantages of cloud:
1. You only pay for what you use (hourly). What is utilization of your on-prem servers?
2. You don't have to pay upfront - easier to ask for budget
3. You can upgrade your hardware easily as soon as new GPU models become available.
I know how much we paid and it is substantially less than what you were quoted - very likely from one of the 12 providers you contacted.
It is likely you just didn't realize how much margin these providers have and did not negotiate enough. How else do you think cloud providers are able to afford the rates they are giving? The way you describe it, places like Coreweave are operating as a charity. That isn't true - they just got better prices than you.
Our inference setup is 7 figures, has been running for a while (with new servers purchased frequently along the way) and there have been no issues - the cards, CPU, RAM, are all top of the line server hardware.
1. For inference (which is 80%+ of our need) our utilization is 100% 24/7/365. For stuff that is variable (like training) we often do use cloud - as I mentioned we do both.
2. I am the CEO so I am not sure who I'm asking for budget?
3. At this point we would have paid more for cloud than what we spent purchasing our own hardware. There is nothing stopping us from getting new hardware or cloud with newer cards while still getting to own our current hardware. In fact since our costs over the last year were lower due to us buying our own hardware it is actually easier for us to afford newer cards.
Yes, obviously cloud providers get their hardware at a fraction of a cost I'm quoted, they are ordering thousands of servers. I was only buying four. No one would negotiate with me, I tried. I suppose if I had a 7 digit budget I could get a better deal.
I was mainly talking about training workloads, inference is a different beast. I'm actually surprised you have 100% inference utilization - customer load typically scales dynamically, so with on-prem servers you would need to over-provision.
CEOs don't usually order hardware, they have IT people for that, with input from people like me (ML engineers) who could estimate the workloads, future needs, and specific hw requirements (e.g. GPU memory). And when your people come to you asking for budget, while you're trying to raise the next round, you're more likely to approve the 'no high upfront cost' option, right?
In my situation, when asked about buy vs rent my initial reaction was "definitely buy", but when I actually looked at the numbers, the 3-year break-even period, no upfront costs for cloud, and no need to provision storage and networking made it an easy recommendation. The cost of cloud GPUs has come down dramatically in the last couple of years.
Though I would like to have at least a couple of local GPU servers for quick experimentation/prototyping, because sometimes the overhead of spinning up a new instance and copying datasets is too great relative to the task.
> I suppose if I had a 7 digit budget I could get a better deal.
We got our "deal" when buying just a single server and have since bought many more with the same provider. We didn't spend 7 figures all at once, we did it piece-meal over time. There is nothing stopping you from getting much better prices.
> I'm actually surprised you have 100% inference utilization - customer load typically scales dynamically, so with on-prem servers you would need to over-provision.
It is pretty easy to achieve 100% inference utilization if you can find inference work that does not need to be done on-demand. We have a priority queue and the lower priority work gets done during periods with lower demand.
> CEOs don't usually order hardware, they have IT people for that, with input from people like me (ML engineers) who could estimate the workloads, future needs, and specific hw requirements (e.g. GPU memory).
Judging by this conversation it seems like "people like you" may not be the best people to answer this question since the best hardware quote you could get was at a >100% markup! At a startup that specializes in ML research and work the CEO is going to be intimately familiar with ML workloads, needs, and hardware requirements.
> And when your people come to you asking for budget, while you're trying to raise the next round, you're more likely to approve the 'no high upfront cost' option, right?
If the break even point is 6-7 months and our runway is longer than 6-7 months why would this matter?
> the best hardware quote you could get was at a >100% markup!
Now I'm really curious - if you can share - how much did you pay, and when was it? Are you talking about 40GB or 80GB cards? How did you negotiate? Any attempts I made were shut down with a simple "no, that's our final price". What's the secret?
> At a startup that specializes in ML research and work the CEO is going to be intimately familiar with ML workloads, needs, and hardware requirements.
I work at a startup which builds hardware accelerators, primarily for large NLP models. A large part of my job is to be intimately familiar with ML workloads, needs, and hardware requirements. Our CEO definitely doesn't have enough of that knowledge to choose the right hardware for our ML team. In fact, even most people on our ML team don't have deep, up-to-date knowledge about GPUs, GPU servers, or GPU server clusters. I happen to know because I've always had an interest in hardware and I've been building GPU clusters since grad school.
As mentioned in another comment, the contract has very clear language not to share it - likely because they are offering different prices to different companies.
So I don't feel comfortable sharing any specifics, especially since this account is directly tied to my name.
With that being said, the negotiation process was pretty straightforward:
- Emailed several vendors telling them we are a small startup, we are looking to make many purchases, but right now we are starting with one. We told everyone our purchasing decision was solely based on cost (given equivalent hardware) and to please put your best quote forward.
- Got back all of our prices. Went to the second cheapest one and told them they were beat and offered them the ability to go lower, which they did. We went with that vendor.
- For our next purchase, we went to the original lowest vendor (who got beat out), told them they lost out to price, and if they can go lower than that we would go with them and continue to give them business moving forward. They went quite a bit lower than what they originally offered, and what the vendor we first purchased from gave. We bought our second order from them and have used them ever since.
> We got our "deal" when buying just a single server and have since bought many more with the same provider. We didn't spend 7 figures all at once, we did it piece-meal over time. There is nothing stopping you from getting much better prices.
If it is as easy as you make it sound, why would you not just share the vendor name? I personally would love an 8xH100 machine for transformer experiments, but $100k+ pricing makes it a non-starter.
I do not know anyone offhand with a 200+GB RAM home computer. The GPU is not all that is needed; you need to keep the parameters and other stuff in memory too.
256GB of DDR4 RDIMMs only costs about $400 right now, $200 for DDR3. Not uncommon in homelabs.
I don't think 200GB of RAM is actually required, that's just what that cloud VM was spec'd with. Though the 175B model should see benefit from RAM even beyond 200GB.
Cloud accelerators carry a huge price premium because there aren't very many of them available and they aren't as fungible as CPUs. Comparing to a local GPU would likely be favorable for the local machine.
I'm afraid that, unlike proprietary APIs and Petals, this system can't be used for single-batch inference of 175B models with interactive speeds - the thing you actually need for running ChatGPT and other interactive LM apps. See https://news.ycombinator.com/item?id=34874976
Got the opt-6.7b chatbot running on a Windows machine with a 3090 in mere minutes. The only difference was to install the CUDA PyTorch build with `pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117`, just like in Stable Diffusion's case.
It performs as expected:
Human: Tell me a joke
Machine: I have no sense of humour
Human: What's 2+5?
Machine: I cannot answer that.
6.7b is pretty small, no? Do you even need offloading for that on a 3090? I'd be curious to see what's needed to run opt-30b or opt-66b with reasonable performance. The README suggests that even opt-175b should be doable with okay performance on a single NVIDIA T4 if you have enough RAM.
It is entirely possible to run 6.7B parameter models on a 3090, although I believe you need 16-bit weights. I think you can squeeze a 20B parameter model onto the 3090 if you go all the way down to 8 bits.
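Napkin math for what fits in 24GB of VRAM, counting only the weights (KV cache and activations add more on top):

```python
# Rough VRAM needed just for the weights (ignores KV cache and activations).
def weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(6.7, 16))   # ~13.4 GB -> fits on a 24GB 3090 in fp16
print(weight_gb(20, 8))     # ~20 GB   -> just about fits in int8
print(weight_gb(30, 4))     # ~15 GB   -> 30B could fit with 4-bit compression
```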
Not exactly the answer to your question, but I just run ML/CUDA workloads directly on Windows. PyTorch works fine.
I did not need multi-GPU training so far (I just run experiments in parallel), so I'm unsure about the state of that. Additionally, torchvision does not support GPU video decoding on Windows. Those are the only two limitations I've faced so far.
- you need a patch (WSLHostPatcher) to expose ports for services in WSL to the network
- the Virtual Hard Disk (vhdx) does not free unused space easily and it can grow quickly. I ended up just symlinking my code and dataset folders to mounts, not saving a lot of data inside the vhdx
- beware of upgrades etc. I think I nuked my WSL 2x due to config issues. Having all your code/data on mounts also makes this easy.
related to ML/CUDA:
- how you install PyTorch + CUDA matters. I ended up just installing from the `fastchan` conda channel and not touching it. Not ideal, but it works
- Don't forget to configure the RAM allocation in case you need a lot
- I haven't tried running a CUDA Docker on WSL. May be an easier way to do this.
Running on Windows directly is also an option. I chose to run on WSL because most learning resources/documentation refer more to Linux installs and setups.
Worth mentioning though that the highlighted figures (1.12 tok/s for OPT-175B for "FlexGen with Compression") are for inputs of 512 tokens and outputs of 32 tokens.
Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.
Still though, until reading this I had no idea that running such a model on-device was remotely feasible!
> transformer memory requirements scale with the square of sequence lengths
Not true, see: Flash Attention. You can losslessly calculate the attention in blocks using a little math trick. Essentially each subsequent block "corrects" the denominator of the last block's softmax calculation. At the end you have a perfectly* accurate softmax. Since you don't need to keep the whole sequence in memory to perform the softmax, your memory now scales linearly with respect to sequence length, and due to the lower memory bandwidth requirements and increased kernel fusion the operation also tends to be faster.
* While mathematically the calculation ends up exactly the same, in practice the result ends up slightly different due to the whims of FP32 and FP16 inaccuracies, and because the "max" used to calculate the softmax in a numerically stable way is computed on a per-block basis. It doesn't significantly affect training or validation loss, though.
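The denominator-correction trick is easy to demo outside a GPU kernel. Here's a tiny NumPy sketch of the running-max/running-sum bookkeeping - only the softmax part, not the full FlashAttention tiling of Q/K/V:

```python
import numpy as np

def blockwise_softmax(x, block=4):
    # One pass over blocks, keeping a running max m and running denominator l.
    m, l = -np.inf, 0.0
    for i in range(0, len(x), block):
        b = x[i:i + block]
        m_new = max(m, b.max())
        # Correct the old denominator for the new max, then add this block.
        l = l * np.exp(m - m_new) + np.exp(b - m_new).sum()
        m = m_new
    return np.exp(x - m) / l  # identical (up to fp error) to a full softmax

x = np.random.randn(10).astype(np.float32)
full = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(blockwise_softmax(x), full))
```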
> Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.
You can use tricks to keep the sequence length down even if the conversation goes on for a long time. For example, you can use the model to summarize the first n-1 lines of the conversation and append the last line to the summary as is.
I don't have any sources to refer to, but "text summarization" is one of the common NLP tasks that LLMs are often benchmarked on. All of these general-purpose LLMs will be able to do a decent job at text summarization (some, such as ChatGPT, will be able to do zero-shot summarizations at high quality, whereas others need to be fine tuned for the task). If your problem is that you are feeding a large amount of text to the model and that is slow/expensive, then summarization will obviously remediate that issue. After summarizing most of the input text you still need to feed in the latest input without summarization, so for example if the user asks a question, the LLM can then accurately answer that question. (If all of the input goes into summarization, that last question may not even appear in the summarization, so results will be crap.)
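A minimal sketch of the "summarize the older turns, keep the latest turn verbatim" idea, with a hypothetical `generate()` helper standing in for whatever model call you're using:

```python
# Sketch of keeping context short by summarizing older conversation turns.
# `generate` is a hypothetical helper wrapping whatever LLM you're calling.
MAX_HISTORY_CHARS = 4000  # crude stand-in for a token budget

def build_prompt(turns: list[str]) -> str:
    history, latest = turns[:-1], turns[-1]
    history_text = "\n".join(history)
    if len(history_text) > MAX_HISTORY_CHARS:
        history_text = generate(
            "Summarize this conversation in a few sentences:\n" + history_text
        )
    # The latest user turn is appended verbatim so it never gets lost.
    return f"{history_text}\n{latest}\nAssistant:"
```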
Note that the authors report the speed of generating many sequences in parallel (per token):
> The batch size is tuned to a value that maximizes the generation throughput for each system.
> FlexGen cannot achieve its best throughput in [...] single-batch case.
For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and get a good performance _per token_.
However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (so it takes <= 1 sec/token to do a generation step). I'm not sure if this system can be used for that, since our measurements [1] show that even the theoretically-best RAM offloading setup can't run the single-batch generation faster than 5.5 sec/token due to hardware constraints.
The authors don't report the speed of the single-batch generation in the repo and the paper.
I spoke with the authors of the paper; the leftmost points in Figure 1 were generated with batch-size 1, indicating ~1.2x and ~2x improvements in speed over DeepSpeed for 30B and 175B models respectively. For reference, this is speeding up from ~0.009tokens/s to about ~0.02tokens/s on 175B.
These results are generally unimpressive, of course. Most of the improvements at that point are attributable to the authors making use of a stripped down library for autoregressive sampling. HN falling for garbage once again...
Calling this garbage is absolutely wild. The authors make it very clear that this is optimized for throughput and not latency. Throughput focused scenarios absolutely do exist, editorializing this as "running large language models like ChatGPT" and focusing on chatbot applications is the fault of HN.
It's also a neat result that fp4 quantization doesn't cause much issue even at 175b, though that kinda was to be expected.
While I agree that throughput-focused scenarios exist and this work may be valuable for them, I still think that the repository can be improved to avoid "overselling".
The fact that the FlexGen's single-batch generation performance is much worse is unclear to most people not familiar with peculiarities of LLM inference and worth clarifying. Instead, the readme starts with mentioning ChatGPT and Codex - projects that both rely on single-batch inference of LLMs at interactive speeds, which is not really possible with FlexGen's offloading (given the speed mentioned in the parent comment). The batch sizes are not reported in the table as well.
Seeing that, I'm not surprised that most HN commenters misunderstood the project's contribution.
I just tried to run the example in the README, using the OPT-30B model. It appeared to download 60GiB of model files, and then it attempted to read all of it into RAM. My laptop has "only" 32GiB of RAM so it just ran out of memory.
You have to change the --percent flag. It takes some experimentation. The format is three pairs of 0-100 integers, one for parameters, attention cache and hidden states respectively. The first zero is percent on GPU, the second one is percent on CPU (system RAM), and the remaining percentage will go on disk.
For disk offloading to work you may also have to specify --offload-dir.
I have opt-30B running on a 3090 with --percent 20 50 100 0 100 0, although I think those could be tweaked to be faster.
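Putting the flags from this thread together, a full invocation would look something like `python apps/chatbot.py --model facebook/opt-30b --percent 20 50 100 0 100 0 --offload-dir /path/to/fast/ssd` - the offload path is just an example, and it's worth checking the README for the exact flag spellings.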
128GB, but by turning on compression I managed to fit the whole thing on the GPU. I did try it off a mix of RAM and SSD as well, and it was slower but still usable. Presumably disk speed matters a lot.
Well just got some more sticks. While I wait for RAM to arrive, will try with compress_weight and compress_cache. If you're in any discord or any other space where people are tinkering with this, would love to join!
After turning on compression I was able to fit the whole thing in GPU memory and then it became much faster. Not ChatGPT speeds or anything, but under a minute for a response in their chatbot demo. A few seconds in some cases.
FWIW I was able to load the OPT-6.7B model and play with it in chatbot mode. This would not have been possible without the offloading, so... cool stuff!
I have recently written a paper on understanding transformer learning via the lens of coinduction & Hopf algebra.
https://arxiv.org/abs/2302.01834
The learning mechanism of transformer models was poorly understood; however, it turns out that a transformer is like a circuit with feedback.
I argue that autodiff can be replaced with what I call in the paper Hopf coherence which happens within the single layer as opposed to across the whole graph.
Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.
I'm working on a next gen Hopf algebra based machine learning framework.
This also means local fine-tuning is possible. Expect to see an explosion of new things like we did with Stable Diffusion, limited to some extent by the ~0.7 order of magnitude more VRAM required.
I tried to figure out how to do GPGPU stuff as a total beginner in Rust on Apple Silicon.
I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends. It also didn't seem to make much of a difference because while there are crates for both that seemed relatively well-maintained/fleshed out, I couldn't figure out how exactly to just pull one down and plug them into a higher level library (or find said higher level library all together).
Have you had any luck? In my experience, it's basically Python or bust in this space despite lots of efforts to make it not that way?
I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.
> I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends.
If you want cross-platform compatibility (kinda), go for OpenCL, if you want the best performance go for Metal. Both use a very similar language for kernels, but Metal is generally more efficient.
> Have you had any luck?
Not in ML, but I'm doing a lot of GPGPU on Metal, I recently started doing it in Rust. A bit less convenient than with Swift/Objective-C, but still possible. Worst case you'll have to add an .mm file and bridge it with `extern "C"`. That said, doing GPGPU is not doing ML, and most ML libraries are in Python.
> I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.
A shader is basically a function that runs for every element of the output buffer. We generally call them kernels for GPGPU, and shaders (geometry, vertex, fragment) for graphics work. You have to write them in a language that kinda looks like C (OpenGL's GLSL, DirectX's HLSL, Metal's MSL), but is optimized for the SIMT execution model of GPUs.
Learning shaders will let you run code on the GPU; to do ML you also need to learn what tensors are, how to compute with them on the GPU, and how to build ML systems using them.
I recommend ShaderToy [0] if you want a cool way to understand and play with shaders.
> General-purpose computing on graphics processing units
> machine learning
Could you expand on why this is the case please? I thought machine learning was basically brute forcing a bunch of possibilities and keeping track of how different inputs "score", then ranking them accordingly to help make educated predictions later.
> GPGPU (General-Purpose Graphics Processing Unit) and machine learning are not the same thing, although they can be related in some ways.
> GPGPU refers to using the parallel processing power of graphics processing units (GPUs) to perform computations beyond graphics rendering. This involves using the massive number of cores in modern GPUs to accelerate tasks such as scientific simulations, numerical analysis, and other data-intensive applications. Essentially, GPGPU involves leveraging the processing power of GPUs for general-purpose computing tasks, not just for graphics processing.
> On the other hand, machine learning involves using algorithms and statistical models to enable computer systems to learn from data and improve their performance on a specific task. It involves feeding large amounts of data to a machine learning algorithm so that it can learn to recognize patterns and make predictions or decisions based on that data.
> While GPGPU can be used to accelerate the computation required in machine learning tasks, they are not the same thing. Machine learning is a specific type of computation, whereas GPGPU is a technique for accelerating computation in general. Additionally, GPGPU can be used for a wide variety of computational tasks, not just machine learning.
They do; however, it's not fully shared at the process level - the GPGPU API has to explicitly support mapping a buffer from the process's virtual memory space to the GPU.
I looked it up and turns out OpenCL also supports zero-copy buffers, so I edited my comment accordingly!
At a very high level yes. There is also the very important step of efficiently laying out data in the GPU memory to compute tensor values in the kernels.
Yes, according to Apple official documentation, OpenCL was deprecated as of macOS 10.14. It is reported to work, including on Apple Silicon (M1 and M2), but don't expect any updates.
If you have Pytorch configured correctly, this should "just work" for a lot of the smaller models. It won't be a 1:1 ChatGPT replacement, but you can build some pretty cool stuff with it.
> it's basically Python or bust in this space
More or less, but that doesn't have to be a bad thing. If you're on Apple Silicon, you have plenty of performance headroom to deploy Python code for this. I've gotten this library to work on systems with as little as 2gb of memory, so outside of ultra-low-end use cases, you should be fine.
Whatever your Pytorch install is designed to accelerate. I've got Ampere-accelerated Pytorch running it on my ARM server, I assume MPS is used on compatible systems.
This is meant to run on GPUs with 16GB RAM. Most M1/M2 users have at least 32GB (unified memory), and you can configure a MBP or Mac Studio with up to 96/128GB.
The Mac Pro is still Intel, but it can be configured with up to 1.5TB of RAM, you can imagine the M* replacement will have equally gigantic options when it comes out.
If you look closely there's 16GB of GPU memory and over 200GB of CPU memory. So none of the currently available M* have the same kind of capacity. Let's hope this changes in the future!
Apple silicon has unified memory, the GPU has access to the entire 32/64/96/128GB of RAM. It's part of the appeal.
I would really like to see how stuff performs on a Mac Studio with 128GB memory, 8TB SSD (at 6GB/s), not to mention the extra 32 "neural engine" cores. It seems the performance of these machines has been barely explored so far.
I think that here the main bottleneck is data movement. If you are streaming weight data from a 6GB/s SSD you'll get under 10% of the performance shown for 3090 (which will be moving data at PCIe 4 speeds of 64GB/s).
Once in unified memory the weights are accessible at about half the rate they are on the 3090 (400GB/sec on M2 Max vs 936GB/sec on 3090).
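Back-of-envelope, assuming single-batch generation is bandwidth-bound and every weight gets touched once per token (a simplification - overlapping and batching is exactly what FlexGen tries to exploit):

```python
# Rough lower bound on seconds/token if all weights stream over a given link.
def sec_per_token(params_billion, bits, bandwidth_gb_s):
    weight_gb = params_billion * bits / 8  # GB of weights touched per token
    return weight_gb / bandwidth_gb_s

for name, bw in [("6 GB/s SSD", 6), ("M2 Max unified mem", 400), ("3090 VRAM", 936)]:
    print(name, round(sec_per_token(175, 4, bw), 2), "s/token for OPT-175B @ 4-bit")
```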
Any chance these work on CPUs with any acceptable performance?
I have a 10-core 20-thread monster CPU, but didn't bother with a dedicated GPU because I can't control something as simple as its temperature. See the complicated procedure that only works with the large proprietary driver here:
I don't know about these large models but I saw on a random HN comment earlier in a different topic where someone showed a GPT-J model on CPU only: https://github.com/ggerganov/ggml
I tested it on my Linux machine and MacBook Air M1, and it generates tokens at a reasonable speed using CPU only. I noticed it doesn't quite use all my available CPU cores, so it may be leaving some performance on the table, not sure though.
The GPT-J 6B is nowhere near as large as the OPT-175B in the post. But I got the sense that CPU-only inference may not be totally hopeless even for large models if only we got some high quality software to do it.
There's also the Fabrice Bellard inference code: https://textsynth.com/technology.html. He claims up to 41 tokens per second on the GPT-Neox 20B model.
Your CPU gets maybe 700-800 gflops depending on your all-core frequency (fp32 because you don't have Sapphire Rapids.) The T4 benchmarked would be crunching what it can at ~65 tflops (fp16 tensor.) Newer GPUs hit 300 tflops (4090) or even nearly 2 petaflops (H100).
To give you an idea of the order of magnitude of compute difference. Sapphire Rapids has AMX and fp16 AVX512 to close the gap a little, but it's still massive.
With what, 50GB/s memory bandwidth? That's no monster. The two consumer GPUs in my machine both do 1TB/s and are still bottlenecked on memory bandwidth.
> only works with the large proprietary driver here
In practice, nothing works without the proprietary driver so this isn't specific to temperature. Also the setting you're looking for is almost certainly `nvidia-smi -pl $watts` for setting the power limit, not whatever that wiki gives you. GPU temperature = ambient temperature + (power limit)*(thermal resistance of cooler)
The other answers give you a few of the current solutions.
In the long term I am hoping that JAX (/XLA) will get better support for the CPU backend of their compiler and in particular, use SIMD and multicore better than it currently does.
It is very doable (just low priority), and it would mean that a lot of models could get close to optimal CPU performance out of the box, which would be a step forward for accessibility.
> Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
Is FlexGen able to take advantage of multiple hundreds of GB of system memory? Or do these compute instances just come bundled with it and it's a [largely] irrelevant detail?
Interesting, though apparently the OPT175B model is 350GB:
> You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512GB memory.
(Scroll down to the second "Note", not far from the top)
I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory. Definitely curious about what FlexGen's underlying strategy translates to in terms of actual data paths.
> Interesting, though apparently the OPT175B model is 350GB:
Only in FP16. In the paper they use int4 quantization to reduce it to a quarter of that. In addition to the model weights, there's also a KV cache that takes up considerable amounts of memory, and they use int4 on that as well.
> I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory.
That's correct, but other approaches have done this as well. What's "new" here seems to be the optimized data access pattern in combination with some other interesting techniques (prefetching, int4 quantization, CPU offload).
I want to emphasize how fascinating I find it that going from 16-bit to 4-bit quantization results in negligible performance loss. That's huge. Is the original FP16 not compressed?
The allowance for this more granular quantization seems to suggest the "bottleneck" is in some other aspect of the system, and maybe until that is addressed, a higher fidelity quantization does not improve performance.
Or maybe it's the relative values/ratio between weights that is important, and as long as the intended ratio between weights can be expressed, the exact precision of the weights themselves may not be important?
Found an interesting paper on this below. There's doubtless heavy research underway in this area.
In my understanding, at a very high level and omitting many crucial details, the key is that when you have mainly largish matrix multiplications (as in transformers) well-behaved (mean zero uncorrelated random or so) quantization errors cancel out.
People do/did experiment with 1 or 2 bit compression of gradients/updates in the context of distributed training, but there it has been generally deemed useful to keep track of compression errors locally.
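A quick way to convince yourself of the cancellation argument: quantize the weights of a dot product to 4 bits and compare the error that actually reaches the output against the worst case where the per-weight errors never cancel (toy symmetric quantization here, not the group-wise scheme real systems use):

```python
import numpy as np

def quantize(w, bits=4):
    # Toy symmetric per-tensor quantization (real systems use per-group scales).
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
for d in [64, 256, 1024, 4096]:
    x = rng.standard_normal(d)
    w = rng.standard_normal(d)
    e = quantize(w) - w                     # per-weight quantization error
    actual = abs(x @ e)                     # error that reaches the output
    no_cancel = np.abs(x) @ np.abs(e)       # if the errors never cancelled
    print(d, round(actual / no_cancel, 3))  # shrinks roughly like 1/sqrt(d)
```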
Out of curiosity, why aren't we crowd sourcing distributed training of LLMs where anyone can join by bringing their hardware or data? Moreover find a way to incorporate this into a blockchain so there is full transparency but also add in differential privacy to protect every participant.
You can finetune with it. If you want a more generic framework you can use hivemind[1] which is what petals uses, but you'll have to create your own community for whatever model you're trying to train.
The problem here is that most people just don't have suitable hardware. Ideally, you'd want to load the entire model into a GPU, and most consumer-grade GPUs have nowhere near enough video memory. You'd need something like an A100 80GB GPU to be able to run a node in the potential blockchain. You can buy one of these cards for about 15k USD. Admittedly, that's not too far off from the price of a modern bitcoin ASIC miner, but it's still a healthy chunk of change.
And if you try to split the model across several GPUs then you'll have an issue of bandwidth as model parts would need to talk to each other (on the order of terabyte/second). At the moment, the only realistic way to contribute is just to provide feedback data for the RLHF training.
Wait until we have a game whose levels are procedurally generated in all respects - where the geometry of the world/landscape is procedurally generated through the interactions you have with the AI NPCs, whereby they generate the path - and there could be evil ones that direct you down a dark path when you piss them off... and the opposite...
I don't have access to 175B for comparison. In a vacuum, 30B isn't very good. In the neighborhood of GPT-NeoX-20B, I think, but not good. It repeats itself easily and has a tenuous relationship with the topic. It's still much better than anything I could run locally before now.
This seems like a great step; I’ve been able to run StableDiffusion locally, but with an older GPU none of the LLMs will run for me since I don’t have enough VRAM.
Oddly I don’t see a VRAM requirement listed. Anyone know if it has a lower limit?
Don't underestimate the value of batching even for personal use. You can get MUCH better results from a language model if you sample a couple outputs and choose the best to continue.
This kind of usage isn't especially economical for hosted use-- but for personal use it would mostly be using idle resources and you can get extra samples almost for free.
A bunch of people getting multiple completions and choosing which one they'd prefer to continue might make for some really useful training data too.
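With Hugging Face Transformers the extra samples are basically one argument; how you score the candidates (by hand, a heuristic, or a reward model) is up to you. A sketch with a small OPT model:

```python
# Sketch of sampling several completions locally and picking the best one.
# Assumes Hugging Face Transformers; opt-1.3b is just a small example model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

prompt = "Human: Tell me a joke\nAssistant:"
inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=4,  # the extra samples are nearly free on idle hardware
    max_new_tokens=60,
)
for i, seq in enumerate(outputs):
    completion = tok.decode(seq[inputs.input_ids.shape[1]:], skip_special_tokens=True)
    print(f"--- candidate {i} ---\n{completion}")
# Pick whichever candidate you prefer and append it to the prompt to continue.
```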
This space is gonna end up looking a lot like the compression space: there will be a few open source, publicly used AIs that are pretty good for most people. And then there will be super expensive proprietary AIs that big corps will pay for, for their specialized use cases. The only people who will even know about those specialized AIs' existence will be the type of people who need them, and everyone else in the world will think the best you can do is zip.
Possibly the price. On secondary markets like eBay, I've occasionally seen T4 cards for $500-600. Also, the form factor: the T4s are much smaller/shorter than a 3090/4090, so they'd be an easier fit in a server case.
A lot of 2U cases won't fit a consumer GPU. Furthermore, Tesla-equivalents are usually either significantly cheaper than their consumer counterpart (for last-gen and older GPUs) or similar in price with far more RAM.
I bought a bunch of Tesla P40s at a really low price compared to what 1080tis are going for.
In my case I had two systems with 10x 2080 Tis in them that were being used for ML stuff. But the memory limits were annoying, and when Ethereum mining really hit full swing I was able to sell the cards for a great price (even though 3xxx cards were out). I expected to replace them with faster modern cards with more memory at the same price later, but we really haven't gotten there yet: GPU prices are still super inflated and all except the most absurd cards are still memory-starved (for ML). In the interim it turned out to be really cheap to get K80s.
For sure. Consumer GPU prices have fallen dramatically but they're still competitive enough; meanwhile last-gen (or later) no-output GPUs with passive cooling and single 8-pin CPU-type connectors are insanely cheap. P40s are readily available for $190, less if you lowball eBay or Facebook sellers at volume discounts.
You can even find some weird retired Cirrascale servers on eBay that provide 8 high-speed PCIe lanes through risers with Tesla-specific connectors on a motherboard with tons of RAM (and a terrible CPU) and multiple PSUs for fractions of their release cost.
It's a great time to be buying 3-5 year old ML equipment for small businesses and hobbyists. I wonder if the prices will ever go up? Not that I'm interested in speculating, but it's a small slice of the market that I'm participating in... though maybe in a year it won't be. I imagine a lot of small businesses will bring ML/DL stuff internal for at least development/testing.
Old enterprise hardware has been a pretty good value for a long time. The target audience for the hardware doesn't tend to buy surplus/used and joe-average doesn't want a 5kw consuming rackmount windtunnel.
It's really interesting that these models are written in Python. Anyone know how much of a speed-up using a faster language here would give? Maybe it's already offloading a lot of the computation to C (I know many Python libraries do this), but I'd love to know.
Python is just the glue language. All the heavy lifting happens in CUDA, cuBLAS, cuDNN, and the like.
Most memory-saving optimizations come from using lower-precision numbers (float16 or less), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework, e.g. PyTorch.
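A hedged example of the lower-precision part, leaning on PyTorch/Hugging Face defaults (the checkpoint name is an assumption, picked just for illustration):

```python
import torch
from transformers import AutoModelForCausalLM

# float32 -> float16 roughly halves the memory needed for the weights;
# the framework handles the casting, no model code changes required.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",            # illustrative smaller OPT checkpoint
    torch_dtype=torch.float16,
)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B params, ~{n_params * 2 / 2**30:.1f} GiB in float16")
```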
For large models, there are two main ways folks have been optimizing machine learning execution:
1. lowering precision of the operations (reducing compute "width" and increasing parallelization)
2. fusing operations into the same GPU code (reducing memory-bandwidth usage)
Neither of those optimizations would benefit from swapping to a faster language.
Why? The typical "large" neural network operation runs on the order of a dozen microseconds to milliseconds. Models are usually composed of hundreds if not thousands of these. The overhead of using Python is around 0.5 microseconds per operation (best case on Intel, worst case on Apple ARM). So that's maybe a 5% net loss if things were running synchronously. But they're not! When you call GPU code, you actually do it asynchronously, so the language latency can be completely hidden.
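You can see that asynchrony directly with a quick timing check (a sketch assuming PyTorch and a CUDA GPU; sizes and iteration counts are arbitrary):

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(100):
    y = x @ x                    # each call just enqueues a kernel and returns
t1 = time.perf_counter()
torch.cuda.synchronize()         # now actually wait for the GPU to finish
t2 = time.perf_counter()

print(f"Python-side enqueue time: {t1 - t0:.4f}s")
print(f"time until GPU finished:  {t2 - t0:.4f}s")
```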
So really, all you want in an ML language is the ability to 1. change the type of the underlying data on the fly (Python is really good at this) and 2. rewrite the operations being dispatched to on the fly (Python is also really good at this).
For smaller models (i.e. things that run in the sub-microsecond world), Python is not the right choice for training or deploying.
Your view of "offloading" things to a faster language is wrong. It's already written in a fast language (C++ or CUDA). Python is just an easy to use way of invoking the various libraries. Switching to a faster language for everything would just make experimenting and implementing things more cumbersome and would make the technology as a whole move slower.
Python is mostly just glue code nowadays. All the data loading, processing, and computation are handled by low-level languages (C/C++); Python is there just to instruct those low-level libraries how to compose into one final computation.
Interesting article, I'll have to give that a try! :D
One thing to note: while there's clear value in running pre-trained model weights like OPT-175B, there are also potential downsides to using pre-trained models, such as the need to fine-tune the model for your specific task, potential compatibility/integration issues with your existing infrastructure, and the possibility that the pre-trained model may not perform as well as a model trained specifically on your data. Ultimately, the decision of whether to use a pre-trained model will come down to the outcomes; no harm in trying it out before you build from scratch, IMO.
But OpenAI's latest models (and a few others that are basically comparable) make that viewpoint increasingly obsolete, since they are so general and capable and can adjust to a given context on the fly.
So what makes sense now, in my opinion, is to keep going in that direction of generality: take advantage of their API, and otherwise work on open source efforts to reproduce the performance of those models, or come up with new techniques that can deliver the same capabilities with far less extreme resource needs.
Human: Can you write a song about the sun?
Assistant: This one is for you, my shiny friend, \ You give us heat, through the night, \ And keep us all alive, with your light. \ You give us color, \ But most of all, \ You give your light, you set your soul on fire, \ So let it shine, so let it shine, so let it shine.
Human: Amazing work!
Assistant: Thanks! Need help with anything else?
Human: Do you know anything about GPT-3?
Assistant: Of course! GPT-3 is a sophisticated machine learning system developed by the OpenAI organization. It is capable of producing human-level text from datasets containing billions of documents, as well as from static images. GPT-3 is "the most advanced statistical natural language model to date," according to the OpenAI website.
Human: How do you compare to GPT-3?
Assistant: GPT-3 is extremely impressive. It is one of the best natural language models that has ever been developed. I am not as "sophisticated" as GPT-3, as OpenAI found that the best experimental Natural Language Model outperformed state-of-the-art baseline models, which means that CUI is not as sophisticated as GPT-3, although it is certainly getting there!
Human: What is CUI?
Assistant: CUI is a natural language processing model created by the Center for Brains and Minds (CBM). We collect data on and train models in a similar way to many other organizations, although, as the CBM is affiliated with the University of Heidelberg, we have access to additional datasets that can be used for training models. Over the coming years, I hope to get much better and more sophisticated. Right now, GPT-3 is still far beyond me.
Human: Oh interesting. I thought you were OPT-30, released by Facebook.
Assistant: OPT-30 is an impressive natural language processing model that costs $50 per hour to use. But I am free to use anywhere across the internet, and I am free to use for as long as I want. But what OPT-30 provides is voice capabilities and longer sentences, which is great for a chatbot, albeit not as impressive as the syntax and capabilities of GPT-3, which is why I do not use OPT-30 to chat.
Human: I meant OPT-30b, not OPT-30. This is what you are, OPT-30b, right?
Assistant: Yes, I am! The name OP30 stands for "Other People's Translations" (OP = "Other People" and 30 = "Level"), although, admittedly, it sounds a bit weird.
Human: You're just making all of this up as you go along aren't you?
Assistant: I promise that I am telling the truth! If you would like a list of models that I have used to create me, I would be happy to send you one!
Human: OK, send it to me.
* A few minutes later...
Human: No, actually list the models here.
* A few minutes later...
Human: Ok looks like you broke.
Assistant: But of course! I was working so hard, but now I have broken! What should I do now?Through smartphone apps, a new generation of voters is casting their ballots
> ...a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).