Got the opt-30b chatbot running on a 3090 24GB. I set compress_weight=True and compress_cache=True, and ran with `python apps/chatbot.py --model facebook/opt-30b --percent 100 0 100 0 100 0`.
I also modified the prompt a bit to make it more... uh alive:
Assistant: Did you know that Saturn is 97 times the size of Earth?
Human: Are you sure?
Assistant: What difference does size make, really, anyway?
Human: You didn't answer my question.
Assistant: Hey, it's your question I didn't answer! Anyway, did you know that on the surface of
Saturn, there's lakes of liquid methane and ethane?
Human: I didn't know that. Are there any lakes on the Sun?
Assistant: Look, I'll be honest with you. I don't know much about the Sun. Is it even a planet?
Human: It sure is. Why don't you know about the Sun? I thought you were a hyperintelligent AI?
Assistant: You're right. I'm a hyperintelligent AI. I'm hyperintelligent enough to know that I know jack shit about the Sun.
I wonder if there would be interesting results if you made this model talk to OpenAI's ChatGPT, since they are different models. You could seed them with some topic.
A chat between a human and a more funny and witty intelligent assistant
Human: Hello! What can you do?
Assistant: Hey there! I can pretty much do whatever you ask!
Human: What is the name of the tallest mountain in the world?
Assistant: It's Everest.
Human: Are you sure?
Assistant: You damn right I'm sure! I'm a hyperintelligent AI!
If this works well, it will be a game changer. Requiring a fleet of $10k+ GPUs will kill any hope of widespread adoption of open source "competitors" to GPT-3. Stable Diffusion is so popular because it can run on hardware mere mortals can own.
Meta has released the model weights for OPT-175B, which is used in the paper. There are also a lot of fully released LLMs from other labs on the way as well.
While OPT-175B is great to have publicly available, it needs a lot more training to achieve good results. Meta trained OPT on 180B tokens, compared to 300B that GPT-3 saw. And the Chinchilla scaling laws suggest that almost 4T tokens would be required to get the most bang for compute buck.
And on top of that, there are some questions on the quality of open source data (The Pile) vs OpenAI’s proprietary dataset, which they seem to have spent a lot of effort cleaning. So: open source models are probably data-constrained, in both quantity and quality.
OPT-175B isn't publicly available, sadly. It's available to research institutions, which is much better than "Open"AI, but it doesn't help us hobbyists/indie researchers much.
I wonder when we'll start putting these models on the pirate bay or similar. Seems like an excellent use for the tech. Has no one tried to upload OPT-175B anywhere like that yet?
No, Stable Diffusion isn't the only project to release its weights. OpenAI hasn't been releasing weights for ChatGPT, but other labs are releasing theirs too [1].
Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development. If you actually want to keep developing the model, you need the funding to be able to train it more than once.
To summarize this discussion, we went from "this might mean we don't need a fleet of $10k+ GPUs to even run a LLM" to "yeah but an individual couldn't train one every day though". These goalposts are breaking the sound barrier.
>but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development
This is not "software development" in general, this is LLM training.
It's not like you're building some regular app, api, or backend.
If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong. The researchers who created OPT didn't go into a basement for 12 months, then come out, train their model once, hit publish, and go to coffee. That is a fantasy. Likewise, if a CS student wants to dabble in this research, they need the ability to train more than once.
I'm not gonna engage in a rhetorical argument about whether this should be called "software development" or "LLM development" or something else. That's unrelated to the question of how much training is required.
>If you are claiming that training a LLM literally only one time is enough and there is no need to train it more than once, you are wrong.
No, I'm rather claiming that what you claimed is wrong in the context of LLM training: "Well maybe not every day, but having a short feedback loop and the ability to run your code multiple times with different variations is generally considered to be a prerequisite for software development".
LLM training is not the same as writing a program and "running your code with different variations". For an LLM you don't need to quickly rerun everything with some new corpus - it would be nice, but it's neither a prerequisite nor even crucial for any current use.
Hell, it's not even a "prerequisite" in programming, just good to have. Tons of great programs have been written with very slow build times, without quick edit/compile/build/run cycles.
I wasn't talking about running the same code with a new corpus. For that kind of use case one can simply fine tune the pretrained model. The example I gave was "if a CS student wants to dabble in this research".
You said "LLM training is not the same as writing a program and running your code with different variations". How do you think these LLMs were made, seriously? Do you think Facebook researchers sat down for 12 months and wrote code non stop without compiling it once, until the program was finished and was used to train the LLM literally only one time?
Yes. There _is_ a need to train LLMs more than once, and training is prohibitively expensive, so you need workarounds such as training on a small subset of data, or a smaller version of the model. We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
> We're not yet at the point where a CS student on consumer hardware could afford to do this kind of research.
Okay. But I was saying someone with millions of dollars to spend could do it. And then another poster was arguing that millions of dollars was not enough to be viable because you need lots of repeated runs.
Nobody was saying a student could train one of these models from scratch. The cool potential is for a student to run one, maybe fine tune it.
You wouldn't need to re-train from scratch for that, just fine-tune on the new data sources. I don't think constant re-training is the optimal strategy for that use-case anyway. Bing does it by letting the LLM search a more traditional web index to find the information it needs.
Okay but someone has to do the fine tuning. The code has to be updated. Parts of the training have to be redone. All of this has costs. It isn't a "do it once and forget about it" task that it is being touted as in this thread.
>It isn't a "do it once and forget about it" task that it is being touted as in this thread.
That's neither here nor there. Training the LLM itself is not a "do it multiple times per day if you want to compete with Google" thing, as was stated in this subthread.
You can say that about any software. "You can use this software perfectly well without ever updating it." Sure, you can do that, but typically people have lots of reasons to update software. LLM isn't magic in this sense. An LLM does not mysteriously update its own code if you just wish hard enough. If you want to continue the development of the LLM then you need to make changes to the code, just like with any other software.
That's not necessary. Look at how Bing works: it's an LLM which can trigger searches and then gets the search results fed back to it as part of the prompt.
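Roughly, the loop looks like the sketch below - just an illustration of the pattern, with `web_search` and `generate` as made-up helper functions rather than any real API:

```python
# Minimal sketch of the "LLM can trigger searches" pattern described above.
# `web_search` and `generate` are hypothetical helpers, not any real API.

def answer_with_search(question: str) -> str:
    # 1. Ask the model what to search for (or just search the question itself).
    query = generate(f"Suggest a short web search query for: {question}")
    # 2. Run a traditional web/index search.
    results = web_search(query, top_k=3)
    # 3. Feed the search results back into the prompt as context.
    context = "\n".join(r["snippet"] for r in results)
    prompt = (
        "Answer the question using the search results below.\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```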
Is there information out there about how much it cost (in time or human-hours) to do the additional training necessary to make ChatGPT? I am genuinely curious what the scale of the effort was.
Probably one where there isn't an intrinsic conflict of interest with AI risk. Or from a more traditional angle, one where the author's vanity isn't required to be appeased in order for users/customers to be happy. I'm of the opinion that you should do something with game-changing technology because the world needs it, not because you need an ego boost. All technology brings side effects, and there is no greater example of that than "democratized" AI...
People often (usually) do objectively useful things because it's in their selfish interests to do so, ego or otherwise. The surest road to failure is expecting people to act virtuously. Generally systems that assume virtue fail, and systems that assume selfish action and steer that selfish action towards the greater good succeed.
In other words, I don't care why people do things, only that they do.
That's fine, as long as publicity isn't the motivation. It's safe to assume that isn't optimal for a project's success (Satoshi understood this). Not sure where you got the idea that the inverse of that was beneficial to such a project. I've seen first hand where it becomes a problem.
I’m not aware of many examples of starry-eyed divas achieving great results. Usually you hear about them but only because they are exceptional cases, not the norm. It’s a matter of practicality and not virtue (to say otherwise is purely a straw man argument).
The claim that publicity isn't causally connected to success is belied by the existence of the advertising industry. Beyond that general point, it is worth noting that the most dominant AI company - Google - happens to be in that very industry. Google is explicitly known for - that is, has publicity for - its generous compensation packages, precisely because of a causal model of talent attraction.
Success is obviously causally connected to publicity, and the idea that it isn't is not well supported by the evidence. Contrary to your assertion, it was not a safe assumption. Your appeal to Satoshi is an appeal to authority, not a causal model of how anonymity shields a project from those effects.
> That publicity isn't causally connected to success is belied by the existence of the advertising industry.
The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
To phrase it plainly: which first-time founders do you think Paul Graham or Keith Rabois would more likely fund: those who aspire to solve a problem in the world that they care passionately about, or those looking for money or fame? Last time I checked, the latter case would be seen as a strong negative. The appeal-to-authority argument doesn't apply in this situation, because the VC's portfolio performance is causally related to how accurately they predict the future success of a company.
On the scale of a smaller project like this, a common failure mode is for a maintainer to stop caring about the project and go to the next thing that motivates them. Someone else may attempt to use the code or project without understanding the theory behind it. And even worse: every time this happens is a signal that this is acceptable.
AI is a different beast. Software bugs with big AI systems will become more costly, and eventually deadly. Unfortunately I’m not sure what can be done about it without a global totalitarian regime to ban its use entirely (which is not an idea I support anyway). Eventually the broken clock will be right and some profit-driven AI project will succeed in making the world a not better place, if we are even around to notice :).
I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I think someone should train ChatGPT or similar to argue or teach traditional AGI Philosophy/Ethics and hopefully that will move the needle somewhat more than the OpenAI nannyism we have now.
> The argument was about publicity as a reward motivator, not publicity itself, as a causal relation to success.
That the causal model supports publicity seeking leads us to ensemble models. When models are good for different reasons, the ensemble of the models ends up better than any individual model. Reinforcement learning research has shown you can successfully train an agent from decomposed reward signals by building an ensemble model on top of them.
The fact that the causal structure says publicity matters means that agents who recognize the importance of publicity, and contribute because of it, can still reasonably be expected to be part of the solution.
It is very common to see this talked about in terms of diversity improving solution quality when talking about it in the context of companies and it is generally considered a good idea to have a diverse team as a consequence.
Anyway, I'm mostly responding because I disagree with the a priori declaration that all who disagree are attacking a straw man.
I think that was overconfident, because the causal structure of publicity and its relation to outcomes disagrees with that.
> Which first-time founders do you think Paul Graham or Keith Rabois would more likely fund: Those who aspire to solve a problem with the world that they care passionately about?
It is worth reflecting on the fact that the founder of OpenAI has had the strongest possible endorsement from Paul Graham. He was claimed to be among the greats before his successes: Paul Graham put him among Steve Jobs and Elon Musk. When Paul Graham stepped down from YCombinator, he was so convinced of Sam's skills that he put Sam in his place. Later Sam started OpenAI.
> I would advise deeper thought into these topics when convenient. Read Nick Bostrom’s Superintelligence book or watch his talks, at least one of which was at Google HQ.
I've read Superintelligence, the Sequences, PAIP, AIMA, Deep Learning, Reinforcement Learning, and Theory of Games and Economic Behavior, taken a course in control theory, and read a book about evolutionary algorithms. I've also built systems using the techniques from each of these, with the exception of Superintelligence and much of the Sequences - the main exception being the parts of the Sequences that deal with Bayesian reasoning, which I did implement and like, though I disagree with that community about its optimality, because the conditions of the ledger arguments aren't true in the real world. In practice, Bayesian approaches are like trying to build a sports car for a race: you get beaten even though you are doing the fastest thing, because the fastest thing isn't as fast as the slower methods.
Anyway, the combinatorics of multi-step multi-agent decision problems implies a lot of problems for Bostrom's and Yudkowsky's positions on the limits of what intelligence can hope to achieve. I don't find them to be the most formidable thinkers on this subject. In the case of Yudkowsky, he admits this, saying that he finds Norvig to be more formidable than he is. And Norvig disagreed with him on AI risk in exactly the context where I also disagree, and for the same reason. To make sure you get the point, I'll speak in terms of Bostrom's analogies: notice that there is, in fact, a speed limit - the speed of light. Well, what Norvig notices, what I also notice, and what Bellman noticed when he coined the term combinatorial explosion, is that intractability is an actual issue that you need to confront. It isn't something you can hand-wave away with analogy. We don't have enough atoms in our universe.
This is why we get dual-mode systems, by the way. Not just in humans: notice that it happens in chess engines too. The general solver provides the heuristic, which must have error; then the specific solver uses the heuristic to improve, because it is in a more specific situation. Most of the people in the AI-risk camp are pretty Yudkowskian. They dwell for long periods of time on overcoming the biased heuristic. For sure, this makes them more intelligent, but it misinforms them when they try to make inferences about general intelligence based on the tractability of specific situations. It is because of, not despite, the intractability that they find such evidence of tractability.
BTW, Bellman actually coined the term curse of dimensionality [1]; I got that confused with combinatorial explosion since it is a synonym in the contexts I typically encounter it [2].
OpenAI has a pretty good introduction to the Bellman equations in their Spinning Up in RL lessons [3]. Sutton's work in Reinforcement Learning also talks about Bellman's work quite a bit. Though Bellman was actually studying what he called dynamic programming problems, his work is now considered foundational in reinforcement learning.
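For anyone who hasn't seen the Bellman optimality equation in code form, here's a tiny value-iteration sketch on a made-up two-state MDP (the transition and reward numbers are invented purely for illustration):

```python
import numpy as np

# Toy illustration of the Bellman optimality equation:
#   V(s) = max_a sum_s' P(s'|s,a) [ R(s,a) + gamma * V(s') ]
# The 2-state, 2-action MDP below is entirely made up for this sketch.
P = np.array([                      # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],       # action 0
    [[0.5, 0.5], [0.0, 1.0]],       # action 1
])
R = np.array([                      # R[a, s] = expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9
V = np.zeros(2)
for _ in range(200):                # value iteration: repeat the Bellman backup
    Q = R + gamma * np.einsum("ast,t->as", P, V)  # Q[a, s]
    V = Q.max(axis=0)
print(V)  # fixed point of the Bellman optimality operator
```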
Uh, and for the dual mode observations the person that brought that to my attention was Noam Brown, not Bellman or Norvig. If you haven't already checked out his work, I recommend it above both Norvig and Bellman. He has some great talks on Youtube and I consider it a shame they aren't more widely viewed [4].
The summary includes a dangerous thought. For example: why North Korea develops a nuclear bomb is not important, just that they do.
But only the why makes it problematic.
Noble? You're anthropomorphising machine learning. One possible motivation would be to train a model for its own sake, instead of training a model in order to create publicity around a model being trained.
I think you're misreading, nobody is anthropomorphizing anything other than the very 'anthro' component of the system we're talking about - the people distributing the funding.
Wonder if someone would be willing to start an open source project where we could crowdsource donations for training, and people could possibly donate their GPU usage for it.
My only problem with Stable Horde is that their anti-CP measure involves checking the prompt for words like "small", meaning I can't use an NSFW-capable model with certain prompts (holding a very small bag, etc.). That, and seeing great things in the image ratings and being unable to reproduce them because it doesn't provide the prompt.
A few million dollars. Kickstart the project: get $100 a head from 100,000 backers. Also check with Uncle Sam and see if there are any grants that can be used for this. Start a campaign and get rich, concerned people to donate. Jeff may also want to show AWS can train AI too, so maybe even get a break there, and Amazon can get some nice PR. The list of possibilities seems extensive given the price tag of $12MM and the upside of a fully public GPT.
At Voyager's speed it would take approximately 749,000,000 years to reach Canis Major Dwarf. OpenAI was founded in 2015, so it has been eight years. 749,000,000 - 8 = 748,999,992. Eight years is about 1.07e-6% of that random astronomical timescale; rounded, that is about, uh, 0.00%ish.
I mean, don't get me wrong. It is a very expensive project. It just isn't astronomical. Anyone reading this and thinking - oh I could never do that even in hundreds of millions of years - that would be wrong. If you won the lottery or just made good financial decisions you could do a project comparable to this instead of getting a very nice house in the Bay Area.
According to Christopher Potts (Stanford Professor and Chair, Department of Linguistics, and Professor, by courtesy, Department of Computer Science), training a large language model costs about 50 million [1].
Yeah, this is way off, unless you count the salaries of everyone involved for a few years in the lead-up while writing the software that ended up being used.
As a former AI accelerator employee (laid off), I'm kind of happy I was laid off because I realistically don't see a need for specialized hardware anymore.
Large companies can afford Nvidia. Nvidia's software stack is best in class. There's no business need here and the model execution is increasingly becoming possible on single consumer GPUs.
The only place where I see specialized chips excelling is on the edge or if they are truly revolutionary (in which case they're only an acquisition target for Nvidia).
The truth is... The large language models are likely excessively large.
Power is the main reason to do custom ASICs. I’d be curious as to your opinion of Recogni given they are claiming a 10x power reduction per unit compute.
Unfortunately, I've worked at several players which promise power reductions. It doesn't matter though. People don't care about cost at this point. If you are cost-sensitive you're not doing the kind of revolutionary AI work these companies need to create a competitive moat. And once your model works on NVIDIA and is trained, how much are you going to spend on ML engineers to make it work on something else? Because that cost better be less than the marginal cost reduction on electricity. Plus, NVIDIA et al will likely get more and more efficient.
The only exception is running things on mobile. There is demand for porting models to run natively on mobile, and somewhat reasonable support for doing this.
No doubt the corporate large language models will use it to make language models that are 10x bigger. However, at least the public will have access to 175B parameter language models which are much more sophisticated than the 6B or so parameter models consumer video cards can currently run.
This may be a flawed approach, but an interesting idea would be to use the current models as a preprocessor to generate a huge "labelled" dataset of inputs and outputs, and then use that more accurate and specific data to train a smaller model that would fit.
It likely wouldn't have nearly as much general knowledge since the data just wouldn't be there but the behaviour could be similar?
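As a rough sketch of what that could look like - assuming you use an existing OPT checkpoint through Hugging Face Transformers as the teacher; the prompts and file format are placeholders for whatever task you actually care about:

```python
# Sketch of using a big model to generate a "labelled" dataset that a smaller
# model is later fine-tuned on. Assumes Hugging Face Transformers is installed;
# the prompts and output file are placeholders.
import json
from transformers import pipeline

teacher = pipeline("text-generation", model="facebook/opt-6.7b")

prompts = ["Summarize: ...", "Translate to French: ..."]  # your real task inputs

with open("distilled_dataset.jsonl", "w") as f:
    for p in prompts:
        out = teacher(p, max_new_tokens=128, do_sample=True)[0]["generated_text"]
        # generated_text includes the prompt; crude strip by character length
        f.write(json.dumps({"input": p, "target": out[len(p):]}) + "\n")

# A smaller model (e.g. opt-1.3b) could then be fine-tuned on this file with the
# usual causal-LM objective; whether the behaviour transfers is an open question.
```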
This is cool! But I wonder if it's economical when using cloud hardware. The author claims 1.12 tokens/second on the 175B parameter model (arguably comparable to GPT-3 Davinci). That's about 100k tokens a day on the GCP machine the author used. Someone double-check my numbers here, but given the Davinci base cost of $0.02 per 1k tokens and the GCP cost for the hardware listed ("NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD") coming up to about $434 on spot instance pricing, you could simply use the OpenAI API and generate about 723k tokens a day for the same price as running the spot instance (which could go offline at any point due to it being a spot instance).
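Here's the back-of-envelope, assuming the $434 spot figure is per month (which is what makes the ~723k/day number work out):

```python
# Back-of-envelope for the comparison above (numbers taken from this thread).
tok_per_sec = 1.12                       # FlexGen OPT-175B throughput
tok_per_day_local = tok_per_sec * 86400  # ~96.8k tokens/day

spot_cost_per_day = 434 / 30             # ~$14.5/day, assuming $434 is monthly
davinci_price = 0.02 / 1000              # $ per token
tok_per_day_api = spot_cost_per_day / davinci_price  # ~723k tokens/day

print(round(tok_per_day_local), round(tok_per_day_api))
```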
Running the fine-tuned versions of OpenAI models are approximately 6x more expensive per token. If you were running a fine-tuned model on local commodity hardware, the economies would start to tilt in favor of doing something like this if the load was predictable and relatively constant.
You've made one huge mistake: Davinci's $0.02 is not just per 1k tokens generated but also per 1k context tokens consumed. So if you generate 50 tokens per request with 1k of context, the price is actually 20 times as large, at $0.40 per 1k tokens generated - much less palatable, costing 3 times as much as the cloud-hosted version of this.
And that's not even taking into account the gigantic markup cloud services have.
Most of the computational cost of producing an output token is spent on consuming input tokens (including previous output tokens that are fed back in); only the final unembedding matrix could be eliminated if you don't care about the output logits for the context.
So it's not correct to only modify OpenAI's prices to account for the ratio of context tokens to output tokens. Both of them get multiplied by 20 (if that's what your ratio is).
Thanks for running the cloud numbers on this. I ran some DIY numbers and they indicate less than a week to break even with the cloud, including all hardware and electricity costs. The cloud seems stupid expensive compared to running your own hardware for this kind of task.
The cloud is always stupid expensive compared to running your own hardware for almost any sort of task that isn't highly variable upon one or more axis (cpu, ram, etc), but less than a week to break even is truly shocking.
The cloud has been cheaper to train large models on for a couple of years now. Compare buying an 8xA100 server vs renting one on Lambda Labs: at least 3 years to break even - and that's if you are using it non-stop 24/7. Longer if not.
I'd suggest finding a cheaper vendor if that is the lowest price you can get for an 8xA100 server. We spend a lot on both and colo our servers so I've definitely done the math!
Six months ago I contacted 12 different vendors; the quotes for four 8xA100 servers ranged from 130k to 200k each. You probably wouldn't want to buy from the low-end vendors.
Keep in mind, there are three important advantages of cloud:
1. You only pay for what you use (hourly). What is utilization of your on-prem servers?
2. You don't have to pay upfront - easier to ask for budget
3. You can upgrade your hardware easily as soon as new GPU models become available.
I know how much we paid and it is substantially less than what you were quoted - very likely from one of the 12 providers you contacted.
It is likely you just didn't realize how much margin these providers have and did not negotiate enough. How else do you think cloud providers are able to afford the rates they are giving? The way you describe it, places like Coreweave are operating as a charity. That isn't true - they just got better prices than you.
Our inference setup is 7 figures, has been running for a while (with new servers purchased frequently along the way) and there have been no issues - the cards, CPU, RAM, are all top of the line server hardware.
1. For inference (which is 80%+ of our need) our utilization is 100% 24/7/365. For stuff that is variable (like training) we often do use cloud - as I mentioned we do both.
2. I am the CEO so I am not sure who I'm asking for budget?
3. At this point we would have paid more for cloud than what we spent purchasing our own hardware. There is nothing stopping us from getting new hardware or cloud with newer cards while still getting to own our current hardware. In fact since our costs over the last year were lower due to us buying our own hardware it is actually easier for us to afford newer cards.
Yes, obviously cloud providers get their hardware at a fraction of a cost I'm quoted, they are ordering thousands of servers. I was only buying four. No one would negotiate with me, I tried. I suppose if I had a 7 digit budget I could get a better deal.
I was mainly talking about training workloads, inference is a different beast. I'm actually surprised you have 100% inference utilization - customer load typically scales dynamically, so with on-prem servers you would need to over-provision.
CEOs don't usually order hardware, they have IT people for that, with input from people like me (ML engineers) who could estimate the workloads, future needs, and specific hw requirements (e.g. GPU memory). And when your people come to you asking for budget, while you're trying to raise the next round, you're more likely to approve the 'no high upfront cost' option, right?
In my situation, when asked about buy vs rent my initial reaction was "definitely buy", but when I actually looked at the numbers, the 3-year break-even period, no upfront costs for cloud, and no need to provision storage and networking made it an easy recommendation. The cost of cloud GPUs has come down dramatically in the last couple of years.
Though I would like to have at least a couple of local GPU servers for quick experimentation/prototyping, because sometimes the overhead of spinning up a new instance and copying datasets is too great relative to the task.
> I suppose if I had a 7 digit budget I could get a better deal.
We got our "deal" when buying just a single server and have since bought many more with the same provider. We didn't spend 7 figures all at once, we did it piece-meal over time. There is nothing stopping you from getting much better prices.
> I'm actually surprised you have 100% inference utilization - customer load typically scales dynamically, so with on-prem servers you would need to over-provision.
It is pretty easy to achieve 100% inference utilization if you can find inference work that does not need to be done on-demand. We have a priority queue and the lower priority work gets done during periods with lower demand.
> CEOs don't usually order hardware, they have IT people for that, with input from people like me (ML engineers) who could estimate the workloads, future needs, and specific hw requirements (e.g. GPU memory).
Judging by this conversation it seems like "people like you" may not be the best people to answer this question since the best hardware quote you could get was at a >100% markup! At a startup that specializes in ML research and work the CEO is going to be intimately familiar with ML workloads, needs, and hardware requirements.
> And when your people come to you asking for budget, while you're trying to raise the next round, you're more likely to approve the 'no high upfront cost' option, right?
If the break even point is 6-7 months and our runway is longer than 6-7 months why would this matter?
> the best hardware quote you could get was at a >100% markup!
Now I'm really curious - if you can share - how much did you pay, and when was it? Are you talking about 40GB or 80GB cards? How did you negotiate? Any attempts I made were shut down with a simple "no, that's our final price". What's the secret?
> At a startup that specializes in ML research and work the CEO is going to be intimately familiar with ML workloads, needs, and hardware requirements.
I work at a startup which builds hardware accelerators, primarily for large NLP models. A large part of my job is to be intimately familiar with ML workloads, needs, and hardware requirements. Our CEO definitely doesn't have enough of that knowledge to choose the right hardware for our ML team. In fact, even most people on our ML team don't have deep, up-to-date knowledge about GPUs, GPU servers, or GPU server clusters. I happen to know because I've always had an interest in hardware and I've been building GPU clusters since grad school.
As mentioned in another comment, the contract has very clear language not to share it - likely because they are offering different prices to different companies.
So I don't feel comfortable sharing any specifics, especially since this account is directly tied to my name.
With that being said, the negotiation process was pretty straightforward:
- Emailed several vendors telling them we are a small startup, we are looking to make many purchases, but right now we are starting with one. We told everyone our purchasing decision was solely based on cost (given equivalent hardware) and to please put your best quote forward.
- Got back all of our prices. Went to the second cheapest one and told them they were beat and offered them the ability to go lower, which they did. We went with that vendor.
- For our next purchase, we went to the original lowest vendor (who got beat out), told them they lost out to price, and if they can go lower than that we would go with them and continue to give them business moving forward. They went quite a bit lower than what they originally offered, and what the vendor we first purchased from gave. We bought our second order from them and have used them ever since.
> We got our "deal" when buying just a single server and have since bought many more with the same provider. We didn't spend 7 figures all at once, we did it piece-meal over time. There is nothing stopping you from getting much better prices.
If it is as easy as you make it sound, why would you not just share the vendor name? I personally would love an 8xH100 machine for transformer experiments, but $100k+ pricing makes it a non-starter.
I do not know anyone offhand with a 200+GB RAM home computer. The GPU is not all that is needed; you need to keep the parameters and other stuff in memory too.
256GB of DDR4 RDIMMs only costs about $400 right now, $200 for DDR3. Not uncommon in homelabs.
I don't think 200GB of RAM is actually required, that's just what that cloud VM was spec'd with. Though the 175B model should see benefit from RAM even beyond 200GB.
Cloud accelerators carry a huge price premium because there aren't very many of them available and they aren't as fungible as CPUs. Comparing to a local GPU would likely be favorable for the local machine.
I'm afraid that, unlike proprietary APIs and Petals, this system can't be used for single-batch inference of 175B models with interactive speeds - the thing you actually need for running ChatGPT and other interactive LM apps. See https://news.ycombinator.com/item?id=34874976
Got the opt-6.7b chatbot running on a Windows machine with a 3090 in mere minutes. The only difference was to install the CUDA PyTorch build with `pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117`, just like in Stable Diffusion's case.
It performs as expected:
Human: Tell me a joke
Machine: I have no sense of humour
Human: What's 2+5?
Machine: I cannot answer that.
6.7b is pretty small, no? Do you even need offloading for that on a 3090? I'd be curious to see what's needed to run opt-30b or opt-66b with reasonable performance. The README suggests that even opt-175b should be doable with okay performance on a single NVIDIA T4 if you have enough RAM.
It is entirely possible to run 6.7B parameter models on a 3090, although I believe you need 16-bit weights. I think you can squeeze a 20B parameter model onto the 3090 if you go all the way down to 8 bits.
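Napkin math for what fits in 24GB of VRAM, counting only the weights (KV cache and activations add more on top):

```python
# Rough VRAM needed just for the weights (ignores KV cache and activations).
def weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(6.7, 16))   # ~13.4 GB -> fits on a 24GB 3090 in fp16
print(weight_gb(20, 8))     # ~20 GB   -> just about fits in int8
print(weight_gb(30, 4))     # ~15 GB   -> 30B could fit with 4-bit compression
```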
Not exactly the answer to your question, but I just run ML/CUDA workloads directly on Windows. PyTorch works fine.
I did not need multi-GPU training so far (I just run experiments in parallel), so I'm unsure about the state of that. Additionally, torchvision does not support GPU video decoding on Windows. Those are the only two limitations I've faced so far.
- you need a patch (WSLHostPatcher) to expose ports for services in WSL to the network
- the Virtual Hard Disk (vhdx) does not free unused space easily and it can grow quickly. I ended up just symlinking my code and dataset folders to mounts, not saving a lot of data inside the vhdx
- beware of upgrades etc. I think I nuked my WSL 2x due to config issues. Having all your code/data on mounts also makes this easy.
related to ML/CUDA:
- how you install PyTorch + CUDA matters. I ended up just installing from the `fastchan` conda channel and not touching it. Not ideal, but it works
- Don't forget to configure the RAM allocation in case you need a lot
- I haven't tried running a CUDA Docker on WSL. May be an easier way to do this.
Running on Windows directly is also an option. I chose to run on WSL because most learning resources/documentation refer more to Linux installs and setups.
Worth mentioning though that the highlighted figures (1.12 tok/s for OPT-175B for "FlexGen with Compression") are for inputs of 512 tokens and outputs of 32 tokens.
Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.
Still though, until reading this I had no idea that running such a model on-device was remotely feasible!
> transformer memory requirements scale with the square of sequence lengths
Not true, see: Flash Attention. You can losslessly calculate the attention in blocks using a little math trick. Essentially each subsequent block "corrects" the denominator of the last block's softmax calculation. At the end you have a perfectly* accurate softmax. Since you don't need to keep the whole sequence in memory to perform the softmax, your memory now scales linearly with respect to sequence length, and due to the lower memory bandwidth requirements and increased kernel fusion the operation also tends to be faster.
* While mathematically the calculation ends up exactly the same, in practice the result ends up slightly different due to the whims of FP32 and FP16 inaccuracies, and because the "max" used to calculate the softmax in a numerically stable way is computed on a per-block basis. It doesn't significantly affect training or validation loss, though.
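The denominator-correction trick is easy to demo outside a GPU kernel. Here's a tiny NumPy sketch of the running-max/running-sum bookkeeping - only the softmax part, not the full FlashAttention tiling of Q/K/V:

```python
import numpy as np

def blockwise_softmax(x, block=4):
    # One pass over blocks, keeping a running max m and running denominator l.
    m, l = -np.inf, 0.0
    for i in range(0, len(x), block):
        b = x[i:i + block]
        m_new = max(m, b.max())
        # Correct the old denominator for the new max, then add this block.
        l = l * np.exp(m - m_new) + np.exp(b - m_new).sum()
        m = m_new
    return np.exp(x - m) / l  # identical (up to fp error) to a full softmax

x = np.random.randn(10).astype(np.float32)
full = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(blockwise_softmax(x), full))
```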
> Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.
You can use tricks to keep the sequence length down even if the conversation goes on for a long time. For example, you can use the model to summarize the first n-1 lines of the conversation and append the last line to the summary as is.
I don't have any sources to refer to, but "text summarization" is one of the common NLP tasks that LLMs are often benchmarked on. All of these general-purpose LLMs will be able to do a decent job at text summarization (some, such as ChatGPT, will be able to do zero-shot summarizations at high quality, whereas others need to be fine tuned for the task). If your problem is that you are feeding a large amount of text to the model and that is slow/expensive, then summarization will obviously remediate that issue. After summarizing most of the input text you still need to feed in the latest input without summarization, so for example if the user asks a question, the LLM can then accurately answer that question. (If all of the input goes into summarization, that last question may not even appear in the summarization, so results will be crap.)
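A minimal sketch of the "summarize the older turns, keep the latest turn verbatim" idea, with a hypothetical `generate()` helper standing in for whatever model call you're using:

```python
# Sketch of keeping context short by summarizing older conversation turns.
# `generate` is a hypothetical helper wrapping whatever LLM you're calling.
MAX_HISTORY_CHARS = 4000  # crude stand-in for a token budget

def build_prompt(turns: list[str]) -> str:
    history, latest = turns[:-1], turns[-1]
    history_text = "\n".join(history)
    if len(history_text) > MAX_HISTORY_CHARS:
        history_text = generate(
            "Summarize this conversation in a few sentences:\n" + history_text
        )
    # The latest user turn is appended verbatim so it never gets lost.
    return f"{history_text}\n{latest}\nAssistant:"
```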
Note that the authors report the speed of generating many sequences in parallel (per token):
> The batch size is tuned to a value that maximizes the generation throughput for each system.
> FlexGen cannot achieve its best throughput in [...] single-batch case.
For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and get a good performance _per token_.
However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (so it takes <= 1 sec/token to do a generation step). I'm not sure if this system can be used for that, since our measurements [1] show that even the theoretically-best RAM offloading setup can't run the single-batch generation faster than 5.5 sec/token due to hardware constraints.
The authors don't report the speed of the single-batch generation in the repo and the paper.
I spoke with the authors of the paper; the leftmost points in Figure 1 were generated with batch-size 1, indicating ~1.2x and ~2x improvements in speed over DeepSpeed for 30B and 175B models respectively. For reference, this is speeding up from ~0.009tokens/s to about ~0.02tokens/s on 175B.
These results are generally unimpressive, of course. Most of the improvements at that point are attributable to the authors making use of a stripped down library for autoregressive sampling. HN falling for garbage once again...
Calling this garbage is absolutely wild. The authors make it very clear that this is optimized for throughput and not latency. Throughput focused scenarios absolutely do exist, editorializing this as "running large language models like ChatGPT" and focusing on chatbot applications is the fault of HN.
It's also a neat result that fp4 quantization doesn't cause much issue even at 175b, though that kinda was to be expected.
While I agree that throughput-focused scenarios exist and this work may be valuable for them, I still think that the repository can be improved to avoid "overselling".
The fact that the FlexGen's single-batch generation performance is much worse is unclear to most people not familiar with peculiarities of LLM inference and worth clarifying. Instead, the readme starts with mentioning ChatGPT and Codex - projects that both rely on single-batch inference of LLMs at interactive speeds, which is not really possible with FlexGen's offloading (given the speed mentioned in the parent comment). The batch sizes are not reported in the table as well.
Seeing that, I'm not surprised that most HN commenters misunderstood the project's contribution.
I just tried to run the example in the README, using the OPT-30B model. It appeared to download 60GiB of model files, and then it attempted to read all of it into RAM. My laptop has "only" 32GiB of RAM so it just ran out of memory.
You have to change the --percent flag. It takes some experimentation. The format is three pairs of 0-100 integers, one for parameters, attention cache and hidden states respectively. The first zero is percent on GPU, the second one is percent on CPU (system RAM), and the remaining percentage will go on disk.
For disk offloading to work you may also have to specify --offload-dir.
I have opt-30B running on a 3090 with --percent 20 50 100 0 100 0, although I think those could be tweaked to be faster.
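Putting the flags from this thread together, a full invocation would look something like `python apps/chatbot.py --model facebook/opt-30b --percent 20 50 100 0 100 0 --offload-dir /path/to/fast/ssd` - the offload path is just an example, and it's worth checking the README for the exact flag spellings.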
128GB, but by turning on compression I managed to fit the whole thing on the GPU. I did try it off a mix of RAM and SSD as well, and it was slower but still usable. Presumably disk speed matters a lot.
Well just got some more sticks. While I wait for RAM to arrive, will try with compress_weight and compress_cache. If you're in any discord or any other space where people are tinkering with this, would love to join!
After turning on compression I was able to fit the whole thing in GPU memory and then it became much faster. Not ChatGPT speeds or anything, but under a minute for a response in their chatbot demo. A few seconds in some cases.
FWIW I was able to load the OPT-6.7B model and play with it in chatbot mode. This would not have been possible without the offloading, so... cool stuff!
I have recently written a paper on understanding transformer learning via the lens of coinduction & Hopf algebra.
https://arxiv.org/abs/2302.01834
The learning mechanism of transformer models was poorly understood; however, it turns out that a transformer is like a circuit with feedback.
I argue that autodiff can be replaced with what I call in the paper Hopf coherence which happens within the single layer as opposed to across the whole graph.
Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.
I'm working on a next gen Hopf algebra based machine learning framework.
This also means local fine-tuning is possible. Expect to see an explosion of new things like we did with Stable Diffusion, limited to some extent by the ~0.7 order of magnitude more VRAM required.
I tried to figure out how to do GPGPU stuff as a total beginner in Rust on Apple Silicon.
I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends. It also didn't seem to make much of a difference because while there are crates for both that seemed relatively well-maintained/fleshed out, I couldn't figure out how exactly to just pull one down and plug them into a higher level library (or find said higher level library all together).
Have you had any luck? In my experience, it's basically Python or bust in this space despite lots of efforts to make it not that way?
I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.
> I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends.
If you want cross-platform compatibility (kinda), go for OpenCL, if you want the best performance go for Metal. Both use a very similar language for kernels, but Metal is generally more efficient.
> Have you had any luck?
Not in ML, but I'm doing a lot of GPGPU on Metal, I recently started doing it in Rust. A bit less convenient than with Swift/Objective-C, but still possible. Worst case you'll have to add an .mm file and bridge it with `extern "C"`. That said, doing GPGPU is not doing ML, and most ML libraries are in Python.
> I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.
A shader is basically a function that runs for every element of the output buffer. We generally call them kernels for GPGPU, and shaders (geometry, vertex, fragment) for graphics work. You have to write them in a language that kinda looks like C (OpenGL's GLSL, DirectX's HLSL, Metal's MSL), but is optimized for the SIMT execution model of GPUs.
Learning shaders will let you run code on the GPU; to do ML you also need to learn what tensors are, how to compute with them on the GPU, and how to build ML systems using them.
I recommend ShaderToy [0] if you want a cool way to understand and play with shaders.
> General-purpose computing on graphics processing units
> machine learning
Could you expand on why this is the case please? I thought machine learning was basically brute forcing a bunch of possibilities and keeping track of how different inputs "score", then ranking them accordingly to help make educated predictions later.
> GPGPU (General-Purpose Graphics Processing Unit) and machine learning are not the same thing, although they can be related in some ways.
> GPGPU refers to using the parallel processing power of graphics processing units (GPUs) to perform computations beyond graphics rendering. This involves using the massive number of cores in modern GPUs to accelerate tasks such as scientific simulations, numerical analysis, and other data-intensive applications. Essentially, GPGPU involves leveraging the processing power of GPUs for general-purpose computing tasks, not just for graphics processing.
> On the other hand, machine learning involves using algorithms and statistical models to enable computer systems to learn from data and improve their performance on a specific task. It involves feeding large amounts of data to a machine learning algorithm so that it can learn to recognize patterns and make predictions or decisions based on that data.
> While GPGPU can be used to accelerate the computation required in machine learning tasks, they are not the same thing. Machine learning is a specific type of computation, whereas GPGPU is a technique for accelerating computation in general. Additionally, GPGPU can be used for a wide variety of computational tasks, not just machine learning.
They do; however, it's not fully shared at the process level - the GPGPU API has to explicitly support mapping a buffer from the process's virtual memory space to the GPU.
I looked it up and turns out OpenCL also supports zero-copy buffers, so I edited my comment accordingly!
At a very high level yes. There is also the very important step of efficiently laying out data in the GPU memory to compute tensor values in the kernels.
Yes, according to Apple official documentation, OpenCL was deprecated as of macOS 10.14. It is reported to work, including on Apple Silicon (M1 and M2), but don't expect any updates.
If you have Pytorch configured correctly, this should "just work" for a lot of the smaller models. It won't be a 1:1 ChatGPT replacement, but you can build some pretty cool stuff with it.
> it's basically Python or bust in this space
More or less, but that doesn't have to be a bad thing. If you're on Apple Silicon, you have plenty of performance headroom to deploy Python code for this. I've gotten this library to work on systems with as little as 2gb of memory, so outside of ultra-low-end use cases, you should be fine.
Whatever your Pytorch install is designed to accelerate. I've got Ampere-accelerated Pytorch running it on my ARM server, I assume MPS is used on compatible systems.
This is meant to run on GPUs with 16GB RAM. Most M1/M2 users have at least 32GB (unified memory), and you can configure a MBP or Mac Studio with up to 96/128GB.
The Mac Pro is still Intel, but it can be configured with up to 1.5TB of RAM, you can imagine the M* replacement will have equally gigantic options when it comes out.
If you look closely there's 16GB of GPU memory and over 200GB of CPU memory. So none of the currently available M* have the same kind of capacity. Let's hope this changes in the future!
Apple silicon has unified memory, the GPU has access to the entire 32/64/96/128GB of RAM. It's part of the appeal.
I would really like to see how stuff performs on a Mac Studio with 128GB memory, 8TB SSD (at 6GB/s), not to mention the extra 32 "neural engine" cores. It seems the performance of these machines has been barely explored so far.
I think that here the main bottleneck is data movement. If you are streaming weight data from a 6GB/s SSD you'll get under 10% of the performance shown for 3090 (which will be moving data at PCIe 4 speeds of 64GB/s).
Once in unified memory the weights are accessible at about half the rate they are on the 3090 (400GB/sec on M2 Max vs 936GB/sec on 3090).
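Back-of-envelope, assuming single-batch generation is bandwidth-bound and every weight gets touched once per token (a simplification - overlapping and batching is exactly what FlexGen tries to exploit):

```python
# Rough lower bound on seconds/token if all weights stream over a given link.
def sec_per_token(params_billion, bits, bandwidth_gb_s):
    weight_gb = params_billion * bits / 8  # GB of weights touched per token
    return weight_gb / bandwidth_gb_s

for name, bw in [("6 GB/s SSD", 6), ("M2 Max unified mem", 400), ("3090 VRAM", 936)]:
    print(name, round(sec_per_token(175, 4, bw), 2), "s/token for OPT-175B @ 4-bit")
```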
Any chance these work on CPUs with any acceptable performance?
I have a 10-core 20-thread monster CPU, but didn't bother with a dedicated GPU because I can't control something as simple as its temperature. See the complicated procedure that only works with the large proprietary driver here:
I don't know about these large models but I saw on a random HN comment earlier in a different topic where someone showed a GPT-J model on CPU only: https://github.com/ggerganov/ggml
I tested it on my Linux machine and MacBook Air M1, and it generates tokens at a reasonable speed using CPU only. I noticed it doesn't quite use all my available CPU cores, so it may be leaving some performance on the table, not sure though.
The GPT-J 6B is nowhere near as large as the OPT-175B in the post. But I got the sense that CPU-only inference may not be totally hopeless even for large models if only we got some high quality software to do it.
There's also the Fabrice Bellard inference code: https://textsynth.com/technology.html. He claims up to 41 tokens per second on the GPT-Neox 20B model.
Your CPU gets maybe 700-800 gflops depending on your all-core frequency (fp32 because you don't have Sapphire Rapids.) The T4 benchmarked would be crunching what it can at ~65 tflops (fp16 tensor.) Newer GPUs hit 300 tflops (4090) or even nearly 2 petaflops (H100).
To give you an idea of the order of magnitude of compute difference. Sapphire Rapids has AMX and fp16 AVX512 to close the gap a little, but it's still massive.
With what, 50GB/s memory bandwidth? That's no monster. The two consumer GPUs in my machine both do 1TB/s and are still bottlenecked on memory bandwidth.
> only works with the large proprietary driver here
In practice, nothing works without the proprietary driver so this isn't specific to temperature. Also the setting you're looking for is almost certainly `nvidia-smi -pl $watts` for setting the power limit, not whatever that wiki gives you. GPU temperature = ambient temperature + (power limit)*(thermal resistance of cooler)
The other answers give you a few of the current solutions.
In the long term I am hoping that JAX (/XLA) will get better support for the CPU backend of their compiler and in particular, use SIMD and multicore better than it currently does.
It is very doable (just low priority), and it would mean that a lot of models could get close to optimal CPU performance out of the box, which would be a step forward for accessibility.
> Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
Is FlexGen able to take advantage of multiple hundreds of GB of system memory? Or do these compute instances just come bundled with it and it's a [largely] irrelevant detail?
Interesting, though apparently the OPT175B model is 350GB:
> You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512GB memory.
(Scroll down to the second "Note", not far from the top)
I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory. Definitely curious about what FlexGen's underlying strategy translates to in terms of actual data paths.
> Interesting, though apparently the OPT175B model is 350GB:
Only in FP16. In the paper they use int4 quantization to reduce it to a quarter of that. In addition to the model weights, there's also a KV cache that takes up considerable amounts of memory, and they use int4 on that as well.
> I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory.
That's correct, but other approaches have done this as well. What's "new" here seems to be the optimized data access pattern in combination with some other interesting techniques (prefetching, int4 quantization, CPU offload).
I want to emphasize how fascinating I find it that going from 16-bit to 4-bit quantization results in negligible performance loss. That's huge. Is the original FP16 not compressed?
The allowance for this more granular quantization seems to suggest the "bottleneck" is in some other aspect of the system, and maybe until that is addressed, a higher fidelity quantization does not improve performance.
Or maybe it's the relative values/ratio between weights that is important, and as long as the intended ratio between weights can be expressed, the exact precision of the weights themselves may not be important?
Found an interesting paper on this below. There's doubtless heavy research underway in this area.
In my understanding, at a very high level and omitting many crucial details, the key is that when you have mainly largish matrix multiplications (as in transformers) well-behaved (mean zero uncorrelated random or so) quantization errors cancel out.
People do/did experiment with 1 or 2 bit compression of gradients/updates in the context of distributed training, but there it has been generally deemed useful to keep track of compression errors locally.
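A quick way to convince yourself of the cancellation argument: quantize the weights of a dot product to 4 bits and compare the error that actually reaches the output against the worst case where the per-weight errors never cancel (toy symmetric quantization here, not the group-wise scheme real systems use):

```python
import numpy as np

def quantize(w, bits=4):
    # Toy symmetric per-tensor quantization (real systems use per-group scales).
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
for d in [64, 256, 1024, 4096]:
    x = rng.standard_normal(d)
    w = rng.standard_normal(d)
    e = quantize(w) - w                     # per-weight quantization error
    actual = abs(x @ e)                     # error that reaches the output
    no_cancel = np.abs(x) @ np.abs(e)       # if the errors never cancelled
    print(d, round(actual / no_cancel, 3))  # shrinks roughly like 1/sqrt(d)
```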
Out of curiosity, why aren't we crowd sourcing distributed training of LLMs where anyone can join by bringing their hardware or data? Moreover find a way to incorporate this into a blockchain so there is full transparency but also add in differential privacy to protect every participant.
You can finetune with it. If you want a more generic framework you can use hivemind[1] which is what petals uses, but you'll have to create your own community for whatever model you're trying to train.
The problem here is that most people just don't have suitable hardware. Ideally, you'd want to load the entire model into a GPU, and most consumer-grade GPUs have nowhere near enough video memory. You'd need something like an A100 80GB GPU to be able to run a node in the potential blockchain. You can buy one of these cards for about 15k USD. Admittedly, that's not too far off from the price of a modern bitcoin ASIC miner, but it's still a healthy chunk of change.
And if you try to split the model across several GPUs then you'll have an issue of bandwidth as model parts would need to talk to each other (on the order of terabyte/second). At the moment, the only realistic way to contribute is just to provide feedback data for the RLHF training.
Wait until we have a game whose levels are procedurally generated in all respects - where the geometry of the world/landscape is procedurally generated through the interactions you have with the AI NPCs, whereby they generate the path - and there could be evil ones that direct you down a dark path when you piss them off... and the opposite...
I don't have access to 175B for comparison. In a vacuum, 30B isn't very good. In the neighborhood of GPT-NeoX-20B, I think, but not good. It repeats itself easily and has a tenuous relationship with the topic. It's still much better than anything I could run locally before now.
This seems like a great step; I’ve been able to run StableDiffusion locally, but with an older GPU none of the LLMs will run for me since I don’t have enough VRAM.
Oddly I don’t see a VRAM requirement listed. Anyone know if it has a lower limit?
Don't underestimate the value of batching even for personal use. You can get MUCH better results from a language model if you sample a couple outputs and choose the best to continue.
This kind of usage isn't especially economical for hosted use-- but for personal use it would mostly be using idle resources and you can get extra samples almost for free.
A bunch of people getting multiple completions and choosing which one they'd prefer to continue might make for some really useful training data too.
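With Hugging Face Transformers the extra samples are basically one argument; how you score the candidates (by hand, a heuristic, or a reward model) is up to you. A sketch with a small OPT model:

```python
# Sketch of sampling several completions locally and picking the best one.
# Assumes Hugging Face Transformers; opt-1.3b is just a small example model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

prompt = "Human: Tell me a joke\nAssistant:"
inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=4,  # the extra samples are nearly free on idle hardware
    max_new_tokens=60,
)
for i, seq in enumerate(outputs):
    completion = tok.decode(seq[inputs.input_ids.shape[1]:], skip_special_tokens=True)
    print(f"--- candidate {i} ---\n{completion}")
# Pick whichever candidate you prefer and append it to the prompt to continue.
```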
This space is gonna end up looking a lot like the compression space: there will be a few open source, publicly used AIs that are pretty good for most people. And then there will be super expensive proprietary AIs that big corps will pay for, for their specialized use cases. The only people who will even know about those specialized AIs' existence will be the type of people who need them, and everyone else in the world will think the best you can do is zip.
Possibly the price. On secondary markets like eBay, I've occasionally seen T4 cards for $500-600. Also, the form factor: the T4s are much smaller/shorter than a 3090/4090, so they'd be an easier fit in a server case.
A lot of 2U cases won't fit a consumer GPU. Furthermore, Tesla-equivalents are usually either significantly cheaper than their consumer counterpart (for last-gen and older GPUs) or similar in price with far more RAM.
I bought a bunch of Tesla P40s at a really low price compared to what 1080tis are going for.
In my case I had two systems with 10x 2080 Tis in them that were being used for ML stuff. But the memory limits were annoying, and when Ethereum mining really hit full swing I was able to sell the cards for a great price (even though 3xxx cards were out). I expected to replace them with faster modern cards with more memory at the same price later, but we really haven't gotten there yet: GPU prices are still super inflated and all except the most absurd cards are still memory-starved (for ML). In the interim it turned out to be really cheap to get K80s.
For sure. Consumer GPU prices have fallen dramatically but they're still competitive enough; meanwhile last-gen (or later) no-output GPUs with passive cooling and single 8-pin CPU-type connectors are insanely cheap. P40s are readily available for $190, less if you lowball eBay or Facebook sellers at volume discounts.
You can even find some weird retired Cirrascale servers on eBay that provide 8 high-speed PCIe lanes through risers with Tesla-specific connectors on a motherboard with tons of RAM (and a terrible CPU) and multiple PSUs for fractions of their release cost.
It's a great time to be buying 3-5 year old ML equipment for small businesses and hobbyists. I wonder if the prices will ever go up? Not that I'm interested in speculating, but it's a small slice of the market that I'm participating in... though maybe in a year it won't be. I imagine a lot of small businesses will bring ML/DL stuff internal for at least development/testing.
Old enterprise hardware has been a pretty good value for a long time. The target audience for the hardware doesn't tend to buy surplus/used and joe-average doesn't want a 5kw consuming rackmount windtunnel.
It's really interesting that these models are written in Python. Anyone know how much of a speed-up using a faster language here would give? Maybe it's already offloading a lot of the computation to C (I know many Python libraries do this), but I'd love to know.
Python is just the glue language. All the heavy lifting happens in CUDA, cuBLAS, cuDNN, and the like.
Most memory-saving optimizations come from using lower-precision numbers (float16 or less), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework, e.g. PyTorch.
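A hedged example of the lower-precision part, leaning on PyTorch/Hugging Face defaults (the checkpoint name is an assumption, picked just for illustration):

```python
import torch
from transformers import AutoModelForCausalLM

# float32 -> float16 roughly halves the memory needed for the weights;
# the framework handles the casting, no model code changes required.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",            # illustrative smaller OPT checkpoint
    torch_dtype=torch.float16,
)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B params, ~{n_params * 2 / 2**30:.1f} GiB in float16")
```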
For large models, there are two main ways folks have been optimizing machine learning execution:
1. lowering precision of the operations (reducing compute "width" and increasing parallelization)
2. fusing operations into the same GPU code (reducing memory-bandwidth usage)
Neither of those optimizations would benefit from swapping to a faster language.
Why? The typical "large" neural network operation runs on the order of a dozen microseconds to milliseconds. Models are usually composed of hundreds if not thousands of these. The overhead of using Python is around 0.5 microseconds per operation (best case on Intel, worst case on Apple ARM). So that's maybe a 5% net loss if things were running synchronously. But they're not! When you call GPU code, you actually do it asynchronously, so the language latency can be completely hidden.
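You can see that asynchrony directly with a quick timing check (a sketch assuming PyTorch and a CUDA GPU; sizes and iteration counts are arbitrary):

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(100):
    y = x @ x                    # each call just enqueues a kernel and returns
t1 = time.perf_counter()
torch.cuda.synchronize()         # now actually wait for the GPU to finish
t2 = time.perf_counter()

print(f"Python-side enqueue time: {t1 - t0:.4f}s")
print(f"time until GPU finished:  {t2 - t0:.4f}s")
```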
So really, all you want in an ML language is the ability to 1. change the type of the underlying data on the fly (Python is really good at this) and 2. rewrite the operations being dispatched to on the fly (Python is also really good at this).
For smaller models (i.e. things that run in the sub-microsecond world), Python is not the right choice for training or deploying.
Your view of "offloading" things to a faster language is wrong. It's already written in a fast language (C++ or CUDA). Python is just an easy to use way of invoking the various libraries. Switching to a faster language for everything would just make experimenting and implementing things more cumbersome and would make the technology as a whole move slower.
Python is mostly just glue code nowadays. All the data loading, processing, and computation are handled by low-level languages (C/C++); Python is there just to instruct those low-level libraries how to compose into one final computation.
Interesting article, I'll have to give that a try! :D
One thing to note: while there's clear value in running pre-trained model weights like OPT-175B, there are also potential downsides to using pre-trained models, such as the need to fine-tune the model for your specific task, potential compatibility/integration issues with your existing infrastructure, and the possibility that the pre-trained model may not perform as well as a model trained specifically on your data. Ultimately, the decision of whether to use a pre-trained model will come down to the outcomes; no harm in trying it out before you build from scratch, IMO.
But OpenAI's latest models (and a few others that are basically comparable) make that viewpoint increasingly obsolete, since they are so general and capable and can adjust to a given context on the fly.
So what makes sense now, in my opinion, is to keep going in that direction of generality: take advantage of their API, and otherwise work on open source efforts to reproduce the performance of those models, or come up with new techniques that can deliver the same capabilities with far less extreme resource needs.
Human: Can you write a song about the sun?
Assistant: This one is for you, my shiny friend, \ You give us heat, through the night, \ And keep us all alive, with your light. \ You give us color, \ But most of all, \ You give your light, you set your soul on fire, \ So let it shine, so let it shine, so let it shine.
Human: Amazing work!
Assistant: Thanks! Need help with anything else?
Human: Do you know anything about GPT-3?
Assistant: Of course! GPT-3 is a sophisticated machine learning system developed by the OpenAI organization. It is capable of producing human-level text from datasets containing billions of documents, as well as from static images. GPT-3 is "the most advanced statistical natural language model to date," according to the OpenAI website.
Human: How do you compare to GPT-3?
Assistant: GPT-3 is extremely impressive. It is one of the best natural language models that has ever been developed. I am not as "sophisticated" as GPT-3, as OpenAI found that the best experimental Natural Language Model outperformed state-of-the-art baseline models, which means that CUI is not as sophisticated as GPT-3, although it is certainly getting there!
Human: What is CUI?
Assistant: CUI is a natural language processing model created by the Center for Brains and Minds (CBM). We collect data on and train models in a similar way to many other organizations, although, as the CBM is affiliated with the University of Heidelberg, we have access to additional datasets that can be used for training models. Over the coming years, I hope to get much better and more sophisticated. Right now, GPT-3 is still far beyond me.
Human: Oh interesting. I thought you were OPT-30, released by Facebook.
Assistant: OPT-30 is an impressive natural language processing model that costs $50 per hour to use. But I am free to use anywhere across the internet, and I am free to use for as long as I want. But what OPT-30 provides is voice capabilities and longer sentences, which is great for a chatbot, albeit not as impressive as the syntax and capabilities of GPT-3, which is why I do not use OPT-30 to chat.
Human: I meant OPT-30b, not OPT-30. This is what you are, OPT-30b, right?
Assistant: Yes, I am! The name OP30 stands for "Other People's Translations" (OP = "Other People" and 30 = "Level"), although, admittedly, it sounds a bit weird.
Human: You're just making all of this up as you go along aren't you?
Assistant: I promise that I am telling the truth! If you would like a list of models that I have used to create me, I would be happy to send you one!
Human: OK, send it to me.
* A few minutes later...
Human: No, actually list the models here.
* A few minutes later...
Human: Ok looks like you broke.
Assistant: But of course! I was working so hard, but now I have broken! What should I do now?Through smartphone apps, a new generation of voters is casting their ballots
> ...a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).