QLoRA: Efficient Finetuning of Quantized LLMs (arxiv.org)
315 points by Garcia98 on May 24, 2023 | 107 comments



I'm very impressed at the quality of Guanaco 33B, the model that accompanies this paper.

You can try it out here: https://huggingface.co/spaces/uwnlp/guanaco-playground-tgi

I tried "You are a sentient cheesecake that teaches people SQL, with cheesecake analogies to illustrate different points. Teach me to use count and group by" and got a good result from it: https://twitter.com/simonw/status/1661460336334241794/photo/...


Altman’s push for regulatory capture makes so much sense given how fast this field is going. Open models you can run on regular hardware are still behind GPT-4 by some distance but they are closing in at a rate that leads me to believe there’s not much of a moat there.

There’s also a ton of promising work on quantization and pruning and other acceleration and compression techniques to make more powerful models run on smaller devices. So far the focus has been on just getting these things to work, not efficiency. There’s probably a lot of fruit to be picked here.

A few more years and a gaming PC may be at GPT-4 level or maybe even better.

No, not everyone will run their own models, but it shows that there will end up being many commercial apps and services, and they won’t all have to use OpenAI’s API. There’s going to be lots of competition. Unless of course it’s regulated away.


>Open models... are still behind GPT-4 by some distance but they are closing in at a rate that leads me to believe there’s not much of a moat there

Commercially, this doesn't matter even if true. As I keep reminding people here: moats are rarely about raw tech. Moats are much more about integrations, brand ("Nobody got fired for buying X"), access to API/raw compute...

If OpenAI has a superior Office 365 integration they'll have a de facto moat. If OpenAI has a larger plugin ecosystem they'll have a de facto moat. If OpenAI has access to better compute (that's far from a certainty) they'll have a moat. If it's much easier to use OpenAI than to install a local model they'll have a moat, etc. And that's true even if they don't improve their model at all.

What will open source offer? Privacy? You can see for how little and how easily people barter that.

Moats do fall - but for that, FOSS will have to think product. We're not there at all at the moment.

[EDIT: Oh, I didn't notice the author's name. We've had that conversation in the past here. Sorry for being repetitive.]


I alluded to what you say in the last paragraph: the average user isn't going to play with open source models, but the existence of the open source AI ecosystem will enable more competitors.

There's not going to just be one AI company. There will be thousands and thousands, each addressing different use cases and market niches. In a world where OpenAI has a powerful technological moat, all of these companies would end up having to pay rent to OpenAI. In a world with a strong open source AI ecosystem that's not the case. They can take open source models and even train them themselves and refine them for specific use cases.

Winner-take-all dynamics in general are overstated. They exist in a few niches but not most. How many networking, database, file sync, cloud, gaming, banking, or hosting companies are there? There have even been markets that once looked winner-take-all, like social media, that have recently experienced a flurry of diversification.

Edit: there's one more reason I'm not sure moats are strong in AI: AI can write code and can process "messy" inputs. One of the things that strengthens moats built around integrations is that the difficulty of doing the integration is part of the barrier. Integrations are frankly annoying and labor intensive to create. With AI you can just tell it to integrate in natural language and schlep messy, imperfect data into it. That makes integrations significantly less labor intensive, making it much easier for a competitor to pop up and add them quickly.


I agree OpenAI won't and can't have anywhere near a monopoly, excluding very unlikely legal shenanigans. The field is too approachable, and ultimately, to pinch into their market one doesn't need GPT-4 level performance. Besides, apparently OpenAI can't scale compute fast enough, so users will by default end up needing other providers. There's a very good chance of OpenAI/Microsoft being major players, though, IMHO.

I can however see a possible future where open source is not going to have any significant impact on LLMs, say like Desktop Linux. Either because it gets stuck in a technical realm and doesn't make anything too approachable to ordinary users, or because it lacks the necessary integrations, or because developers get stuck arguing about the license (raw LLAMA not being good enough due to the non-commercial requirement), or because a moral panic ("4chanGPT is radicalizing people!") leads to a form of legal restrictions that makes open source efforts difficult to sustain. This doesn't have to be, so long as the hacker community can avoid falling into complacency.

On integrations, you're thinking about input, but there are still significant challenges there: the output step, API keys, rate limits, various crazy API corners, certifications... LLMs will help, but I expect integration to still be annoying.

For example, the Microsoft example where the Assistant changes the system to Dark Mode. You can't use LLM messy input to get that output on a generic level (and if you could, that would risk the LLM as an attack vector). You might be able to use a software development LLM to help write the code to do that specific thing and make it available to the product LLM, but ultimately that's a generic software productivity speedup - which also 'helps' those writing the API to make it more complicated and do more stuff we'll need to implement...


I agree that pure open source probably won't cross over into general use. Open source is great at tech but bad at product, mostly because product requires doing a lot of boring polish that programmers hate and thus generally have to be paid to do (I have ranted about this extensively here and elsewhere). Consumers also hate choice these days because they're all in a state of continuous cognitive overload. They want the one thing to do the one thing and stop bothering me.

That being said, Linux and open source have had a gigantic effect on the market. This hasn't been by shipping products directly to consumers but by enabling a ton of startups to get there faster and cheaper. OSS enables a ton of innovation and consumers benefit from that.

I'm arguing that the same thing is going to happen in AI. That's all. AI will get faster, better, and cheaper, and the basic functionality will get commoditized. That will lead to more competition and more variation and make it hard for people like OpenAI to have a monopoly.

I'm sure OpenAI will still exist. They might stay a dominant player. I just don't see a world where they "own the tech" and get to be "the only AI" and charge rent to the entire industry because nothing works without their API. That's a fantasy... unless they can legislate it into existence, which is what I think they're trying to do.

On integrations: what happens when everyone uses AI? One of the most exciting possibilities I see for this technology is to entirely toss out the persnickety concept of the API in favor of LLMs talking to LLMs. Making software interoperate in the conventional way frankly sucks. It's a terrible slog through mud. Imagine if I could just say "hey app I just wrote, meet GitHub! GitHub, please explain to my app what you can do and how to access your capabilities... Now tell GitHub to... now tell other app on my machine to... now tell Amazon to..."

We are going to look back at how we did software interoperability the way we look at programming mainframes with punched cards.


Well, I guess we agree here - the basic functionality will be common, Open AI will not have a monopoly (unless lawmakers really screw up), but will likely exist and be a major player. I am not sure about Open Source at all just yet - e.g. it's possible that a panic will lead to restrictions that will make open work rather difficult. I guess we'll see.

>what happens when everyone uses AI? One of the most exciting possibilities I see for this technology is to entirely toss out the persnickety concept of the API in favor of LLMs talking to LLMs.

Hmm. Your idea has lower throughput and higher latency, while having not entirely clear security properties. But really, companies will sacrifice nearly all that in the service of faster shipping, paying less to programmers, and easier interfacing. Just look at the history of latency in user systems from the 1980s till today. The only game breaker here is security.

Hm. We implement an internal API that the LLM must use in order to do anything local, and do the limiting/auditing there. So the external interface is language, but what the LLM can do is either use the limited internal API to do approved actions locally, or speak to other LLMs (which will have their own internal APIs) using the user's context (OpenID token, whatever). There's a slight risk in the security boundary here (e.g. make sure the LLM never impersonates - don't allow it to see the actual user context, make sure each LLM instance only has access to the current user context. Also need to make sure it doesn't tell the other LLMs too much), but I think it's handleable?

Regardless, we aren't there yet, and there are still a few issues here (e.g. I expect Microsoft/Google to always have an advantage in Office 365/GSuite integrations somehow). Until we do, integrations will still have a big effect on the market.


Thinking about this a bit more, the main issue is that the LLM is an attack vector. Hallucinations aren't a big deal if the action requires approval or can be easily undone. The only way right now I can see to make LLM interfaces 'safe' would be to give it an internal API and pass the user context so we can set up permissions that way. There are a very few cases where even that might not be enough (basically whenever you pass 'on-behalf-of' transactions _without_ direct approval from the origin via a temporary token - e.g. many backend payment APIs allow this).

However, if you have an internal API which is the real security boundary, you might as well expose it. The effort required to add an LLM is now an extra effort and that flips the incentives to where the LLM really adds value. So adding LLMs as interfaces really makes sense on Windows 11 (practically single user, hallucinations have limited cost) and GitHub (devs have local copy anyway), maybe on AWS (devs would like an easier interface, but there's a need to reassure about safety), and none on Stripe or MailGun (complex security scenarios, usually decided by finance/marketing department which don't care about integration difficulty).


Interesting parallels; there's a good possibility of this outcome with respect to open source LLMs:

> I can however see a possible future where open source is not going to have any significant impact on LLMs, say like Desktop Linux.

> Either because it gets stuck in a technical realm and doesn't make anything too approachable to ordinary users, or because it lacks the necessary integrations..


I agree with you that a good integrator can be successful, but we should not forget that the underlying code running all of this is on the order of hundreds of lines (see alpaca.cpp, for example), combined with a boatload of money required for training (which is quickly going down), to simplify the argument a bit.

It's hardly comparable to the gigantic ecosystem of services and micro-services that your comparison alludes to with IBM and Microsoft. OpenAI is nowhere near that level of brand recognition, and grassroots support for it within a company would quickly move to the next free ChatGPT clone that is better or cheaper or faster or more accessible, just as happened with DALL-E 2.

The value of the models themselves will quickly come down to just a slight bit over the underlying hardware costs and Altman knows this.

Microsoft bought themselves a $10B time window to try to do what you're saying, so let's see :) But even for them, once they've built LLM adaptations into their most popular products, it's fairly simple to just swap out the backend for something newer, shinier, and cheaper that's not OpenAI, and the end customer won't notice, as it's the Microsoft or Office brand that they buy into. They are not going to advertise what's inside their products with big "Powered by OpenAI" banners in the long run, I think (do they now?)


What you refer to as 'the underlying code' isn't the real underlying code; the underlying code is the model. Yours is just the runtime, which is commercially the least interesting part (which is why it is, or will be, open sourced). Both the model and the integration are way more interesting commercially. I'm sure the code Microsoft wrote to integrate their assistant into Windows is far more than hundreds of lines. How long until GNOME/KDE even start on something similar? Probably not before 2030.

Anyway, I treat OpenAI and Microsoft as two sides of the same coin given the level of integration between the two. It's arguable that Microsoft has the upper hand here, but OpenAI is their main source of LLM talent. [EDIT: I don't see MS switching away from a backend they control, especially when performance is apparently already adequate and the real cost isn't licensing the code but Azure, so open source doesn't necessarily have an advantage here.]


> Open models... are still behind GPT-4 by some distance but they are closing in at a rate that leads me to believe there’s not much of a moat there

This assumes that OpenAI isn't also internally closing in on something even more impressive, to be released next year or whatnot.


Open models can talk about "forbidden" topics and can be extended by users. Both of them are significant advantages.


The ability to talk about 'forbidden' topics is also a significant disadvantage. Just wait for the first moral panic targeting open source GPTs. I think that Open Source will exist (barring a legal ban), but the triumphalism is in my mind very unjustified. There's a fair chance closed source will get 99% of this market.


It's an advantage, because each user can apply different restrictions to Open Source AI models. If I'm writing a crime novel for adults - I want the assistant to be able to talk about sex and violence. But obviously - a corporate customer support chatbot should not talk about those things.

I understand why ClosedAI added those restrictions. But they are too inflexible.


I wonder if there's a campaign that Americans can get behind to counter Altman's push for regulation. Such a thing might get me to engage in politics again. I don't want the current pace of innovation in running these models on one's own computer to slow down, because it could be a good thing for assistive technology, e.g. running something like the Be My Eyes Virtual Volunteer [1] on one's own computer.

[1]: https://www.bemyeyes.com/blog/introducing-be-my-eyes-virtual...


To be fair, Altman has been fairly outspoken about not limiting open source models. How much of that was just for street cred, I cannot say, however.


I won't take him seriously on this - if one really believes in the case for regulation, then there's no good reason to exclude open source models. Let's take the most pessimistic possibility: that they have a slower rate of progress than commercial models, and that real progress depends on hardware. We still end up with the same result - every 'risky' point commercial models get to, open source will get to as well.


If progress really continues like this any regulation that does not limit open models would be a pointless exercise.

It’s only a matter of time before people really crack distributed training algorithms that can be run in a less organized swarm configuration. At that point open trainers could actually train near the frontier of what is possible. Most of the data is open.


> It's only a matter of time before people really crack distributed training algorithms that can be run in a less organized swarm configuration.

Why do you think this? It has continually failed, and seems extremely unlikely to me. Barring a surprising breakthrough, there is inherent communication complexity, and a physical limit to communication bandwidth.


Yeah, the network bandwidth is insane. Each A100 in an 8x A100 (80GB) pod [1] has its own 200 gigabit NIC! Including storage, that's nine 200Gbps interfaces for almost two terabits of total internal bandwidth.

The newer H100s each have a 400gbit NIC.

[1] https://shop.lambdalabs.com/deep-learning/servers/hyperplane...


The assumption I'm questioning is that it will always take a rig like that to train large models. I don't see anything inherent in the math or nature of these things that forbids a wider, more heterogenous, less synchronized, less orderly training path that involves basically a mob of people training and merging results. We just haven't really figured it out yet.

I actually see a little bit of this happening on Huggingface with people creating variations and "remixes" of generative models like Stable Diffusion and trying to one-up each other or make models to do esoteric things like render everything looking like anime. You're not going to get to the next frontier model with those methods but it shows that the interest exists and a flourishing ecosystem is forming. Now give that ecosystem new methods that are more powerful.

People with more money can obviously buy or rent more hardware. The question is whether that advantage will stay as meaningful as it is today forever.


We could try something like Civitai is doing already, but automated.

Each node could train the model on a separate concept and then combine the results.


The wording of this would be extremely difficult though. Are local NER models part of this? Relation extraction? What about GPTs that only decode to DSLs? If the model only outputs DNA sequences is that an area that can be more illegal or less if done for research by an individual? The breadth of different tasks and architectures can make this exceedingly challenging to regulate.


Whether or not the regulation goes anywhere, OpenAI keeps doing what they're doing, and they look good to politicians for being cooperative.


Open models mean that OpenAI has access to them and can learn from them. If the move is to target commercial competition, then this doesn't preclude it.


Just to be super clear, do you mean Altman's push for greater regulation, or that he is pushing for actual regulatory capture i.e. corruption of regulating authorities?


My prompt: You are a sentient cow with a PHD in mathematics. You can speak English, but you randomly insert Cow-like “Moo” sounds into parts of your dialogue. Explain to me why 2+2=4.

Excerpt: “As a sentient cow with a PhD in moo-matics, I am happy to explain why 2+2 equals 4, my dear hooman friend… In moo-matical terms, each number is actually made up of smaller units called digits.”

I approve.


Which is heavier: a pound of feathers, or a great british pound?

> Both weights are equal.


So same answer as free ChatGPT, only terser.


GPT-4 hit it out of the park:

The question seems to be asking about two different types of "pounds": one as a unit of weight (the pound of feathers), and one as a unit of currency (the British pound).

A pound of feathers: This is a measure of weight. In the avoirdupois system (which is commonly used in the US), a pound is defined as exactly 0.45359237 kilograms.

A Great British pound: This is the unit of currency in the United Kingdom, often symbolised as £. The weight of a physical £1 coin, according to the Royal Mint, is 8.75 grams.

So, if we are comparing the weight of these two "pounds," a pound of feathers is heavier than a physical £1 coin.


Tbh I asked GPT-3.5 and it responds correctly. Do you have an example where it fails?


Tested the same prompt with GPT-4, and here is the result: https://imgur.com/7y2FiSP

In my opinion, the GPT-4 result is far more informative and less muddled.

Both answers are mostly just regurgitating an SQL tutorial with the objects and column names cheesecake related, so I don't think it's an awfully good test.


It did fail on: "Burt's father has 3 sons, Jack and John. What's the name of the 3rd son?"


Hold on. I need someone to explain something to me.

The colab notebook shows an example of loading the vanilla, unquantized model "decapoda-research/llama-7b-hf", using the flag "load_in_4bit" to load it as 4bits.

When... when did this become possible? My understanding, from playing with these models daily for the past few months, is that quantization of LLaMA-based models is done via this: https://github.com/qwopqwop200/GPTQ-for-LLaMa

And performing the quantization step is memory and time expensive. Which is why some kind people with large resources are performing the quantization, and then uploading those quantized models, such as this one: https://huggingface.co/TheBloke/wizard-vicuna-13B-GPTQ

But now I'm seeing that, as of recently, the transformers library is capable of loading models in 4bits simply by passing this flag?

Is this a free lunch? Is GPTQ-for-LLaMA no longer needed? Or is this still not as good, in terms of inference quality, as the GPTQ-quantized models?
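
(For reference, the path I'm asking about looks roughly like this; a sketch assuming a recent transformers + bitsandbytes install, with the config knobs coming from the bitsandbytes integration rather than this exact notebook:)

    # Minimal 4-bit loading sketch; API names as of the May 2023 transformers/bitsandbytes releases.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "decapoda-research/llama-7b-hf"  # the model used in the notebook

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize weights to 4 bits while loading
        bnb_4bit_quant_type="nf4",              # the NormalFloat4 type from the QLoRA paper
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # place the quantized weights on the GPU
    )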


Quantization isn't (and wasn't) expensive, it's mostly just data shuffling. A good PC will do a 7B model in half a minute, up to a few minutes for a larger model. Quantized models being made available for download is more for the benefit of less technical users who may not be comfortable with the command-line tools, or for people with slow or metered connections who'd much rather download 15GB of data than download 60 only to squish it into 15.


The question is whether this step is actually doing the GPTQ optimized quantization, or simple truncation.


This work introduces a new quantization scheme, NF4, for 4-bit NormalFloat, based on previous work on quantile quantization, so it's not a simple truncation, but it's also not a GPTQ-like optimization method. Figure 3 of the paper shows accuracy improvement of NF4 over FP4.
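
To make that a bit more concrete: the quantile idea behind NF4 is roughly "place the 16 representable values at quantiles of a normal distribution, then scale each weight block by its absmax." A toy sketch (the actual NF4 levels in the paper and in bitsandbytes are constructed somewhat differently, e.g. to guarantee an exact zero, so treat this as a cartoon):

    # Toy illustration of quantile-based 4-bit quantization for ~normally distributed weights.
    import torch
    from scipy.stats import norm

    # 16 levels at evenly spaced quantiles of N(0, 1), rescaled into [-1, 1]
    probs = torch.linspace(0.5 / 16, 1 - 0.5 / 16, 16)
    levels = torch.tensor(norm.ppf(probs.numpy()), dtype=torch.float32)
    levels = levels / levels.abs().max()

    def nf4_like_quantize(block):
        scale = block.abs().max()                     # per-block absmax scaling
        idx = (block[:, None] / scale - levels[None, :]).abs().argmin(dim=1)
        return idx.to(torch.uint8), scale             # 4-bit codes + one scale per block

    def nf4_like_dequantize(idx, scale):
        return levels[idx.long()] * scale

    w = torch.randn(64)                               # one 64-element weight block
    codes, scale = nf4_like_quantize(w)
    print((w - nf4_like_dequantize(codes, scale)).abs().mean())  # small reconstruction error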


That's just one method...

- bitsandbytes was always used for on-the-fly 8-bit quant, just like it's being used for 4-bit now.
- llama.cpp (and derivatives) quantize ahead of time, but it's not resource intensive.
- mlc-llm (Vulkan/Metal LLM inference via TVM) does require lots of RAM for its quantization.


There are a very large number of quantization schemes in existence, definitely not just one, & they all have potentially very different ideas and schemes.

LLM.int8() was introduced earlier (https://arxiv.org/abs/2208.07339), by the same first and last authors as QLoRA, and is still what's used in Hugging Face Transformers with the `load_in_8bit` parameter.

The idea was just to quantize all weights to 8 bits except a few outliers, which are kept in original precision. This scheme kept the computations extremely accurate, and was really fast to do.

I haven't read the new paper, but I assume they came up with a more advanced fast distributional setup.
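
The gist of that outlier trick, as I understand it, in cartoon form (heavily simplified relative to the real bitsandbytes kernels; the 6.0 threshold mirrors the library's default llm_int8_threshold):

    # Cartoon of the LLM.int8() mixed-precision matmul: activation columns containing
    # large "outlier" values go through fp16, everything else through (simulated) int8.
    import torch

    def absmax_int8(x):
        scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
        return (x / scale).round().clamp(-127, 127).to(torch.int8), scale

    def mixed_matmul(x, w, outlier_threshold=6.0):
        outlier_cols = (x.abs() > outlier_threshold).any(dim=0)   # which input features have outliers

        y_fp16 = x[:, outlier_cols] @ w[outlier_cols, :]          # full-precision path for outlier features

        x_q, x_s = absmax_int8(x[:, ~outlier_cols])               # row-wise quantization of activations
        w_q, w_s = absmax_int8(w[~outlier_cols, :].T)             # per-output-column quantization of weights
        y_int8 = (x_q.float() @ w_q.float().T) * x_s * w_s.T      # int8 matmul (simulated in float here)

        return y_fp16 + y_int8

    x, w = torch.randn(4, 64) * 2, torch.randn(64, 32)
    print((mixed_matmul(x, w) - x @ w).abs().mean())              # small quantization error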



load_in_4bit requires storing a fully unquantized model as well as having enough RAM to load the unquantized model.

If you're an enthusiast with 10 models downloaded, do you want that taking up 500GB or 150GB? Do you want to need 64GB of RAM to load a model, or just 16GB?

That's the main reason for the popularity of pre-quantization.


Does it? I loaded Guanaco-65B on a 48GB graphics card using the new flag and it loaded just fine. I don't think the unquantized model fits on a 48GB card. So I'm not quite sure if this is true.


RAM, not VRAM. The model is fully loaded into (regular, system) RAM then quantized while being loaded into VRAM.

So you still need enough system RAM (or RAM+Swap) to load the unquantized model.


Why would it need to be loaded fully into system RAM at once? Couldn't it be quantized on the fly, as it's being read from disk?


Last time I looked into this, the answer was "because huggingface transformers and torch.load are written to do it this way"

You could absolutely do something streaming, or mmap the weights instead of loading them into system RAM. Just the default interfaces don't.


I'm not sure it could be for this complex quantization scheme. It's not a simple round to nearest quantization. Even if it is possible, it would require someone to develop that functionality. Currently it does load the entire unquantized model into RAM first.


Very, very recently. In the past few days. I tried it out immediately because GPTQ-for-LLaMA and hunting for or making quantized models can be tedious, but it was disappointingly slow. On a 3090 where I was getting responses for a given 13B model in 10-30 seconds, just using transformers with load_in_4bit took about ten times that for each response. There’s also the storage benefit of using actually quantized models.


For quality, the GPTQ-for-LLaMa repository README has already been updated with a comparison to this work. See under "GPTQ vs bitsandbytes".



Tim Dettmers is such a star. He's probably done more to make low-resource LLMs usable than anyone else.

First bitsandbytes[1] and now this.

[1] https://github.com/TimDettmers/bitsandbytes


Thanks for pointing this out. Such projects deserve visibility.


This is off-topic, but are there any communities or congregations (that aren't reddit) based around locally hosted LLMs? I'm asking because while I see a bunch of projects for exposing GGML/LLaMA to OpenAI compatible interfaces, some UIs, etc, I can't really find a good community or resources for the concept in general.

I'm working on a front-end for LLMs in general, having re-implemented a working version of OpenAI's code interpreter "plugin" already within the UI (and yes, I support file uploads), and support for the wealth of third-party OpenAI plugins that don't require auth (I've been testing with the first diagram plugin I found, it works well.) I'm planning to open source it once my breaking changes slow down.

This field moves very fast, I'm looking for feedback (and essentially testers/testing data) on what people want, and looking for prompts/chat logs/guidance templates (https://github.com/microsoft/guidance) for tasks they expect to "just work" with natural language.

Instead of being limited by the monetization for ChatGPT Plus (and limited number of messages every four hours) for extensibility within a chat interface, I want to open it and free it, with a Bring-Your-Own-(optionally local)-LLM/API key setup.


The "lmg" (Local Models General) thread on 4chan's technology board /g/ [0] is the premier community spot for open source models, believe it or not.

Everyone from the infamous "oobabooga" to llama.cpp's Georgi Gerganov regularly hangs out in the thread.

If you have questions, you will get answers there.

[0] https://boards.4channel.org/g/#s=lmg


You know HN has gotten lackluster when 4chan is more informed.


HN is hardly the right place for an ongoing discussion


Clearly.


The /g/ board on 4chan has a /lmg/ general that focuses on running models locally. They regularly discuss fine tuning models, quantization tech, and building apps on text-generation-webui/kobold.

You might get some interest but it's also 4chan...


it's reddit, but /r/LocalLLaMA/


A name that is destined to be obsolete.


There's also r/oobabooga and r/MachineLearning


You might want to try some of the discord channels connected to some of the repos. i.e. GPT4All https://github.com/nomic-ai/gpt4all scroll down for the discord link.


I'd love to hear more about your approach for getting LLMs to understand how to use plugin commands. My own experiments have not worked very well (even vanilla ChatGPT through the gpt-3.5-turbo API doesn't seem to get the concept, most of the time).


I'll do a show HN probably at the beginning of next month after this hackathon, but basically look into langchain's "tools" and the different agents they have. You don't need langchain for this at all, but it gives you the groundwork.

I saw their Code Interpreter demo on Twitter (converting an uploaded video file in a chat UI) and decided that I need that, without continuing to pay them money (because they still haven't given me access to it yet.)

So, that, and after sam a went in front of congress for the regulatory capture play, was the motivation I needed to work towards commoditizing these fuckers.

The secret sauce here with code interpreter is, well, literally a python code interpreter you can run in your browser, and it's not so secret.


I imagine you might find your answer in the form of a discord community



fantastic. Will keep my 3090 busy for a while!

"Furthermore, we note that our model is only trained with cross-entropy loss (supervised learning) without relying on reinforcement learning from human feedback (RLHF). This calls for further investigations of the tradeoffs of simple cross-entropy loss and RLHF training. "

Does this mean RLHF is not really necessary for high quality chatbots?


Yes and no.

Someone pointed out that some of the answers are using OpenAI's famous "As an AI...". Soooo you can roughly say that RLHF might still have had an impact here through the training data that came from an RLHF model.

But what we are seeing is a revolution in picking quality demonstration data that might make RLHF an optional last step for fine-tuned models.

I'm bullish on a new technique where the quality of instruction-tuned code model outputs is measured and evaluated automatically by a machine. I'm calling it RL<machine>F.

The code model outputs a test harness, an implementation, and instructions for running the implementation on a user's inputs. Have RLMF evaluate implementations against the synthetic test harness and against hidden user inputs.


The RTX3090 is crackin' it. I am considering a dual set-up with nvlink.


Since Loras are additive, is it possible to use them to do distributed retraining on a model, or even train an entire model bit by bit?


Like a torrent network but for training. That would be cool. The only question is how do you merge changes made by nodes (clients) across the network?

Clients could be incentivised to train as they are with crypto, but instead of mining, it's model training and in return they get "coin". Like making crypto mining useful.


While I understand the motives, they should just train model bits for free, a la SETI@home, then let people pick and choose which collections to use for their chatbot. The crypto aspect only adds complexity and wrong incentives that would hold it back. After all, it's not possible to verify that each node truly trained the network and is not returning garbage (multiple redundant trainings could be involved, which would make the whole thing slower).



Saw that - a real cool project! Thanks for sharing! Now the question is how to motivate people to train models?

I am thinking of somehow replacing crypto currency hashing with model training and inference. Somehow a training topic and data get fed into the network - naturally it somehow must be filtered to avoid weird crap - and queries are somehow run. Imagine all the crypto mining farms and gpus switching over to training massive models.


Yes, in a sense. It won't work well for distributing training on a single subject, but you can distribute training by subject and then combine LoRAs.


> and then combine LoRAs

How do you combine them?

> but you distribute training based on subject

Perhaps this could work like a set of hashed and trusted data sets split by subject and topic? Each node downloads one at random and trains against that single subset of a topic or something.


You can, like, just do weighted average. It works.

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models (this is Tim Dettmers's previous work!)

> These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference.

https://arxiv.org/abs/2208.03306
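
For the "weighted average" part, a minimal sketch of what merging looks like at the checkpoint level (the file names are made up for illustration; BTM's ensembling/averaging is more involved than this):

    # Parameter-wise weighted average of several expert checkpoints (or LoRA deltas).
    import torch

    def merge_state_dicts(state_dicts, weights):
        assert abs(sum(weights) - 1.0) < 1e-6
        return {k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
                for k in state_dicts[0]}

    paths = ["expert_code.pt", "expert_medical.pt", "expert_legal.pt"]   # hypothetical checkpoints
    experts = [torch.load(p, map_location="cpu") for p in paths]
    merged = merge_state_dicts(experts, weights=[0.5, 0.3, 0.2])
    torch.save(merged, "merged_expert.pt")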


So that means a massively distributed model-training network with cryptocurrency-like incentives is incoming? Where and how to begin? This could free us from companies such as OpenAI and potentially lead to the first AGI.

(Mentioning crypto because that would motivate the hordes of miners that already have the GPU power available to switch over.)


BTM is very promising, but it is unclear how much it scales, let alone "massively". The paper scaled it to 64 domains and it worked, but you probably want more than 64 nodes.

Since domain is actually important to its performance, you can't randomly split to 64 pieces, see Table 4 of the paper, "Domain expert ensemble outperforms random split ensemble". Performance difference is large.

So if you want to begin, I would start by researching how to scale domain split.


We already have exactly that for stable diffusion with Civitai.com. People have published a variety of LoRAs for different subjects just as you describe. The local LLM community is very much following the lead of the stable diffusion community in terms of how it's organizing, so I expect that we'll see a proliferation of domain LoRAs being published on an aggregator for LLM stuff before too long.


I don't think the two concepts are similar. I see no incentive for people to train for civitai and find no particular use for the generated content.

Edit: actually, some of that content looks suspicious.


Some of the people releasing popular models/LoRAs on civitai do alright via Patreon in addition to getting a lot of praise from the SD community, that seems like incentive to me.


Porn is a major reason that people train loras. The beauty of them is that people can pick and choose multiple loras to build truly custom ai


I think there may be stuff worse than porn lurking around there.


I just had the same discussion with a friend, I'm pretty curious about this


"We use QLoRA to finetune more than 1,000 models"

Over 1,000 models finetuned! Finetuning 65B models on consumer hardware in under a day, with full 16bit finetune performance.

4bit does it again!


Lol, some residual answers from what I assume are answers distilled from ChatGPT:

Q: "What is your favourite conspiracy theory?" A: "As an AI language model I don’t have personal preferences or biases so my responses will always reflect factual information based on what has been programmed into me by OpenAI."


From the paper it looks like you would get GPT-4 level quality with the 65B model - but if you just do some random tests you will quickly figure out that is not even remotely the case. There must be something seriously wrong with the benchmarks used.


The evaluation prompts I saw ( https://colab.research.google.com/drive/1kK6xasHiav9nhiRUJjP... ) were all "encyclopedia-style", asking for summaries of information that would have occurred many times in the training data (e.g. "How can I improve my time management skills?") and hence probably reflect ability to retrieve and summarise trained information as opposed to novel problem solving.

Or in the programming questions the ones I saw are also certainly in the training data (e.g. "Implement a Python function to find the longest common subsequence of two input strings using dynamic programming.")


Do you know which model size can be run with a 3090?


3090 can handle the ~30B models quantized to 4 bits.


Assuming you have 24GB of VRAM, you should be able to run something like 7 billion parameters (e.g. MPT-7B), and quantized you can perhaps get up to 13 billion.

Note: I have no practical experience of this, just reading around.


24GB can fit 33B parameter models in 4bit. You only need 4GB to run 7B models.
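
Rough back-of-the-envelope math for the weights alone (KV cache and activations need extra room on top):

    def weight_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8          # ~GB, since 1B params at 1 byte each is ~1 GB

    for n in (7, 13, 33, 65):
        print(f"{n}B: fp16 ~{weight_gb(n, 16):.0f} GB, 4-bit ~{weight_gb(n, 4):.1f} GB")
    # 7B:  fp16 ~14 GB, 4-bit ~3.5 GB
    # 33B: fp16 ~66 GB, 4-bit ~16.5 GB  -> fits on a 24 GB card with room for context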


Is lemon-picked a real phrase or did they use GPT to generate the abstract? The term is “cherry-picked”.


> When we notice a pattern we attempt to setup a question or prompt that will induce the pattern even though it is the incorrect solution, e.g., if we observe that the model tends to give long-winded answers we prompt the model to “Answer yes or no without explanation.” We use this to find “lemons” where we manage to adversarially break the model and “cherries” where we fail to break the model, and present both.


I think the idea is they're showing worst examples (lemons) rather than best (cherries).


Can someone help me understand what quantization means in this context, and why it matters?


GPT-4 ELI5:

- 4-bit Quantization: Imagine you have a box of 16 different colored crayons. But you realize that you can draw almost the same picture using only 4 colors. That's what quantization does. It reduces the number of different "colors" (or numbers) that the model uses to represent its knowledge, which saves a lot of space. In this case, they used a special kind of 4-bit quantization, which means they only used 16 different numbers instead of the thousands or millions that the model might usually use.

- Low Rank Adapters (LoRA): This is a way to change the model's knowledge without having to touch every piece of it. Imagine you have a huge, complicated Lego structure, and you want to change it. Instead of taking apart the whole thing, you just add or change a few pieces here and there. That's what LoRA does. It allows the researchers to fine-tune the model without having to use as much memory.

- Double Quantization: This is another trick to save memory. It's like if you realized that you could represent each of your 4 crayon colors with just 2 symbols, so you save even more space.

- Paged Optimizers: This is a way to handle moments when the model needs a lot of memory all at once. It's like if you have a small desk, but sometimes you need to work on a big project. Instead of getting a bigger desk, you just clear off and use the desk in small sections at a time.

By using these techniques, the researchers were able to train a very large model (Guanaco) on a single graphics card, which would normally not have enough memory for this task.


Guanaco 33B ELI5:

> ELI5 ML model quantization

> Here is a simpler explanation of what AI model quantization means: Imagine you have a big box full of marbles, where each marble represents a piece of information. Each marble has a unique color, pattern, and size that tells you something about the information it represents. Now, let’s say you want to organize these marbles so you can easily retrieve the information later on. One way to do this is to group similar-looking marbles together into buckets. Instead of having a separate bucket for every single marble, you can combine marbles that look alike and put them in the same bucket. This is called “quantization” because you are converting many distinct pieces of information (marbles) into a few representative categories (buckets). It makes things easier to manage and access, but there is some loss of detail as well. The same idea applies to AI models. They are trained on huge datasets that contain lots of different kinds of information. But instead of storing every individual datapoint separately, the model can group similar patterns together and represent them with a few basic templates. This allows the model to run faster and take up less space, but it may lead to some errors or misclassifications.


Better than most humans would have come up with! How does this not have some kind of a model of the world, with real world analogies like this?


It probably does have some sort of simplified internal model of the world, who's to say it doesn't?


It's kind of creepy when GPT4 uses such life-like analogies.

You're an AI, you've never used crayons or played with legos.


I believe LoRA developed after GPT-4’s cutoff date. So this appears to have come from a human…


Quantizing is porting the weights from high-precision floating point numbers to lower-precision numbers, e.g., 4-bit and 8-bit ints. Less memory used by the weights means we can run bigger models on the same hardware.

The idea behind quantization is that these models have so many parameters, they'll still work even if we reduce each node's flexibility.


In the same breath, there's no free lunch. There's always a trade-off. Sure, the model might now fit in your VRAM, but it might be less accurate for your specific task.


You can compress LLM models and run them in less RAM. This matters because most people don't have access to powerful GPU clusters.


Just ask it what QLoRA is and it will give wrong answers.



