I've been increasingly wondering whether the field's treatment of LLMs as a continuum, rather than a set of distinct thresholds, is producing erroneous "rules of thumb," since most methodology research right now is concentrated on smaller, more accessible models.
We generally recognize (nearly ad nauseam) that mouse models in medical research don't necessarily translate to humans.
Similarly, I'd imagine most would laugh at the idea that a neurology researcher who found the best way to get a fruit fly's brain to navigate a maze should extrapolate that methodology to a dolphin or a chimp's brain.
Maybe we should be defining "weight classes" for LLMs and grouping research based on those classes. Like "these are the techniques that work best for lightweight models" but not necessarily assuming those as a general rule of thumb for "heavyweight models."
Even the discussion of synthetic data and model collapse is a good example of where the effect on model quality might differ significantly between a cheaper, less sophisticated model generating synthetic data to feed back into itself and a much more complex and sophisticated one. Maybe the lesson is actually "recursive training on synthetic data leads to model collapse in lightweight and medium-weight models."
So while this is a great write-up on fine-tuning 7B models with LoRA, I'd be curious what percentage of the recommendations would hold up in replication on even a 65B model.
> research on methodology is concentrated in smaller and more accessible model experimentation
Let's pull out a part of the article: "Note that my experiments also included two arithmetic benchmarks (they are included in my other more technical write-up), on which LoRA-finetuned models performed significantly worse than the pretrained base models. My hypothesis is that the model unlearned arithmetic because the Alpaca dataset did not contain corresponding examples. Whether the model completely lost the knowledge or whether it's because the model can't handle the instructions anymore would require further investigation. However, a takeaway here is that it's probably a good idea to include examples of each task you care about when finetuning LLMs."
The LLM for your average customer service bot isn't going to have much need for arithmetic, or an ability to code! Can we get smaller models without math, or take a big model and prune that capability, or better yet tell it "see Wolfram Alpha"...
Stuff like this is how we go from PhD projects that are black boxes to opaque systems that engineers can beat on until they are clear and well understood.
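To make the article's takeaway concrete ("include examples of each task you care about when finetuning"), the data-side fix looks roughly like the following. A minimal sketch with hypothetical file names and a toy arithmetic generator, not the author's actual pipeline:

```python
import json
import random

# Hypothetical paths; the Alpaca-style file is assumed to hold
# {"instruction": ..., "input": ..., "output": ...} records.
ALPACA_PATH = "alpaca_data.json"
MIXED_PATH = "alpaca_plus_arithmetic.json"

def make_arithmetic_examples(n=1000, seed=0):
    """Generate simple addition problems so the model keeps seeing the
    task during finetuning, instead of unlearning it."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        examples.append({
            "instruction": f"What is {a} + {b}?",
            "input": "",
            "output": str(a + b),
        })
    return examples

with open(ALPACA_PATH) as f:
    data = json.load(f)

# Mix the extra task into the instruction data and shuffle.
data += make_arithmetic_examples()
random.Random(0).shuffle(data)

with open(MIXED_PATH, "w") as f:
    json.dump(data, f, indent=2)
```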
I'm not saying that smaller-model task specialization is a bad thing. If anything, the line of research kicked off by Orca, using more complex models to jumpstart the fine-tuning of much smaller ones, is probably my pick for the most important ML research trend of 2023.
But even in the example you bring up and your comments on it, I'd strongly recommend considering Goodhart's Law: turning a handful of measurements into the target, and throwing other capabilities away to improve scores on those measurements, doesn't necessarily represent a path to best-in-class production feasibility.
I can imagine a number of edge cases where a customer service bot not having basic math capabilities could lead to issues ("Did the package come with at least 4 screws?" "It only had three" "Ok, great - I don't see any issues with the shipment and am denying the return request").
Further, many qualities that probably do matter for applications like customer service, such as patience, empathy, or de-escalation, don't happen to be part of the measurements any LLMs are being optimized to hit (even though they are almost certainly represented, at least in part, in the pretrained models given their presence in the data).
We've become a bit too focused on optimizing LLMs around measurements that reflect our anchoring biases about what AI should look like as imagined decades ago, rather than evaluating the starting point and the use cases as they actually occur, as we would for a tool by any other name.
Though this is all an entirely different issue from whether different model sizes require their own best practices.
This is an exceptionally useful article. A few highlights:
* QLoRA works really well compared to LoRA if you need to save memory (at the cost of time)
* For small LoRAs, Adam has almost no memory usage penalty compared to SGD
* Multiple training epochs lower performance (!). To quote: "This performance decline is likely due to increased overfitting, which warrants additional investigation." (Note that this is LoRA overfitting, and it's unclear which layers it was enabled for in this experiment.)
* The best results for the alpha and r parameters in LoRA seem to come from setting alpha = 2r (see the config sketch after this list).
* Better datasets are much better: 1k LIMA examples give better results than 50k Alpaca examples.
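For reference, here's roughly what those settings look like with Hugging Face's peft and transformers libraries. A minimal sketch assuming a 7B Llama-2 base and commonly targeted attention modules; the article's exact hyperparameters may differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "meta-llama/Llama-2-7b-hf"  # assumed base model

# QLoRA: load the frozen base in 4-bit to save memory (slower per step).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# LoRA config with alpha = 2r, per the best-performing setting above.
r = 16  # illustrative rank; the article sweeps several values
lora_config = LoraConfig(
    r=r,
    lora_alpha=2 * r,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From there, a single epoch with a standard Trainer is the usual pattern; per the overfitting note above, more epochs may actually hurt.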
LoRA blew me away the first time I looked into it. Especially since you can host many LoRA adapters at once for a fraction of the cost of hosting an entire model by sharing the base between the adapters. I built a little tool to make LoRA fine-tuning easier. The adapters export to Huggingface. You can check it out here: https://app.haven.run
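If it helps to picture the shared-base setup: peft lets you attach several adapters to one base model and switch between them per request. A rough sketch with hypothetical adapter paths; a production server would batch and route requests more carefully:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-2-7b-hf"      # shared base weights, loaded once
ADAPTERS = {                            # hypothetical adapter repos/paths
    "support-bot": "my-org/lora-support-bot",
    "code-comments": "my-org/lora-code-comments",
}

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# Load the first adapter, then register the rest under their own names.
model = PeftModel.from_pretrained(base, ADAPTERS["support-bot"], adapter_name="support-bot")
model.load_adapter(ADAPTERS["code-comments"], adapter_name="code-comments")

def generate(adapter_name, prompt, max_new_tokens=64):
    """Switch the active adapter and generate; the base stays in memory."""
    model.set_adapter(adapter_name)
    inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(generate("code-comments", "# Write a docstring for: def add(a, b): return a + b"))
```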
I fine-tuned Llama-2 on code/comment generation (in Python) for around $2 and was able to run it natively on an M1 MacBook Air. I can totally see smaller fine-tuned LLMs being used locally on consumer devices in the future. I think people underestimate how cheap and efficient this stuff is.
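For anyone curious about the "natively on an M1" part: one common route is merging the LoRA weights into the base, converting to GGUF with llama.cpp's conversion scripts, and running it through llama-cpp-python with Metal acceleration. A sketch assuming a hypothetical already-converted model file; this isn't necessarily how the parent did it:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (Metal build on Apple Silicon)

# Hypothetical path to a GGUF export of the merged, fine-tuned model.
llm = Llama(
    model_path="llama2-7b-code-comments.Q4_K_M.gguf",
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU via Metal
)

prompt = "Add a concise comment to this Python function:\n\ndef add(a, b):\n    return a + b\n"
result = llm(prompt, max_tokens=128, temperature=0.2)
print(result["choices"][0]["text"])
```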
I've actually built a service that lets you fine-tune Llama-2 and other LLMs by uploading a JSON dataset. I'm looking for feedback; the link is https://useftn.com.
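I don't know your exact schema, but for JSON instruction datasets the Alpaca-style layout mentioned in the article is the common denominator. A guess at what an upload might look like, written out with plain json:

```python
import json

# Guessing at an Alpaca-style schema: one record per training example.
dataset = [
    {
        "instruction": "Summarize the customer's issue in one sentence.",
        "input": "My order arrived with only three of the four screws listed.",
        "output": "The customer received a shipment missing one screw.",
    },
    {
        "instruction": "Write a Python comment explaining this line.",
        "input": "x = [i * i for i in range(10)]",
        "output": "# Build a list of the squares of 0 through 9.",
    },
]

with open("finetune_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```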
I've been playing with Llama-2, and I've been pleasantly surprised with its ability to process images. Do you plan to offer image inputs in your fine-tuning service?
I've been thinking about ways to compress information with AI for long-distance transmission over LoRa radio for a while, and now this LoRA in the news gets me all confused.
Ever since the author paywalled some of his useful posts, I stopped following him. I have read his ML book and I know he used to be a professor and is now working in the industry, and he’s quite famous in the field. That’s why I don’t understand why such a figure would even need the extra income generated by Substack’s paywall.
Nothing wrong with that. But it's strange that such a person (who undoubtedly makes $$$) wants to make some more $×10^-n (n ≥1) by paywalling his articles.
He worked at a public university until 2021; you can look up his salary, since it's public information: $118,472.99 [0], not as much as your average mid-level software engineer.
Now he works at a startup [1], but not at the C-level, so he's likely making an average startup software engineer salary (certainly more than a public university professor, but not exactly FIRE money).
It's amazing how much people begrudge anyone wanting to make some money for the tremendous effort they put into helping people better understand an important subject. Anyone working in software can easily afford to support his work if they find it valuable.
Generally, anyone "famous" for teaching a technical topic is doing it as a charity, even if people are paying for their content.
People pay hundreds upon hundreds of dollars just to access a course, but aren't willing to pay for an article that provides the same material, at your own time and pace, for a tiny fraction of the cost?
Beats me...