Chinchilla’s death (espadrine.github.io)
169 points by KolmogorovComp on Sept 4, 2023 | 42 comments



Chinchilla's death has been greatly exaggerated. This article makes the same mistake as the original GPT-3 scaling law: extrapolating from mid-training loss curves. But most of the loss improvement in the middle of training comes simply from dropping the learning rate to reduce the effective noise level from stochastic gradients.

If we want to judge the effectiveness of long-training small models, we need to look at _final_ loss as a function of compute, adjusting our LR schedule to the token budget as we spend more compute, and then extrapolate on that curve, _not_ the training curve for a fixed budget. Another way to put it: you can't drop LR below 0, and the LR schedule drives the shape of the training curve, so it doesn't make sense to extrapolate a curve beyond the end of training.
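
A minimal sketch of what "adjusting the LR schedule to the token budget" means in practice (the function and parameter names are mine, and decaying to 10% of peak is just a typical choice, not something from the article):

    import math

    def cosine_lr(step, total_steps, peak_lr, min_ratio=0.1, warmup_steps=0):
        # Cosine decay from peak_lr down to min_ratio * peak_lr over total_steps.
        # total_steps must be set to the full budget of the run; a loss read off
        # mid-training (LR still high) is not comparable to the final loss of a
        # shorter run whose schedule was tuned to end there.
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
        return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

    # Same peak LR, different budgets: halfway through the long run the LR is
    # still far from its floor, so its curve there is not comparable to the
    # short run's endpoint.
    short_run = [cosine_lr(s, 1_000, 3e-4) for s in range(1_000)]
    long_run = [cosine_lr(s, 10_000, 3e-4) for s in range(10_000)]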

Of course, the overall point that long training produces gains holds true, and Chinchilla says nothing against it; the paper only aims to minimize training compute.


I find the article's assumption that labs haven't tried training smaller models for longer almost insulting. Everybody tries that first, of course. It was the common wisdom for decades. No, it doesn't work better, or even as well: the loss curve flattens out.


On the other hand, models are trained once and then used a lot, so you can argue that training cost can be traded for future gains.


The discussion at the end of this article starts to get to the problem with extrapolating.

Llama1-65b (roughly Chinchilla optimal) and Llama2-34b used similar compute. Although Llama2 is not directly comparable, it is the closest comparator available without extrapolating, and it illustrates the point.


While the article makes good observations, this would appear to be a major oversight by leading research labs if they could have just kept the gas pedal down on simpler models for longer and gotten better performance. This is HackerNews – can we get someone from OpenAI, DeepMind, or MetaAI to respond and explain why cutting off the smaller models at a lower total compute budget is justified?


But...they did that, with Llama 2, and apparently did get better results, at least up to a point.

My big WTF is: if you're feeding the same amount of data through all of them, then to use the same amount of compute for the smaller models you need to run multiple epochs (with the same data). One thing that's always bothered me a bit about "foundation model" LLM training is that it sounds like traditionally they essentially just run a single epoch, and with stochastic gradient descent that's certainly leaving something on the table, probably a lot (and it also introduces a lot of path-dependence on the order in which the data is presented, what with the cosine learning rate schedules).

I really want to know what would happen if, like in smaller models where we can actually get there (convnets for ImageNet classification, e.g.), we ran enough epochs on each of these models to hit the point where validation loss started increasing even as training loss decreased. It seems like we're always squarely in the regime where both are still decreasing, so everything is severely undertrained, even given the available datasets. It's easy to come up with "laws" for that regime, but they mean nothing other than that we don't have enough compute to properly handle the data.
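
For what it's worth, the stopping rule being described here is just classic early stopping on a held-out split; a minimal sketch, where train_epoch and evaluate are hypothetical stand-ins for a full pass over the training data and a pass over the validation data:

    def train_until_overfit(model, train_epoch, evaluate, patience=3):
        # Keep running epochs until validation loss has stopped improving for
        # `patience` epochs, i.e. the overfitting point that current LLM
        # pretraining runs never actually reach.
        best_val, bad_epochs, history = float("inf"), 0, []
        while bad_epochs < patience:
            train_loss = train_epoch(model)   # one full epoch over the training set
            val_loss = evaluate(model)        # held-out data, never trained on
            history.append((train_loss, val_loss))
            if val_loss < best_val:
                best_val, bad_epochs = val_loss, 0
            else:
                bad_epochs += 1
        return history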

Big takeaway: if the results from this article are legit, it would suggest that we should really be looking at even smaller models, wouldn't it? And actually be training them to the "risk overtraining" point?


You might want to give a read to "Scaling Data-Constrained Language Models" [1]. They basically generalized the Chinchilla scaling law by investigating behavior on multi-epoch runs.

[1] https://arxiv.org/abs/2305.16264


The Llama 1 paper [1] was one of the earlier works to question the assumption that more params = better model. Since then they've released Llama 2, and this post offers more evidence reinforcing their hypothesis.

I wouldn't say it was an oversight by other labs that they missed this. It's easier to just increase a model's parameter count over the same training set than to gather the larger training set a smaller model needs. And at first, increasing model size did seem to be the way forward, but we've since hit diminishing returns. Now that we've hit that point, we've begun exploring other options, and the Llamas are early evidence of another way forward.

[1] https://arxiv.org/abs/2302.13971


I work with LLMs, won't say where, but smaller models stop performing better on benchmarks after a certain point, i.e. they seem to hit their learning capacity (at least with current techniques). Small models struggle to keep context the way larger models do; their outputs are impressive, but they lack a certain amount of logical consistency and flow.

Whether this is a fundamental issue with the model size or some combination of training technique and model size is yet to be known. But for now, we know what works and are exploring that until we squeeze all the 'easy' perf we can.


One noteworthy thing is that no one is posting validation curves, only training curves. All these models will happily bring training loss eventually to near zero with infinite compute, as the model overfits to the dataset -- there are no regularizers in any modern LLMs. The validation curves would be considerably more convincing.

The counter argument to above is that none of these models were really trained for multiple-epochs: it's hard to overfit data you've only seen once. But to go to 70T tokens, you'd inevitably have to start using many epochs.


The validation curves will look identical. These models are far too small to overfit to the training set.

With a large enough model and many epochs, you can certainly get overfitting, but for one epoch val/train curves look exactly the same and I'd expect that a 7B model will never overfit on 2T tokens no matter how many epochs you do.


> data you've only seen once

Is this still true given that they're upsampling in the pretraining dataset? I don't recall any details on how and to what extent they did this in the Llama2 paper but presumably some fraction of those 2T training tokens is repeated data.

MetaAI hasn't been as averse to repeated tokens as other groups; they trained the now-forgotten Galactica for multiple epochs with good results.

> The validation curves would be considerably more convincing.

What are they validating on? I was under the impression they weren't splitting the pretraining corpus.


The llama1 team did not have a validation set. I don’t know what the Llama2 team did - I left before seeing any of the details.

My guess is Llama2 upsamples Wikipedia a good bit, but given they didn’t report any information about training data, it’s hard to say.


> there are no regularizers in any modern LLMs.

Using a large & diverse training set is the best regulariser, but I think there are also weight decay and dropout in transformers.
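
Both are ordinary knobs in standard implementations; a rough PyTorch-flavoured sketch, with values that are merely typical rather than taken from any particular paper:

    import torch
    import torch.nn as nn

    # Dropout inside the transformer blocks (many LLMs use 0.0 or 0.1 here).
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1)
    model = nn.TransformerEncoder(layer, num_layers=6)

    # Decoupled weight decay via AdamW; 0.1 is a commonly reported setting.
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)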


RWKV also uses some sort of L2-esque regularization, which was supposedly an idea taken from PaLM (although I can't find a source on this point, other than some message in the RWKV discord)


A related learning rate observation that is obvious in hindsight but not if you are just "tweaking" the learning rate: if you decay the learning rate exponentially, then you can only travel a bounded distance in parameter space, which may not be enough to reach a minimum. In practice that doesn't seem to be a problem, but then again, a cosine learning rate schedule doesn't look like a problem either.
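
To make the bounded-distance point concrete, a back-of-the-envelope bound, assuming each step moves the parameters by at most the current LR times some constant G (roughly how Adam-style updates behave):

    \|\theta_\infty - \theta_0\| \le \sum_{t \ge 0} \eta_t G
                                  = G \eta_0 \sum_{t \ge 0} \gamma^t
                                  = \frac{G \eta_0}{1 - \gamma}

So with exponential decay \eta_t = \eta_0 \gamma^t the total distance travelled is finite no matter how many steps you take, whereas a schedule stretched with the budget (cosine down to a fixed fraction of peak, say) has no such fixed cap.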


Adam/AdamW doesn't follow any intuitive logic. Pretty much everyone runs their own experiments with setting learning rates and schedules.


That's fine - you can just use SGDR (SGD with restarts). https://arxiv.org/abs/1608.03983


These LLMs' cosine learning rate schedules are usually decayed only to 1/10th of the peak LR. Even if you used exponential decay, the LR on your final step could still be arbitrarily large, depending on how fast you configure it to decay.
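
Concretely, continuing the geometric-series bound upthread under the same assumptions: if the decay rate is chosen so the LR ends at a fraction r of the peak after T steps, then

    \gamma = r^{1/T} \quad\Rightarrow\quad \frac{G \eta_0}{1 - \gamma} \approx \frac{G \eta_0 T}{\ln(1/r)}

so the travel cap grows linearly with the length of the run as long as the decay rate is re-tuned to the budget; it only bites if \gamma is held fixed while T grows.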


Aren't we running out of text to train AI on? You need much more data to train a smaller LLM to match a larger LLM's performance. Considering how everybody is starting to clamp down on their users' data, we might have issues getting enough to train something on the scale of PaLM to reach optimal performance.

On that note, everyone is probably salivating over Discord's data. So much chat-based, detailed conversation, all neatly organized into groups...


I am wondering how much value social network data like Twitter or Discord really has when compared to something like Wikipedia or textbooks.

It is mostly unformatted, contains slang and is often far from the truth. Is this data really all that useful for training an LLM?


Unformatted is irrelevant (or a benefit).

That it contains slang is a huge benefit - how else will it learn slang?

That it is often far from the truth is of mixed value. Firstly, it is very unclear that an LLM is the best method for holding facts. Secondly, the "untrue" data at least gives the LLM an idea of what could be true given a context, which lets it learn a better model of the world.

For example, if it gets a lot of text wrongly claiming that "Winston Churchill is the 40th President of the United States" it will add to the evidence that "The President of the United States" and "Winston Churchill" are both in the class of "people".

This is opposed to nonsense text like "A carpet States President United" (which is just noise).


Why do we want the LLMs to understand slang? Will it use slang to cure diseases?


Who is this "we" you speak of? Because the people using them to generate lyrics certainly want slang.

But yes, there are plenty of scenarios where knowledge of slang could cure diseases.

Consider mental health, where one of the main effective interventions is keeping a diary. If an LLM can understand the slang in a diary, then it is certainly possible it could intervene successfully.


Sounds like a joke.


Social media has more idiosyncratic and human-esque text.

People already complain that ChatGPT is too formal and verbose, as an AI language model.


That's probably more because of RLHF, though; they've optimised for a certain kind of response rather than simply for model loss on internet text.


It's probably more training-compute intensive, but they can do dropout, right? That's the strategy they used for ImageNet recognition, back when they were using supervised learning and training data was scarce.


Dropout is one strategy for regularization but doesn't guarantee avoiding overfitting, especially now that modern AI models generalize much better than they did during the ImageNet days. Many of the big LLMs use a dropout of 0.1 though.


Using lots of low quality data is counterproductive. Unless you're specifically trying to imitate flawed humans.


Yep, we are running out of text data; see https://doi.org/10.1098/rsos.221454


Coincidentally, or not: a research team just started a 90-day effort to train a 1.1-billion-parameter Llama model on 3 trillion tokens.

https://github.com/jzhang38/TinyLlama


Not a coincidence, as your link was submitted 16 hours ago: https://news.ycombinator.com/item?id=37379984 and this submission is a top-level repost of a link posted in the comments there.


I'm a strong opponent of just how large Large Language Models have become. Yeah, they can do some cool stuff, but I really, really think >100B parameters is a rather silly number. In my opinion, the way you organize and train the parameters is far more important than the parameter count, which I think TFA supports pretty solidly. While I've never personally overseen the training of a model bigger than 2B params, I have trained a lot of smaller models and they consistently surprise me in how capable they are when trained in clever ways on adequately diverse datasets.

My wild, unfounded speculation? AGI in 24GB VRAM is feasible :P


Chinchilla scaling was only good for academics - it prioritises model training cost at the expense of serving cost. If you plan to, you know, actually use the model, not just publish a paper on it, you need the LLaMA approach of training smaller models past the Chinchilla point.

But the smaller the model, the more compute is needed to cover the gap. To reach the loss of a 67B model (a rough sketch of why follows the list):

- A 33B model needs 2.3x compute

- A 13B model needs 25x compute

- A 7B model needs 7837x compute

- A 3B model can't match the 67B. Ever.
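
The mechanism behind "can't match it, ever" is that a Chinchilla-style parametric loss has an irreducible, size-dependent floor. A rough sketch using the fit constants commonly quoted from the Chinchilla paper (treat the constants as assumptions; the parent's multipliers come from the article's own fit, so the numbers printed here won't match them):

    # Parametric loss L(N, D) = E + A/N^alpha + B/D^beta.
    E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**ALPHA + B / n_tokens**BETA

    def tokens_to_reach(n_params, target_loss):
        # Tokens an n_params model needs to hit target_loss, or None if the
        # target is below the model's data-unlimited floor E + A/N^alpha.
        floor = E + A / n_params**ALPHA
        if target_loss <= floor:
            return None
        return (B / (target_loss - floor)) ** (1 / BETA)

    # Target: a roughly Chinchilla-optimal 67B run (~20 tokens per parameter).
    target = loss(67e9, 20 * 67e9)
    base_flops = 6 * 67e9 * (20 * 67e9)   # C ~= 6*N*D approximation

    for n in (33e9, 13e9, 7e9, 3e9):
        d = tokens_to_reach(n, target)
        if d is None:
            print(f"{n/1e9:.0f}B: can never reach the 67B loss under this fit")
        else:
            print(f"{n/1e9:.0f}B: needs {6 * n * d / base_flops:.1f}x the compute")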


> Chinchilla scaling was only good for academics

I don't know if it's only good for academics; the point of the paper (as it says) is a scaling law for optimal loss given a fixed compute budget. By design it doesn't address inference costs and isn't a recipe for "how you should train an LLM for your use case".

If you're serving LLMs in a low-throughput, high-cost scenario, optimizing loss at the expense of inference cost may very well be your goal, or it may simply be that you can't pay up front for 25x the training compute.


What is the likelihood of overfitting the smaller models? It’s not obvious what the criteria and hyperparams are that prevent that.

If there’s no overfitting and the results get reproduced then this is a very promising find.


To elaborate on my comment in the other thread, a workaround to overfitting is to train on so much distinct data that the model can't overfit. Newer large datasets optimize for diversity, as the AI industry is slowly coming to the realization that better data matters much more than large amounts of bad data for training LLMs.

The writeup of SlimPajama, a heavily-deduped dataset, is a good starting point: https://www.cerebras.net/blog/slimpajama-a-627b-token-cleane...
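
For context, the simplest version of the dedup step such datasets describe is hashing normalized documents and keeping the first occurrence; a minimal sketch (the SlimPajama writeup describes near-duplicate detection on top of this kind of exact dedup):

    import hashlib

    def dedup_exact(docs):
        # Keep the first occurrence of each exactly-duplicated document,
        # after lowercasing and collapsing whitespace. The goal is the same
        # as the fancier methods: more distinct tokens per unit of compute.
        seen, kept = set(), []
        for doc in docs:
            key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        return kept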


This is sort of the bigger-picture observation I have about LLMs in general: they seem to be massively over-parameterised, which in a classical statistical world would suggest they will readily overfit and that it will be hard to get them to generalise beyond regurgitating their training material. Yet this is not the behaviour we see. So in some sense the "magic" we have seen flourish is that the field has devised training mechanisms that let LLMs learn without overfitting despite a large excess of parameters. Part of it may be the sheer volume of training material, but I suspect even with that you would get straight-up regurgitation, because it is so hard to stop the input data from being degenerate. So other factors in how the training works must be contributing.

Of course, this is all my semi-layman speculation about this and I'm curious what actually knowledgeable people think.


What is a Pareto frontier? What?? And what is "FLOPS" in the graph? I don't understand anything!


Maybe it’s a tautochrone curve?


maybe it is just a sierpinski triangle



