TinyLlama project aims to pretrain a 1.1B Llama model on 3T tokens

imjonse · on Sept 4, 2023

From the FAQ:

' Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?

Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve does not go down eventually, we can still study the phenomenon of saturation and learn something from it.'

It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down? Could they conceivably release a Llama 2.1 being checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?

jofi1 · on Sept 4, 2023

> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down? Could they conceivably release a Llama 2.1 being checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?

Because choosing the LR decay requires knowing the # of steps in advance. LR is too small after the 2T tokens, and changing it afterwards doesn't tend to help.

https://twitter.com/sherjilozair/status/1687837844729966592

jph00 · on Sept 4, 2023

AFAIK re-warming it up and then gradually decreasing it again ought to work fine. Have you seen any research showing that it doesn't?

fpgaminer · on Sept 4, 2023

That would work, in that it would allow one to continue decreasing the loss, but I wouldn't say that it would work "fine". A model trained with restarts always performs worse than a model trained for the same duration without restarts.

two_in_one · on Sept 5, 2023

> A model trained with restarts always performs worse than a model trained for the same duration without restarts.

Citation would be nice. From my experience restart sometimes is required. When model gets unstable and 'explodes', or gets stuck in some local minima. This is common with GANs. I usually rollback the model a bit, but keep the latest discriminator. So that discriminator 'knows' what to expect. It works in most cases, except for the 'fatality', when model blows up no matter what. That's the end of training.

jph00 · on Sept 5, 2023

I haven't seen any researcher that supports your contention. SGDR (SGD with restarts) has been shown to work well. https://arxiv.org/abs/1608.03983

charcircuit · on Sept 4, 2023

You could manually increase the learning rate or change the decay at any time.

naillo · on Sept 4, 2023

The most plausible explanation I've seen (other than the carmack 'sudden grokking' beyond the cutoff idea) is that they're planning to release llama3 sooner than later with some arcitecture changes for even better performance, so it makes sense to dedicate resources there instead.

ftxbro · on Sept 4, 2023

> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down?

If I remember correctly, it's because the main reason they trained multiple models was to show a scaling trend. Each model was trained using a chinchilla-optimal mix of model size, cpu amount, and parameter size. The point was to provide an empirical scaling law that could possibly be extrapolated to estimate the performance of more expensive models, like imagine a billion dollar model for which the model size, data size, and cpu amount is picked in the chinchilla optimal ratios.

On small models the chinchilla optimal scaling stops training the model even when the model is still improving.

The problem comes when people are actually using these small llama models rather than treating them as just data points. If you are actually using these models, what you want is one that is trained forever on as many tokens and training time as possible.

syntaxing · on Sept 4, 2023

This sounds like a really fun project, running small models would change a lot of industries like games in their example. But how do people afford these projects?! If I am doing my numbers right, it'll cost them 50K to train this model for 3T tokens.

jlokier · on Sept 4, 2023

That's less than a month's income for a few people on here. I recall a comment from an engineer at Nvidia a year or two ago saying $700k/year was about much they were paid, in response to someone else not believing those levels.

Get together 5 people in that position and it's less than a week's income for the group. That sounds doable as a hobby for those lucky people.

More realistically, it's within range for a grant, or use of someone else's hardware if they aren't using it, as the sibling comment from wongarsu said.

Also cloud vendors sometimes give out large batches of credits to startups and such as marketing incentive to get future customers.

wongarsu · on Sept 4, 2023

$38k, based on the "90 days using 16 A100-40G" and lambdalabs prices.

That's a lot for a hobby, but small enough that it might be running on a university machine (the TinyLlama devs provide a way to cite them and all seem to work or study at Singapore University of Technology) or could be sponsored (no indication of that now, but "people made an awesome model in our cloud" is good advertisement). Government grants or grants in general also aren't out of the question, especially for a topic with this much hype.

GaggiX · on Sept 4, 2023

>It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100.

They are training the model on 3000/22=136 times the value of the chinchilla scale. It will be interesting to see how much it will improve after way beyond this value.

npsomaratna · on Sept 4, 2023

Possibly a lot. See: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

isoprophlex · on Sept 4, 2023

Very interesting, thanks for sharing!

pluijzer · on Sept 4, 2023

I now come to understand that the technobable in Star Trek wasn't that well predicted, in the future we will not be reversing polarities by alligning field cores. Picard will have us align our llamas with chiwawas to get an alpacafied chinchilla model.

elpocko · on Sept 4, 2023

Lora and Alpaca at Tanagra.

DarmokJalad1701 · on Sept 4, 2023

Llama, when the loss fell.

kmlx · on Sept 4, 2023

from this episode if i’m not mistaken: https://en.m.wikipedia.org/wiki/Darmok

i watched that series so many times…

DarmokJalad1701 · on Sept 4, 2023

Hence my username.

koprulusector · on Sept 4, 2023

There’s should also be a tribble in there, somewhere.

sp332 · on Sept 4, 2023

Chinchilla predicts that you could get lower loss by training a larger model with that amount of data. But the model size in this case was chosen for other reasons, mostly speed of inference and cost of fine-tuning. So it's just irrelevant here.

GaggiX · on Sept 4, 2023

Well it's relevant if you want to compare the model trained optimally using the same amount of compute and this one parameter-bound to see how much you're trading.

cypress66 · on Sept 4, 2023

It's a bit amusing how people treat chinchilla scaling laws as a law of nature, when it's just about a certain architecture and dataset.

minimaxir · on Sept 4, 2023

A robust 1.1B model compared to a 7B model would be strongly appreciated. The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100; dropping it by an order of magnitude and letting it run on other cloud GPUs will open new opportunities.

brucethemoose2 · on Sept 4, 2023

> The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100

?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read, if not faster. A somewhat bigger consumer GPU can batch it and serve dozens of users.

I use 13B finetunes on my 2020 14" laptop all the time, with 6GB of VRAM and 16GB of CPU RAM.

I have seen many people on HN say this, and I can't help but wonder why the optimized, quantized llama implementations are flying under the radar.

littlestymaar · on Sept 4, 2023

> ?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read,

That's the thing: you need a whole GPU per concurrent user, this is insanely expensive if you want to run it as part of a SaaS (which is what most for-profit want to do). Of course running models locally is much better in almost every regard, but nobody is gonna be a billionaire with that…

mlyle · on Sept 4, 2023

Your point is anticipated by the next sentence in the comment you replied to:

"A somewhat bigger consumer GPU can batch it and serve dozens of users."

Did you not read it?

littlestymaar · on Sept 5, 2023

“dozens” doesn't really change the economics here. A SaaS can serve a thousand of concurrent users on computers that is the price of a 4090, so we're still 2 orders of magnitudes off compared to regular SaaS business models.

mlyle · on Sept 5, 2023

Sure it does:

- Most apps are not non-stop token generation for concurrent users-- ChatGPT's duty cycle at this is very low.

- A 4090 amortized over 4 years, working days & hours, is 20 cents per working hour; this is basically the same as the power going into it. It's less than a penny per hour per concurrent on a task like this.

- Hopefully you're using LLM to deliver value that's worth more than a penny per hour of the people using it.

- If you hit massive scale and want to buy A100s to improve the economics because you're drowning in business, you can go ahead and readily do that at that time...

littlestymaar · on Sept 5, 2023

> A 4090 amortized over 4 years, working days & hours, is 20 cents per working hour;

But that's not how it works: you need to have enough of it to accommodate for peek usage, but a good fraction of that isn't going to be running most of the time. You'd end up with a cost that's not too far from what Cloud providers are offering, which is a roughly 3 times that price. And you need to pay for the whole server hosting these GPUs (this less of a factor when you're using big GPUs like H100, but if you want to stick with consumer-grade GPUs, then the host is still a non-trivial fraction of the cost, and your supporting a server for a small bunch of concurrent users, which means your infra team is going to work with a massive pool of servers very quickly, with all the associated costs).

> It's less than a penny per hour per concurrent on a task like this.

It's still two orders of magnitude more expansive than any other SaaS business.

> Hopefully you're using LLM to deliver value that's worth more than a penny per hour of the people using it.

Maybe, but then again you're trying to build a service that has to add much more value than what the typical SaaS start-up provide.

Also regarding this:

> - Most apps are not non-stop token generation for concurrent users-- ChatGPT's duty cycle at this is very low.

ChatGPT is mostly being used by people who use it a few minutes per day, which is a nice place to be, but:

- this market is already taken by them, so your startup isn't gonna do the same.

- when you start integrating LLMs in tools you use routinely (an IDE being the typical example, then the token generation amount skyrockets).

mlyle · on Sept 5, 2023

> It's still two orders of magnitude more expansive than any other SaaS business.

Really? Some SaaS businesses have users doing things that generate tens of thousands of IOs per user request across spinning storage, or even far more.

> ChatGPT is mostly being used by people who use it a few minutes per day, which is a nice place to be, but:

I think you basically completely misunderstood everything I said. Here, the point was that someone using it is generating tokens a very large proportion of the time they're sitting in front of the service compared to most use cases-- but it's still only like 20% of the time.

We all have a pretty good understanding of the tradeoffs between owning hardware vs. elastic usage of a utility. We know that "peek usage" [sic] is higher than average (which is why there's a duty cycle correction in the calculation in the first place).

> - when you start integrating LLMs in tools you use routinely (an IDE being the typical example, then the token generation amount skyrockets).

It all depends. The system I just built and deployed does not need to be immediately responsive to end-users (users can tolerate a delay of a couple of minutes), with a few thousand tokens per user per week, and usage smeared pretty well over a several hour per day window. There's a lot of reasons (beyond economics) why moving it to a consumer GPU is attractive, but it won't be happy with a 1B parameter model.

littlestymaar · on Sept 5, 2023

> "peek usage" [sic]

You are very smart indeed…

mlyle · on Sept 5, 2023

This whole subthread is based on you misreading the original assertion (from someone else) and being off by a couple of orders of magnitude-- then pretty badly misreading me.

There's plenty of reasons why firms will want to run this stuff on-prem, both for their own usage and as a service. It probably will not be the majority of usage or zero, but instead a noticeable small chunk.

Yes, it's more expensive than many things, but not anywhere close to the most expensive service that people choose to run on-prem. And you can still support a decent userbase from a few computers, depending upon what you're doing.

brucethemoose2 · on Sept 4, 2023

A single GPU with a batch size of 1 can serve many users, higher batch sizes can serve many dozens, pool a few and you can serve a sizable userbase.

It may not be super profitable, but its not untenable either.

minimaxir · on Sept 4, 2023

LLMs are GPU compute-bound. If you infer at batch_size = 1 on a model like Llama 2 7B on a "cheap" GPU like a T4 or an L4 it'll use about 100% of the compute, which means you get no benefit from batching.

The exception is the A100 GPU which does not use 100% of GPU compute and therefore you get benefit from batching, but is hella expensive.

The economics are not simple, and in most cases "just use the ChatGPT API" is also the most cost-effective option anyways. A smaller 1.1B model (which would likely not be compute-bound) with similar performance to a 7B model may tip the scales.

brucethemoose2 · on Sept 4, 2023

> LLMs are GPU compute-bound.

From what I understand, they are severely bandwidth bound at a GPU batch size of 1. Even llama.cpp is fairly RAM speed bound on a CPU with much less compute than a GPU.

It's just that batching is quite inefficient without an implementation like this: https://www.anyscale.com/blog/continuous-batching-llm-infere...

two_in_one · on Sept 5, 2023

> "cheap" GPU like a T4 or an L4 it'll use about 100% of the compute,

LLM with batch_size=1 technically cannot use '100%' of GPU. Because it has to move a lot of data around and use different blocks of GPU. So, when tensor cores are used cuda cores are idle. Tensor cores are used for matrix multiplication, cuda cores for activation functions (I'm simplifying). Model has to use both at different times moving data between them. Meanwhile GPU monitor may report 100%. But it's still possible to insert another process. I think I've seen this idea in Pytorch docs.

As for 1.1B LLM, it would be nice. Interesting experiment anyway. I'm only afraid that with big and diverse dataset model will focus more on memorization and generic logic may not emerge. They aren't doing anything new in terms of architecture and training methods.

snovv_crash · on Sept 4, 2023

That's still wildly too expensive if you want to make a profitable service that is scalable beyond VC capital injections.

cypress66 · on Sept 4, 2023

1.1B with 3T tokens will never be comparable to 7B with 2T tokens.

And I'm not sure what you mean by inference latency being infeasible. Most people using thsss models at home don't even bother with the 7B and go straight to 13B because it's easy to run too and much smarter. And any cloud gpu can run 13B.

kristianp · on Sept 4, 2023

Could this be used as a source of speculative tokens for larger llama models?, as per https://github.com/ggerganov/llama.cpp/pull/2926

Also, when are we going to start seeing open weights MOE models being released?

thawab · on Sept 4, 2023

1- yes, Gorgie twetted he is looking into it[0].

2- The only 2 i know of are airoboros[1] and Hydra which is still in progress.

[0] https://x.com/ggerganov/status/1698667093711880687?s=46&t=Jp...

[1] https://github.com/jondurbin/airoboros#lmoe

kristianp · on Sept 4, 2023

Thanks. Yes, I've seen airoboros, it aims to use a mixture of fine-tunes of the base model if I recall correctly. Not a truly pre-trained MOE, but could be useful.

Hydra, is this it? https://github.com/SkunkworksAI/hydra-moe

thawab · on Sept 5, 2023

Yes, it's fine-tuned models, hopefully the community find use-cases where it will shine. Regarding Hydra, yes, that's the one. To stay updated, join the Discord mentioned in the repo.

Havoc · on Sept 4, 2023

What does “pretrain” mean in this context? It sounds like normal training

rodonn · on Sept 4, 2023

GPT stands for Generative Pre-trained Transformer.

The "main" training step using huge amounts of inputs is called pre-training. The idea is that after that pre-training, you might fine tune the model for your specific use case.

Havoc · on Sept 4, 2023

I see...that makes sense. Thanks for explaining

Filligree · on Sept 4, 2023

As opposed to fine-tuning or in-context learning. It really is normal training.

Mxbonn · on Sept 4, 2023

Couldn't immediately find it but who sponsors/pays for the compute?

sp332 · on Sept 4, 2023

The link that says you can watch cross-entropy loss live is locked or broken.

fragebogen · on Sept 4, 2023

Works now for me https://wandb.ai/lance777/lightning_logs/reports/metric-trai...

alexedw · on Sept 4, 2023

This is silly. Look at the loss and benchmark curves for the Pythia suite of models - the smaller models certainly did saturate and in fact began worsening.

2T not saturating on a 7B is very different from 3T on a 1B.

littlestymaar · on Sept 5, 2023

That's the point of the experiment actually…

RC_ITR · on Sept 4, 2023

Not to be a downer, but wasn’t one of OpenAI’s earliest discoveries that training small models on huge datasets leads to over-fitting?

It’s my understanding that the entire race to ever-more parameters was driven by that.

minimaxir · on Sept 4, 2023

A workaround to overfitting is to train on so much distinct data that the model can't overfit.

Newer large datasets like the ones used here optimize for diversity. (e.g. SlimPajama is a heavily-deduped dataset)

ljlolel · on Sept 4, 2023

Learn about the magic of double descent

RC_ITR · on Sept 4, 2023

https://openai.com/research/deep-double-descent

Yeah, the line keeps going down as the model gets bigger. What's your point? That there's a hump in the middle?

e12e · on Sept 4, 2023

Are they upsampling - whatever that means in the context of datasets?

AFAIU slim pajama is about 627B tokens, and Starcoder:

> approximately 250 Billion tokens.

Ed: I see TFA says:

> Combined Dataset Size - Around 950B tokens

> Total Tokens During Training - 3 trillion (slightly more than 3 epochs/1430k steps)

... but I'm not seeing how one becomes three? That's more like 1 trillion than 3 trillion tokens?

emikulic · on Sept 5, 2023

Three epochs means it sees each token three times. The dataset is ~1T like you said.

29athrowaway · on Sept 4, 2023

A tiny llama would be hard to distinguish from an alpaca.