
Comparing loss between different training runs and hyperparameters isn't very accurate. LLaMA's loss metrics don't really match Chinchilla's, for instance; it went below the minimum possible loss stated by Chinchilla.

More importantly, these models are extremely sensitive to loss. Going from 2.0 to 1.8 might not seem like much, but it's a huge gain in performance.

GPT-2's loss was 2.57; GPT-3's was 2.0.

And there is plenty of training data left. Perhaps not easily accessible, but it's there.



True that a scaling law only applies to models within a family, which allows some but not full choice of hyperparameters. And most of the minimum loss is just due to the inherent unpredictability of language, so 2.0 vs 1.8 bits should really be thought of as (say) 0.3 vs 0.1 bits plus an irrelevant 1.7 bits of randomness.
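The arithmetic behind that point can be sketched in a few lines. The 1.7-bit irreducible entropy is an illustrative assumption, not a measured value:

```python
# Sketch: why a drop from 2.0 to 1.8 bits/token is a bigger deal than it
# looks, assuming (hypothetically) that 1.7 bits/token of the loss is the
# irreducible entropy of natural language that no model can remove.
IRREDUCIBLE = 1.7  # bits/token, illustrative assumption

def reducible(loss_bits: float) -> float:
    """The part of the loss a model could in principle still eliminate."""
    return loss_bits - IRREDUCIBLE

old, new = reducible(2.0), reducible(1.8)
print(round(old, 2), "->", round(new, 2))   # the "real" gap: ~0.3 vs ~0.1
print(round(old / new))                     # roughly a 3x reduction, not 10%
```

On the headline numbers the improvement looks like 10%, but on the reducible part it is roughly a factor of three, which is why small loss deltas correspond to large capability jumps.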

I hadn't actually looked at the LLaMA paper; that's an interesting note. However, AFAICT GPT-3, LLaMA and Chinchilla do not use the same tokenizer, so their losses are not comparable. GPT-2 and GPT-3 use the same custom BPE tokenizer. LLaMA uses SentencePiece, but that generates a vocabulary specific to the training data it's run on. Chinchilla used "a slightly modified SentencePiece tokenizer that does not apply NFKC normalisation. The vocabulary is very similar: 94.15% of tokens are the same as those used for training Gopher".
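One common way around the tokenizer mismatch is to convert per-token losses into bits per UTF-8 byte, using each tokenizer's average tokens-per-byte measured on a shared corpus. A minimal sketch; the tokens-per-byte figures below are hypothetical placeholders, not the real GPT-3/LLaMA/Chinchilla values:

```python
import math

def bits_per_byte(loss_nats_per_token: float, tokens_per_byte: float) -> float:
    """Convert a cross-entropy loss in nats/token to bits/byte.

    tokens_per_byte should be measured for each tokenizer on the same
    evaluation text, so the byte count is a shared denominator.
    """
    return loss_nats_per_token * tokens_per_byte / math.log(2)

# Two hypothetical models: a lower per-token loss with a finer tokenizer
# can still mean a worse (or similar) bits-per-byte rate.
print(bits_per_byte(2.0, 0.25))  # coarser tokenizer: fewer tokens per byte
print(bits_per_byte(1.8, 0.30))  # finer tokenizer: more tokens per byte
```

Normalizing to a tokenizer-independent unit like this is what makes cross-family loss comparisons meaningful at all.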

Even if there is a lot more text available, that doesn't mean it's good training material, and the better free sources are already used. E.g. LLaMA was trained on the 64% of GitHub that had a compatible license (and you're not going to gather much more source code than that), all the free book texts they could find, all of arXiv, all English pages in CommonCrawl that classified as "reference" quality, etc. arXiv, for example, isn't all scientific papers ever, but it's a large fraction of them. All private emails stored by a large email service would probably be one of the biggest untapped valuable sources.


What do these numbers mean? For example, for Google isn't loss == 0? But that does not make Google a superintelligence.




