thanks for doing this, honestly your writeup seems more valuable than the model weights lol
> But for what it's worth, my personal opinion is that LLaMA probably isn't OpenAI-grade -- there's a big difference between training a model in an academic setting vs when your entire company depends on it for wide-scale commercial success. I wasn't impressed that 30B didn't seem to know who Captain Picard was.
I'm new to benchmarking shenanigans, but how is it that Facebook was able to proclaim that it matched GPT-3 performance on presumably standard LLM benchmarks? Is there a good survey paper or blog post on how to think about known deficiencies in benchmarks?
Because loss != quality. This was one of the most counterintuitive discoveries in ML for me. People treat the two as interchangeable, and to a certain extent — a controlled extent — they are.
But if your dataset doesn't include a word about Captain Picard, no amount of training will teach the model about the USS Enterprise. Yet your loss will still reach that magical 2.1 value given enough time. (2.1 is pretty much "excellent" quality; below that, you're probably overfitting and need a bigger dataset.)
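To put a number like 2.1 in perspective (assuming, as most training frameworks do, that "loss" means mean next-token cross-entropy in nats), it maps directly to perplexity:

    import math

    # Assumption: "loss" is the mean next-token cross-entropy in nats,
    # which is what most LM training frameworks report.
    loss = 2.1
    perplexity = math.exp(loss)
    print(f"loss {loss} -> perplexity {perplexity:.1f}")  # ~8.2
    # i.e. the model is, on average, about as uncertain as a uniform
    # choice over ~8 tokens. Nothing in that number says which facts
    # the training data covered (e.g. whether Picard ever appeared in it).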
Thanks for the comment, friendo. I wasn't sure if this would get any attention at all, but that made it worth it. Feel free to DM me on Twitter if you'd like to chat about anything ML-related: basic questions are one of my favorite things to help with.
Loss is a training-time measurement based on performance on the training objective.
The training objective is rarely the same as the end-user task being benchmarked.
For example, language models are classically trained on next-token prediction. The closest benchmark for that is perplexity[1], often reported on the WikiText-103 dataset.
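As a rough sketch of how perplexity is computed in practice (the model name and text here are just placeholders, assuming a Hugging Face causal LM):

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder model; any causal LM from the Hub works the same way.
    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    text = "The USS Enterprise is commanded by Captain Jean-Luc Picard."
    enc = tok(text, return_tensors="pt")

    with torch.no_grad():
        # Passing labels makes the model return the mean next-token
        # cross-entropy over the sequence (in nats).
        loss = model(**enc, labels=enc["input_ids"]).loss

    print("loss:", loss.item())
    print("perplexity:", math.exp(loss.item()))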
Until around 2019 perplexity was often reported, but since then most large language model papers have moved on to more useful benchmarks, such as question-answering or embedding performance.
Unfortunately there aren't great benchmarks (yet?) for generative tasks. Quality is quite hard to measure systematically (see, e.g., the issues with BLEU in summarization benchmarks).
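To make the BLEU point concrete, here's a toy illustration with made-up sentences (using the sacrebleu package): a perfectly good paraphrase scores low because BLEU only counts n-gram overlap with the reference, while a fluent-but-wrong sentence that reuses the reference's wording scores higher.

    import sacrebleu  # pip install sacrebleu

    reference = ["Captain Picard commands the USS Enterprise."]

    # A correct paraphrase with little word overlap with the reference...
    paraphrase = "The Enterprise is under Picard's command."
    # ...and a fluent but factually wrong sentence that copies its wording.
    wrong = "Captain Picard commands the USS Voyager."

    print(sacrebleu.sentence_bleu(paraphrase, reference).score)  # low
    print(sacrebleu.sentence_bleu(wrong, reference).score)       # much higher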
Because there are many benchmarks that measure different things.
You need to look at the benchmark that reflects your specific interest.
So in this case ("I wasn't impressed that 30B didn't seem to know who Captain Picard was") the closest relevant benchmark they ran is MMLU (Massive Multitask Language Understanding)[1].
In the LLaMA paper they report 63.4% for the 5-shot average setting on the 65B model without fine-tuning, and 68.9% after fine-tuning. This is significantly better than the original GPT-3 (43.9% under the same conditions), but as they note:
> "[it is] still far from the state-of-the-art, that is 77.4 for GPT code-davinci-002 on MMLU (numbers taken from Iyer et al. (2022))"
InstructGPT[2] (which OpenAI points to as the most relevant ChatGPT publication) doesn't report MMLU performance.
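For a sense of what "5-shot" means here, a minimal sketch of how an MMLU-style question can be scored (the questions, the tiny gpt2 model, and the letter-logit trick are all placeholders for the real eval harness):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Made-up examples in the MMLU format: question, four choices, answer.
    # In the real 5-shot setting there are five worked examples per subject.
    shots = [
        ("Which particle carries a negative charge?",
         ["Proton", "Neutron", "Electron", "Photon"], "C"),
    ]
    test_q = ("Who commands the USS Enterprise-D?",
              ["James Kirk", "Jean-Luc Picard", "Kathryn Janeway", "Ben Sisko"])

    def fmt(q, choices, answer=None):
        s = q + "\n"
        for letter, choice in zip("ABCD", choices):
            s += f"{letter}. {choice}\n"
        return s + "Answer:" + (f" {answer}\n\n" if answer else "")

    prompt = "".join(fmt(*shot) for shot in shots) + fmt(*test_q)

    # Placeholder model; the papers evaluate GPT-3 / LLaMA, not gpt2.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]

    # Score each answer letter by the logit of its (first) token, pick the max.
    scores = {l: logits[tok.encode(" " + l)[0]].item() for l in "ABCD"}
    print(max(scores, key=scores.get), scores)

The reported accuracy is just the fraction of test questions where the highest-scoring letter matches the ground truth.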
The capability of a language model I care about most is probably its ability to represent or simulate Captain Picard: being good at creative tasks generally, but also at Captain Picard specifically. Is OpenAI deliberately doing something different that makes their models better at this, or is it just that OpenAI has a lot more copyrighted data in their dataset? Skimming the MMLU section of the Facebook paper just now, the latter seems to be what the Facebook folks think:
"A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, that sums up to only 177GB, while these models were trained on up to 2TB of books. This large quantity of books used by Gopher, Chinchilla and PaLM may also explain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other benchmarks."
It's unclear exactly why it doesn't work as well for you.
I have two comments that may be useful:
1) It's very unclear how good the generative capabilities of LLaMA are generally. It benchmarks well for code generation, but for English there aren't really any good benchmarks. There's a good chance the larger model performs much better, since generative ability seems to be partially emergent.
2) If you just want to "make it work", I'd suggest downloading all the Star Trek scripts you can that include Captain Picard and fine-tuning LLaMA on them (rough sketch below). It's unclear how well this will work, but it's probably about as good as you can get.
If you care about this problem deeply, it's probably worth trying the same with some of the other open GPT-3-class models (GPT-J, GPT-NeoX, etc.).
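A rough sketch of what that fine-tune could look like, assuming the LLaMA weights have been converted to a local Hugging Face checkpoint and the scripts are plain .txt files (the path, hyperparameters, and the LoRA choice are my assumptions, not anything from the paper):

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    # Assumptions: converted HF-format LLaMA weights at this path, and the
    # collected Picard scripts as plain text files in star_trek_scripts/.
    base = "./llama-7b-hf"
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # LoRA keeps the tune cheap; full fine-tuning of 30B/65B needs far more VRAM.
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM"))

    ds = load_dataset("text", data_files={"train": "star_trek_scripts/*.txt"})
    ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments("picard-lora", per_device_train_batch_size=1,
                               gradient_accumulation_steps=16,
                               num_train_epochs=1, learning_rate=1e-4),
        train_dataset=ds["train"],
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()

After training, prompt the LoRA-wrapped model with some Picard-style dialogue to see whether the extra data actually helped.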
You can read the original LLaMA paper, which is pretty accessible[1]. For example, they claim to outperform GPT-3 on the HellaSwag benchmark (finishing sentences); you can find examples of unfinished sentences in the HellaSwag paper[2] on page 13. Unfortunately for LLaMA, most people will probably just be asking questions about Captain Picard and so on, and on knowledge-heavy benchmarks like MMLU, LLaMA significantly underperforms OpenAI's latest models (that's from their paper).
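For a concrete sense of the task, here's a made-up item in the HellaSwag format (not one of the actual examples from the paper): the model has to pick the one ending that makes sense.

    # Made-up item in the HellaSwag style: pick the most plausible ending.
    item = {
        "context": "A man is standing on a ladder next to a house. He",
        "endings": [
            "dips the brush in the paint and starts on the trim.",  # plausible
            "swims across the driveway to reach the roof.",
            "folds the ladder up while still standing on it.",
            "eats the paint can and climbs back down.",
        ],
        "label": 0,
    }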