"Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613."
Cool and impressive. I'm curious if this training method will become more common.
"We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called Self-Rewarding Language Models, which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model. While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models."
I kind of disagree. It's not "user friendly", but it is very descriptive. They are codenames, after all. Take "dolphin-2.6-mistral-7b-dpo-laser" for instance: with a little LLM background knowledge, just from the name you know it is a 7-billion-parameter model based on Mistral, trained on a dataset filtered to remove alignment and bias (dolphin), version 2.6, and improved using the techniques described in the Direct Preference Optimization (https://arxiv.org/pdf/2305.18290.pdf) and LASER (https://arxiv.org/pdf/2312.13558.pdf) papers.
Thank you for a great and informative explanation despite my somewhat ignorant take.
I'm an occasional visitor to huggingface, so I'm actually superficially familiar with the taxonomy. I just felt like, even if I tried to satirize it, I wouldn't be able to come up with a crazier name. And that's not even the end of the Cambrian explosion of LLMs.