Self-rewarding-lm-PyTorch: Self-Rewarding Language Model from MetaAI

starbugs · on Jan 25, 2024

Sorry if this is a dumb question, but how does that make sure that the training process is not going into the wrong direction because of error accumulation?

Maybe I didn't understand something fundamental here. (Not an LLM expert.)

huac · on Jan 25, 2024

I don't think it does. And there is a pretty big risk that you end up picking up on some quirk ("bias") of your reward model that doesn't reflect reality -- GPT4 preferring longer answers is one such commonly observed bias. AFAIK there is not a great theoretical basis for why we can avoid mode collapse, except empirically the models are good enough to survive some bootstrapping.

candiodari · on Jan 25, 2024

It doesn't.

I would like to add, there's plenty of examples, some in math (e.g. geometry) playing out over >1000 years and dozens of generations, of the same happening in humans.

That said, for both humans and this kind of LLMs, it does appear to improve performance, certainly in the near term.

starbugs · on Jan 25, 2024

Certainly there’s this problem in many areas.

I was just wondering how big of a deal that might be in this case. Just had another of those experiences with GPT4 where it goes into a contradictory loop it cannot recover from.

It seems there might be a big difference between long term cycles and short term severe degradation as in inbreeding and this paper‘s abstract sounded a bit like that to me.

If the results indicate improved performance, then it doesn’t seem to be that big of a deal (yet?).

Thanks for the explanation!

potatoman22 · on Jan 25, 2024

"Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613."

Cool and impressive. I'm curious if this training method will become more common.

lhl · on Jan 25, 2024

A new 7B model, Snorkel-Mistral-PairRM-DPO, using a similar self-rewarding pipeline was just released:

* Announcement: https://twitter.com/billyuchenlin/status/1749975138307825933

* Model Card: https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO

* Response Re-Ranker: https://huggingface.co/llm-blender/PairRM

"We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called Self-Rewarding Language Models, which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model. While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models."

spidersenses · on Jan 25, 2024

>Snorkel-Mistral-PairRM-DPO

The naming of these models is getting ridiculous...

column · on Jan 25, 2024

I kind of disagree. It's not "user friendly" but it is very descriptive. They are codenames afterall. Take "dolphin-2.6-mistral-7b-dpo-laser" for instance : with a little LLM background knowledge, just from the name you know it is a 7 billion parameters model based on Mistral, with a filtered dataset to remove alignment and bias (dolphin), version 2.6 and using the techniques described in the Direct Preference Optimization (https://arxiv.org/pdf/2305.18290.pdf) and Laser (https://arxiv.org/pdf/2312.13558.pdf) papers to improve its output.

salad-tycoon · on Jan 25, 2024

And here I was thinking they were somehow using the first three words from my Bitcoin wallet.

spidersenses · on Jan 25, 2024

Thank you for a great and informative explanation despite my somewhat ignorant take.

I'm an occasional visitor to huggingface, so I'm actually superficially familiar with the taxonomy. I just felt like, even if I tried to satirize it, I wouldn't be able to come up with a crazier name. And that's not even the end of the Cambrian explosion of LLMs.

Loic · on Jan 25, 2024

A bit like the User-Agent string.

mark_l_watson · on Jan 25, 2024

Thanks for this, and the link you provided below for GGUF files! I just cleared my schedule this afternoon to kick the tires.

azinman2 · on Jan 25, 2024

I assume this doesn’t yet run on llama.cpp?

lhl · on Jan 25, 2024

here are some GGUFs https://huggingface.co/brittlewis12/Snorkel-Mistral-PairRM-D...

tarruda · on Jan 25, 2024

It is based on Mistral which llama.cpp supports, so I assume it does run (you might need to convert to GGUF format and quantize it).

greyface- · on Jan 25, 2024

From 4 days ago, the paper that this implementation is based on: https://news.ycombinator.com/item?id=39051279

dang · on Jan 25, 2024

Thanks! Macroexpanded:

Self-Rewarding Language Models - https://news.ycombinator.com/item?id=39051279 - Jan 2024 (58 comments)

frogamel · on Jan 25, 2024

I might be miunderstanding something here, but what complexity here is resolved by making this a framework? Isnt this just:

1. Train model like normal

2. Evaluate model using self

3. Use eval results for DPO finetune

lucidrains · on Jan 25, 2024

No, you aren't wrong. For ML people, it is quite simple and hopefully the final code reflects that

The aim is really to give a good base for follow up research / modifications, which I think there will be many for this paper

lucidrains · on Jan 25, 2024

hey, appreciate the interest! repo is not done yet, but probably will be around month's end

dannyw · on Jan 25, 2024

Hey lucidrains! Epicmafia was so much fun in its glory days :)

lucidrains · on Jan 25, 2024

Lol, hey Danny. Indeed it was great times

I may bring it back, rebuilt in rails and svelte

swyx · on Jan 25, 2024

> svelte

based. I helped start Svelte Society. please let me know if you need anything from the Svelte community, not sure how far along you are in the rebuild. prob can get a few volunteers for you.

RockRobotRock · on Jan 25, 2024

EpicMafia was seriously amazing. 13 years later, I still have friends I met on that game.

Thank you!

lucidrains · on Jan 25, 2024

Yeah I know

I've been to weddings from players who met on the site. It was magical

choppaface · on Jan 25, 2024

did you get training compute from HF or thru a16z e.g. andromeda or some private cluster?

greatpostman · on Jan 25, 2024

Meanwhile google still hasn’t released anything substantial

code51 · on Jan 25, 2024

Singular focus on AlpacaEval feels a bit limiting to validate the gains.

What's the evidence here that this is not just a kind of leaderboard hacking for LLMs?

nmitchko · on Jan 25, 2024

Great work, will try this tonight.

Only question, why do you name variables with the λ symbol?

lucidrains · on Jan 25, 2024

just to better match the math equations in the SPIN paper