My understanding is that there’s no RLHF model sitting on top of ChatGPT. There’s a separate reward model during training, but that reward model is only used to fine-tune the decoder model. This is sort of confusing in the literature, because they switch from calling it a “language model” to calling it a “policy” once they’ve applied PPO, but if you trace the citations back to [1], you can see that the policy is initialized with a pretrained language model (in that case, GPT-2).
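
To make that concrete, here’s a rough sketch of the KL-penalized reward from [1] that PPO actually maximizes. The variable names, the beta value, and the numbers are placeholders I made up for illustration, not anything from a real training run:

    import torch

    beta = 0.02                               # KL coefficient (made-up value)
    reward_model_score = torch.tensor(1.3)    # r(x, y) from the learned reward model
    logprob_policy = torch.tensor(-40.0)      # log pi(y|x) under the LM being tuned
    logprob_pretrained = torch.tensor(-41.5)  # log rho(y|x) under the frozen pretrained LM

    # Per-sample reward fed to PPO: reward-model score minus a KL-style penalty
    # that keeps the tuned model close to the original language model.
    ppo_reward = reward_model_score - beta * (logprob_policy - logprob_pretrained)
    print(ppo_reward)  # tensor(1.2700)

The “policy” here is just the fine-tuned copy of the language model itself; the reward model only supplies the score during training and isn’t part of the deployed system.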
What I don’t know is how they’re applying RLHF after user fine-tuning. Are they redoing the PPO with the original reward model after tuning on your input? Are they just letting it slide, and hoping that the fine-tuning doesn’t cause the model to forget the RLHF? It’s unclear from what I’ve read.
[1] https://arxiv.org/pdf/1909.08593.pdf