ChatGPT is trained using a combination of supervised and unsupervised learning. For supervised learning, it is trained on a large dataset of human-generated text, such as dialogue data or online conversations. This allows it to learn the structure and style of natural language. For unsupervised learning, it is trained using a language modeling objective, which involves predicting the next word in a sequence of text. This allows it to learn the broader patterns and characteristics of language, and to generate text that is fluent and coherent.
ChatGPT and GPT-3 are both large language models trained by OpenAI, but they have some important differences. GPT-3 is a more general-purpose language model, which means it is trained on a broader range of data and can generate a wider range of responses. It is also much larger than ChatGPT, with 175 billion parameters compared to ChatGPT's 2.6 billion parameters. This makes GPT-3 more powerful and capable of generating more realistic and diverse text, but also makes it more expensive and resource-intensive to use.
In case you are curious, the above information was written entirely by ChatGPT when asked about itself.
ChatGPT is based on the InstructGPT weights, which are themselves based on the GPT-3 weights. As far as we can tell, it has roughly the same number of parameters as GPT-3.
The GPT-3 weights were obtained through unsupervised pre-training (hence GPT: generative pre-training): maximizing the likelihood the model assigns to the next word across a large dataset of human text.
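To make that objective concrete, here is a minimal sketch of next-word-prediction training in PyTorch. The tiny model, vocabulary size, and random token sequence are made up for illustration; the real GPT-3 uses a large transformer decoder trained on web-scale text.

```python
# Minimal sketch of the generative pre-training objective (next-token prediction).
# Toy sizes and random tokens stand in for a real transformer and real text.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32           # toy sizes, nothing like GPT-3's
model = nn.Sequential(                    # stand-in for a transformer decoder
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 16))    # one fake token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each next token

logits = model(inputs)                            # (batch, seq, vocab)
# Maximizing the likelihood of the next word == minimizing cross-entropy.
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()                                   # gradients for one update step
```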
The InstructGPT weights were obtained with supervised fine-tuning (SFT): the model generates text, and a human demonstrates a better completion (as described in the InstructGPT paper). Then, humans were also asked to rank multiple generated outputs, and those rankings were used as the supervised training target of a separate reward model. That small amount of ranked data unlocked the ability to score a much larger amount of generations through reinforcement learning with proximal policy optimization (PPO): the model generates an output, the reward model rates it, and the model weights are updated to achieve a higher reward.
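The sketch below illustrates the two losses involved, with made-up toy models and data: a pairwise ranking loss that teaches the reward model to score the human-preferred completion higher, and a clipped PPO-style update that pushes the policy toward higher-reward outputs. A real implementation (e.g. InstructGPT) works on full text sequences and adds a KL penalty, a value function, and many other details omitted here.

```python
# Simplified sketch of reward-model training and a PPO-style policy update.
# All models, shapes, and data are toy stand-ins for illustration only.
import torch
import torch.nn as nn

vocab_size, embed_dim, seq_len = 100, 32, 16

# 1) Reward model: trained on human rankings via a pairwise loss --
#    the preferred completion should receive a higher scalar score.
reward_model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                             nn.Flatten(),
                             nn.Linear(seq_len * embed_dim, 1))
preferred = torch.randint(0, vocab_size, (4, seq_len))   # fake "better" outputs
rejected = torch.randint(0, vocab_size, (4, seq_len))    # fake "worse" outputs
ranking_loss = -nn.functional.logsigmoid(
    reward_model(preferred) - reward_model(rejected)).mean()
ranking_loss.backward()

# 2) PPO-style update: the policy generated `actions`, the (frozen) reward model
#    scored them, and the clipped objective nudges the policy toward higher
#    reward without straying too far from the old policy.
policy = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                       nn.Linear(embed_dim, vocab_size))
states = torch.randint(0, vocab_size, (4, seq_len))      # fake contexts
actions = torch.randint(0, vocab_size, (4, seq_len))     # fake sampled tokens
with torch.no_grad():
    advantages = reward_model(actions)                    # crude advantage stand-in
    old_logp = policy(states).log_softmax(-1).gather(
        -1, actions.unsqueeze(-1)).squeeze(-1)

new_logp = policy(states).log_softmax(-1).gather(
    -1, actions.unsqueeze(-1)).squeeze(-1)
ratio = (new_logp - old_logp).exp()                       # (4, seq_len)
ppo_loss = -torch.min(ratio * advantages,
                      ratio.clamp(0.8, 1.2) * advantages).mean()
ppo_loss.backward()
```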
The ChatGPT beta weights were obtained by doing that again, but asking the humans to make the completions conversational. Since they could only pay a few humans, they opened the beta to ask a much wider range of people to provide SFT data through the feedback feature, which will be used to train the final ChatGPT weights.
So, all in all, the parameter estimation is incorrect, the order of the training steps is not right, the description of the purpose of the supervised learning step is wrong, the defining part of the ChatGPT training process is not mentioned (because the InstructGPT paper came out in 2022, after the knowledge cut-off), and the description of the difference with GPT-3 is misleading.