No, this isn't quite right. LLMs are trained in stages:
1. Pre-training. In this stage, the model is trained on a gigantic corpus of web documents, books, papers, etc., and the objective is to correctly predict the next token at each position of every training sample.
2. Supervised fine-tuning. In this stage, the model is shown examples of chat transcripts that are formatted with a chat template. The examples show a user asking a question and an assistant providing an answer. The training objective is the same as in #1: predict the next token of the training example correctly (a minimal sketch of this shared objective follows the list).
3. Reinforcement learning. Prior to R1, this has mainly taken the form of training a reward model on top of the LLM and using it to steer the model toward whole sequences that humans prefer (AI feedback is often used as a similar substitute). When OpenAI first published the technique (probably their last bit of interesting open research?), they were using PPO to optimize against that reward model. There are now a variety of alternatives, including methods like Direct Preference Optimization (DPO) that skip the separate reward model entirely and are simpler to run (see the second sketch below).
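To make #1 and #2 concrete, here's a minimal sketch of the shared next-token objective using Hugging Face transformers. The model name and the template string are just illustrative placeholders:

```python
# Minimal sketch of the next-token objective shared by stages 1 and 2.
# "gpt2" is a small stand-in model; any causal LM works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Stage 1 trains on raw text; stage 2 uses the same objective, but the text
# is a chat transcript rendered through a chat template, e.g.:
text = "<|user|>\nWhat is the capital of France?\n<|assistant|>\nParis."

inputs = tokenizer(text, return_tensors="pt")
# Passing labels=input_ids makes the model compute the standard causal-LM
# loss: cross-entropy between the logits at position t and the token at t+1.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```

And for the DPO method mentioned in #3, a rough sketch of its loss, which operates directly on pairs of preferred/rejected responses instead of training a separate reward model (the beta value and input shapes are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective (Rafailov et al., 2023). Inputs are the
    summed log-probs of whole chosen/rejected responses under the policy
    being trained and under a frozen reference copy of the SFT model."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Widen the margin between preferred and dispreferred responses;
    # no separate reward model and no PPO rollout loop needed.
    return -F.logsigmoid(chosen - rejected).mean()
```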
Stage 1 teaches the model to understand language and imparts world knowledge. Stage 2 teaches the model to act like an assistant. This is where the "magic" is. Stage 3 makes the model do a better job of being an assistant. The traditional analogy is that Stage 1 is the cake; Stage 2 is the frosting; and Stage 3 is the cherry on top.
R1-Zero departs from this "recipe" in that the reasoning magic comes from the reinforcement learning (stage 3). What DeepSeek showed is that, given a reward for producing a correct response, the model will learn to output chain-of-thought material on its own. It will, essentially, develop a chain-of-thought language that helps it accomplish the end goal. This is the most interesting part of the paper, IMO, and it's a result that's already been replicated on smaller base models.
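Notably, the reward in R1-Zero isn't a learned preference model at all; the paper uses simple rule-based checks (is the final answer right, is the reasoning wrapped in the expected tags). A toy sketch of what such a reward might look like — the exact weights and checks here are my assumptions, not lifted from the paper:

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy R1-Zero-style reward: no reward model, just verifiable checks.
    Weights and exact checks are illustrative assumptions."""
    reward = 0.0

    # Format reward: reasoning should appear inside <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1

    # Accuracy reward: the extracted final answer must match a verifiable
    # reference (e.g. a math answer or the result of running code tests).
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

# The RL loop samples many completions per prompt and nudges the policy
# toward the higher-reward ones; longer, structured chains of thought emerge
# because they tend to win the accuracy reward, not because anyone wrote them.
```

The key property is that both checks are cheap and automatically verifiable, which is presumably why this works so well for math and code and is harder to apply to open-ended tasks.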