
If it really is reinforcement learning as they claim, there may be no direct supervision on the "thinking" section of the output, only on the final answer.

Just as with Chess or Go: you don't train a supervised model by showing it the exact move to make in each position; you use RL techniques to learn which moves are good based on the end result of the game.

In practice, there probably is some supervision to enforce good style and methodology. But the key point is that the model can learn good reasoning without (many) human examples and find strategies for solving new problems through self-learning.
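
To make the distinction concrete, here is a toy sketch of outcome-only RL in Python (purely illustrative; the arithmetic question, the fixed candidate answers, and the crude REINFORCE-style update are my own assumptions, not anything disclosed about the actual training setup). The policy emits free-form "thinking" plus a final answer, and only the answer ever receives a reward:

    import random

    # Toy outcome-only RL sketch (hypothetical): the policy is just a table of
    # weights over candidate answers for one question. The "thinking" text is
    # produced but never scored directly; only the final answer earns reward,
    # and the policy is nudged toward answers that were rewarded.
    QUESTION = "2 + 3 * 4"
    CANDIDATES = ["14", "20", "24"]
    CORRECT = "14"

    weights = {a: 1.0 for a in CANDIDATES}  # start uniform

    def sample_answer():
        # sample an answer in proportion to its current weight
        total = sum(weights.values())
        r = random.uniform(0, total)
        for a, w in weights.items():
            r -= w
            if r <= 0:
                return a
        return CANDIDATES[-1]

    def rollout():
        # the "thinking" section: free-form scratch work, never graded
        thinking = f"Let me work out {QUESTION} step by step..."
        answer = sample_answer()
        return thinking, answer

    LEARNING_RATE = 0.5
    for _ in range(200):
        thinking, answer = rollout()
        reward = 1.0 if answer == CORRECT else 0.0  # reward depends only on the answer
        weights[answer] += LEARNING_RATE * reward   # reinforce rewarded answers

    print(weights)  # mass concentrates on "14" without ever grading the thinking

After a couple hundred rollouts the weight concentrates on the correct answer even though the "thinking" text was never evaluated. The real systems presumably use a learned token-level policy and far richer reward signals, but the credit-assignment idea (supervise the outcome, not the reasoning) is the same.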

If that is the case, it is indeed an important breakthrough.


