Claude Sonnet 3.5 has outperformed o1 in most tasks based on my own anecdotal assessment. So much so that I'm debating canceling my ChatGPT subscription. I literally just don't use it anymore, despite having been a heavy user for a long time.
Is a "reasoning" model really different? Or is it just clever prompting (and feeding previous outputs) for an existing model? Possibly with some RLHF reasoning examples?
Doesn't OpenAI lack a large enough database of reasoning texts to train a foundational LLM on? I thought such a dataset simply doesn't exist, since humans don't really write texts like this at scale.
It's trained via reinforcement learning on essentially unlimited synthetic reasoning data. You can generate unlimited reasoning data because there's an endless supply of math and coding problems that can be created with machine-checkable solutions, and a model can make endless different attempts at reasoning its way to the answer. It's similar to how models that learn chess by self-play have essentially unlimited training data.
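The generation side is simple to picture. Here's a minimal sketch of one simple version of that pipeline (verifier-filtered sampling; real RL setups are more involved), where model_attempt() is a hypothetical stand-in for sampling one chain-of-thought attempt from the model being trained:

    import random

    def make_problem():
        # Endless supply of problems with machine-checkable answers.
        a, b = random.randint(2, 999), random.randint(2, 999)
        return f"What is {a} * {b}?", a * b

    def collect_examples(model_attempt, n):
        kept = []
        while len(kept) < n:
            question, truth = make_problem()
            # model_attempt returns (reasoning_text, final_answer);
            # in practice you'd also cap the number of attempts.
            reasoning, answer = model_attempt(question)
            if answer == truth:  # exact check replaces a human grader
                kept.append((question, reasoning))
        return kept

Only the attempts that check out get kept, so the verifier, not a human annotator, is what makes the dataset effectively infinite.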
We don't know the specifics of o1 to judge, but we can look at an open-weights model for an example: Qwen-32B is the base model, QwQ-32B is the "reasoning" variant. You're broadly correct that the magic, such as it is, is in training the model into a long-winded CoT, but the improvements from it are massive. QwQ-32B beats larger 70B models in most tasks, and in some cases it beats Claude.
I just tried QwQ-32B; I didn't know about it before. I used it to generate some code that GPT had generated perfectly two days ago, without even breaking a sweat.
QwQ produced 10 pages of its reasoning steps, and the code is probably not correct. [1] includes both answers, from QwQ and from GPT.
Breaking down its reasoning steps into such excruciatingly detailed prose is certainly not user friendly, but it is intriguing. I wonder what the ideal use case for it would be.
To my understanding, Anthropic realizes they can't compete on name recognition yet, so they have to overdeliver on quality to win the war. It's hard to beat the incumbent, especially when "chatgpt'ing" is basically a well-understood verb.
They don't have a model that does o1-style "thought tokens" or is specialized for math, but Sonnet 3.6 is really strong in other ways. I'm guessing they'll have an o1-style model within six months if there's demand.