If you're already in the habit of breaking problems down into Sonnet-sized pieces, you won't see a benefit. The win is that o1 pro lets you stop breaking things down one level higher up than you're used to.
It may also have a larger usable context window, though I'm not totally sure about that.
> lets you stop breaking down one level up from what you're used to.
Can you give an example of what you mean by this? I write very verbose prompts where I already know what needs to be done and just let the AI “do” the work. I'm curious how this is different.
Sonnet 3.7 and O1 Pro both have 200K context windows. But O1 Pro has a 100K output window, and Sonnet 3.7 has a 128K output window. Point for Sonnet.
I routinely put 100K+ tokens of context into Sonnet 3.7 in the form of source code, and in Extended mode, given the right prompt, it will output perhaps 20 large source files before I have to make a "continue" request (for example, if it's asked to convert a web app from templates to React).
I'm curious whether O1 Pro actually exceeds Sonnet 3.7 in Extended mode for coding or not. Looking forward to seeing some benchmarks.
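For anyone curious what that packing workflow looks like in code, here is a rough sketch, not an exact recipe: it assumes the official `anthropic` Python SDK, and the model alias, output budget, and the simple "continue" loop are all placeholders.

```python
# Sketch: pack source files into one large prompt and keep asking Claude to
# continue until it finishes. Model alias and token budget are assumptions.
import pathlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def pack_sources(root: str, exts=(".py", ".html", ".js")) -> str:
    """Concatenate source files under `root`, each prefixed with its path."""
    parts = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"\n===== {path} =====\n{path.read_text(errors='ignore')}")
    return "".join(parts)

task = "Convert this web app from server-rendered templates to React. Output complete files."
messages = [{"role": "user", "content": task + "\n" + pack_sources("my_app")}]

while True:
    resp = client.messages.create(
        model="claude-3-7-sonnet-latest",  # assumed model alias
        max_tokens=64000,                  # assumed output budget
        messages=messages,
    )
    text = "".join(block.text for block in resp.content if block.type == "text")
    print(text)
    if resp.stop_reason != "max_tokens":   # finished without being cut off
        break
    # Ask the model to pick up where it left off.
    messages += [{"role": "assistant", "content": text},
                 {"role": "user", "content": "continue"}]
```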
I am very curious how 3.7 and o1 pro perform in this regard:
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
Has anyone ever tried to restructure a ~10K-token text? For example, organizing a 45min-1hr interview transcript in a structured way without losing any of the detailed numbers, facts, or supporting evidence. I find that none of OpenAI's models is capable of this task: they keep trying to summarize and omit details. Such a task doesn't seem to require much intelligence, but surprisingly, OpenAI's "large"-context models can't manage it.
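A common workaround is to chunk the transcript and tell the model explicitly that nothing may be omitted. Below is a minimal sketch assuming the OpenAI Python SDK; the model name, chunk size, and prompt wording are placeholders, not a claim that this fully solves the problem.

```python
# Sketch: restructure a long transcript in chunks so the model is never asked
# to compress more than it can hold onto. All names/limits here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "Reorganize the interview excerpt under clear topic headings. "
    "Do NOT summarize: every number, fact, name, and piece of supporting "
    "evidence in the input must appear in the output."
)

def restructure(transcript: str, chunk_chars: int = 12000) -> str:
    chunks = [transcript[i:i + chunk_chars]
              for i in range(0, len(transcript), chunk_chars)]
    sections = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": chunk}],
        )
        sections.append(resp.choices[0].message.content)
    return "\n\n".join(sections)
```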
There were actually almost no benchmarks for o1 pro before, because it wasn't available on the API. o1 pro is a different model from o1 (yes, even from o1 with high reasoning effort).
I regularly push 100k+ tokens into it, so most of my code base, or at least large portions of it. I use the Repo Prompt product to construct the code prompts. It finds bugs and solutions at a far better rate than other models. I also dictate my problem into the prompt, and find that spoken language is interpreted very well.
I also frequently download the full source code of libraries I'm debugging and, when running into issues, pass that code in along with my own broken code. It's very good at this.
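For illustration, pulling a library's source next to your own failing code can be scripted along these lines; the package name, paths, and size cap are placeholders, and it assumes `pip download` and `tar` are available.

```python
# Sketch: fetch a dependency's source and bundle it with your own failing code
# into one prompt. Package name, paths, and limits are placeholders.
import pathlib
import subprocess

def fetch_library_source(package: str, dest: str = "lib_src") -> pathlib.Path:
    """Download the sdist of `package` and unpack it under `dest`."""
    subprocess.run(["pip", "download", "--no-binary", ":all:", "--no-deps",
                    "-d", dest, package], check=True)
    archive = next(pathlib.Path(dest).glob("*.tar.gz"))
    subprocess.run(["tar", "-xzf", str(archive), "-C", dest], check=True)
    return pathlib.Path(dest)

def bundle(paths, max_chars: int = 400_000) -> str:
    """Concatenate files with path headers, stopping at a rough size cap."""
    out, total = [], 0
    for p in paths:
        text = p.read_text(errors="ignore")
        total += len(text)
        if total > max_chars:
            break
        out.append(f"\n===== {p} =====\n{text}")
    return "".join(out)

lib_dir = fetch_library_source("somepackage")  # hypothetical package name
prompt = ("Here is the library source and my code that breaks against it. "
          "Find the bug.\n"
          + bundle(sorted(lib_dir.rglob("*.py")))
          + bundle(sorted(pathlib.Path("my_project").rglob("*.py"))))
```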
How long is its thinking time compared to o1?
The naming would suggest that o1-pro is just o1 with more time to reason, but the API pricing makes that less obvious. Are they charging for the thinking tokens? If so, why is it so much more expensive, if it's just more thinking tokens anyway?
I think o1 pro runs multiple instances of o1 in parallel and selects the best answer, or something of the sort. And you do always pay for thinking tokens with all providers, OpenAI included. It's especially interesting when you remember that OpenAI hides the CoT from you, so you're effectively being billed for "thinking" you can't even read.
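Purely to illustrate what that kind of best-of-n setup would look like from the outside (this mirrors the speculation above, not any documented o1 pro internals), here is a sketch with the OpenAI SDK; model names, the value of n, and the grading prompt are all placeholders.

```python
# Sketch of a best-of-n setup: run several o1 calls in parallel and let a
# grader pick the strongest answer. Speculative illustration only.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="o1",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def best_of_n(question: str, n: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(ask, [question] * n))
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    grade = client.chat.completions.create(
        model="o1",  # placeholder grader
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
                   "Reply with only the index of the best answer."}],
    )
    try:
        idx = int(grade.choices[0].message.content.strip())
    except ValueError:
        idx = 0  # fall back to the first candidate if the grader is chatty
    return candidates[idx]
```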
I don't have the answers for you; I just know that if they charged $400 a month I would pay it. It seems like a different model to me. I never use o3-mini or o3-mini-high, just GPT-4o or o1 pro.