If you're already in the habit of breaking problems down into Sonnet-sized pieces, you won't see a benefit. The win is that o1 pro lets you stop breaking things down one level higher up than you're used to.
It may also have a larger usable context window, though I'm not totally sure about that.
> lets you stop breaking down one level up from what you're used to.
Can you give an example of what you mean by this? I write very verbose prompts where I already know what needs to be done and just let the AI “do” the work. I'm curious how this is different.
Sonnet 3.7 and O1 Pro both have 200K context windows. But O1 Pro has a 100K output window, and Sonnet 3.7 has a 128K output window. Point for Sonnet.
I routinely put 100K+ tokens of context into Sonnet 3.7 in the form of source code, and in Extended mode, given the right prompt, it will output perhaps 20 large source files before I have to make a "continue" request (for example, if it's asked to convert a web app from templates to React).
I'm curious whether O1 Pro actually exceeds Sonnet 3.7 in Extended mode for coding or not. Looking forward to seeing some benchmarks.
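For anyone curious what that packing workflow looks like in code, here is a rough sketch, not an exact recipe: it assumes the official `anthropic` Python SDK, and the model alias, output budget, and the simple "continue" loop are all placeholders.

```python
# Sketch: pack source files into one large prompt and keep asking Claude to
# continue until it finishes. Model alias and token budget are assumptions.
import pathlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def pack_sources(root: str, exts=(".py", ".html", ".js")) -> str:
    """Concatenate source files under `root`, each prefixed with its path."""
    parts = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"\n===== {path} =====\n{path.read_text(errors='ignore')}")
    return "".join(parts)

task = "Convert this web app from server-rendered templates to React. Output complete files."
messages = [{"role": "user", "content": task + "\n" + pack_sources("my_app")}]

while True:
    resp = client.messages.create(
        model="claude-3-7-sonnet-latest",  # assumed model alias
        max_tokens=64000,                  # assumed output budget
        messages=messages,
    )
    text = "".join(block.text for block in resp.content if block.type == "text")
    print(text)
    if resp.stop_reason != "max_tokens":   # finished without being cut off
        break
    # Ask the model to pick up where it left off.
    messages += [{"role": "assistant", "content": text},
                 {"role": "user", "content": "continue"}]
```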
I am very curious how 3.7 and o1 pro perform in this regard:
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
Has anyone ever tried to restructure a ~10K-token text? For example, organizing a 45min-1hr interview transcript in a structured way without losing any of the detailed numbers, facts, or supporting evidence. I find that none of OpenAI's models is capable of this task: they keep trying to summarize and omit details. Such a task doesn't seem to require much intelligence, but surprisingly, OpenAI's "large"-context models can't manage it.
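A common workaround is to chunk the transcript and tell the model explicitly that nothing may be omitted. Below is a minimal sketch assuming the OpenAI Python SDK; the model name, chunk size, and prompt wording are placeholders, not a claim that this fully solves the problem.

```python
# Sketch: restructure a long transcript in chunks so the model is never asked
# to compress more than it can hold onto. All names/limits here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "Reorganize the interview excerpt under clear topic headings. "
    "Do NOT summarize: every number, fact, name, and piece of supporting "
    "evidence in the input must appear in the output."
)

def restructure(transcript: str, chunk_chars: int = 12000) -> str:
    chunks = [transcript[i:i + chunk_chars]
              for i in range(0, len(transcript), chunk_chars)]
    sections = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": chunk}],
        )
        sections.append(resp.choices[0].message.content)
    return "\n\n".join(sections)
```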
There were actually almost no benchmarks for o1 pro before, because it wasn't available on the API. o1 pro is a different model from o1 (yes, even from o1 with high reasoning effort).
I regularly push 100k+ tokens into it, so most of my code base, or at least large portions of it. I use the Repo Prompt product to construct the code prompts. It finds bugs and solutions at a far better rate than other models. I also dictate my problem into the prompt, and find that spoken language is interpreted very well.
I also frequently download the full source code of libraries I'm debugging and, when running into issues, pass that code in along with my own broken code. It's very good at this.
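For illustration, pulling a library's source next to your own failing code can be scripted along these lines; the package name, paths, and size cap are placeholders, and it assumes `pip download` and `tar` are available.

```python
# Sketch: fetch a dependency's source and bundle it with your own failing code
# into one prompt. Package name, paths, and limits are placeholders.
import pathlib
import subprocess

def fetch_library_source(package: str, dest: str = "lib_src") -> pathlib.Path:
    """Download the sdist of `package` and unpack it under `dest`."""
    subprocess.run(["pip", "download", "--no-binary", ":all:", "--no-deps",
                    "-d", dest, package], check=True)
    archive = next(pathlib.Path(dest).glob("*.tar.gz"))
    subprocess.run(["tar", "-xzf", str(archive), "-C", dest], check=True)
    return pathlib.Path(dest)

def bundle(paths, max_chars: int = 400_000) -> str:
    """Concatenate files with path headers, stopping at a rough size cap."""
    out, total = [], 0
    for p in paths:
        text = p.read_text(errors="ignore")
        total += len(text)
        if total > max_chars:
            break
        out.append(f"\n===== {p} =====\n{text}")
    return "".join(out)

lib_dir = fetch_library_source("somepackage")  # hypothetical package name
prompt = ("Here is the library source and my code that breaks against it. "
          "Find the bug.\n"
          + bundle(sorted(lib_dir.rglob("*.py")))
          + bundle(sorted(pathlib.Path("my_project").rglob("*.py"))))
```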
How long is its thinking time compared to o1?
The naming would suggest that o1-pro is just o1 with more time to reason, but the API pricing makes that less obvious. Are they charging for the thinking tokens? If so, why is it so much more expensive, if it's just more thinking tokens anyway?
I think o1 pro runs multiple instances of o1 in parallel and selects the best answer, or something of the sort. And you do always pay for thinking tokens with all providers, OpenAI included. It's especially interesting when you remember that OpenAI hides the CoT from you, so you're effectively being billed for "thinking" you can't even read.
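Purely to illustrate what that kind of best-of-n setup would look like from the outside (this mirrors the speculation above, not any documented o1 pro internals), here is a sketch with the OpenAI SDK; model names, the value of n, and the grading prompt are all placeholders.

```python
# Sketch of a best-of-n setup: run several o1 calls in parallel and let a
# grader pick the strongest answer. Speculative illustration only.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="o1",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def best_of_n(question: str, n: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(ask, [question] * n))
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    grade = client.chat.completions.create(
        model="o1",  # placeholder grader
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
                   "Reply with only the index of the best answer."}],
    )
    try:
        idx = int(grade.choices[0].message.content.strip())
    except ValueError:
        idx = 0  # fall back to the first candidate if the grader is chatty
    return candidates[idx]
```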
I don't have the answers for you; I just know that if they charged $400 a month I would pay it. It seems like a different model to me. I never use o3-mini or o3-mini-high, just GPT-4o or o1 pro.