
Also getting a perfect score on AIME (math) is pretty cool.

Tongue in cheek: if we progress linearly from here, software engineering as defined by SWE-bench is solved in 23 months.
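For concreteness, here's that extrapolation as a tiny Python sketch. The current score and per-month gain below are made-up illustrations, not numbers from any announcement; they're just chosen so the gap closes in roughly 23 months.

    # Hypothetical linear extrapolation: months until a linear trend hits the ceiling.
    def months_to_saturation(current: float, gain_per_month: float, ceiling: float = 100.0) -> float:
        """Remaining gap divided by the assumed per-month gain."""
        return (ceiling - current) / gain_per_month

    # Example with assumed numbers: 77% today, gaining ~1 point/month -> 23 months.
    print(months_to_saturation(current=77.0, gain_per_month=1.0))  # 23.0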





Just a few months ago people were still talking about exponential progress. The fact that we're already settling for merely linear progress is not a good sign.

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

We are still at a 7-month doubling time on METR task duration. If anything, the rate is increasing if you weight more recent measurements more heavily.
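To make that concrete, here is a minimal sketch of what a 7-month doubling time implies. The doubling time is from the METR post linked above; the 60-minute starting horizon is a made-up example, not a measured value.

    # Exponential time-horizon trend: horizon doubles every `doubling_months`.
    def horizon_after(months: float, start_minutes: float, doubling_months: float = 7.0) -> float:
        """Horizon grows as start * 2^(elapsed / doubling_time)."""
        return start_minutes * 2 ** (months / doubling_months)

    # Example: an assumed 60-minute horizon today is ~646 minutes (~11 hours) after 24 months.
    print(horizon_after(24, start_minutes=60))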


Linear growth on a 0-100 benchmark is quite likely an exponential increase in capability.
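One way to see this is to map scores to log-odds: equal point gains near the top of the scale correspond to ever-larger jumps in log-odds, i.e. ever-larger proportional cuts in the error rate. A minimal sketch with illustrative numbers:

    import math

    # Map a 0-100 score to log-odds; linear score gains near the ceiling
    # produce growing log-odds increments.
    def log_odds(score_percent: float) -> float:
        p = score_percent / 100.0
        return math.log(p / (1 - p))

    for score in (80, 85, 90, 95):
        print(score, round(log_odds(score), 2))
    # 80 -> 1.39, 85 -> 1.73, 90 -> 2.2, 95 -> 2.94: each 5-point step is a bigger jump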

This got me thinking - is there any reasonable metric we could use to measure the intellectual capabilities of the most capable species on Earth at each point in evolutionary time? I wonder what kind of growth function we'd see.

Silly idea - is there an inter-species game we could use to measure Elo?
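Scoring such a game would just be standard Elo; here is a minimal sketch (the K factor and the ratings in the example are arbitrary, and the game itself is obviously hypothetical):

    # Standard Elo: expected score from the rating gap, then adjust by K * surprise.
    def expected(r_a: float, r_b: float) -> float:
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
        e_a = expected(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

    # Example: a 1200-rated player beats a 1400-rated one and gains ~24 points.
    print(update(1200, 1400, score_a=1.0))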


Except it is sublinear. Sonnet 4 was 10.2% above Sonnet 3.7 after 3 months.

We should all know that in the software world, the last 10% requires 90% of the effort!

Sublinear as demonstrated on a sigmoid scale is quite fast enough for me, thank you.

The reason they get a perfect score on AIME is that every AIME question had a lot of thought put into it, and care was taken to make sure everything was actually solvable. SWE-bench, and many other AI benchmarks, have a lot of eval noise, where there is no clear right answer, and getting higher than a certain percentage means you are benchmaxxing.

> SWE-bench, and many other AI benchmarks, have lots of eval noise

SWE-bench has lots of known limitations, even with efforts to reduce solution leakage and overfitting.

> where there is no clear right answer

This is both a feature and a bug. If there is no clear answer, how do you determine whether an LLM has progressed? It can't simply be judged on producing "more right answers" with each release.


Do you think a messier math benchmark (in terms of how it is defined) might be harder for these models?

Pretty sure there is a subset of SWE-bench problems that are either ill-posed or not solvable with the intended setup; I think I remember another company excluding a fraction of them for that reason. So maxing out SWE-bench might only mean ~95%.

I'm most interested to see the METR time-horizon results - that is the real test of whether we are "on-trend".


That's why they made SWE-bench Verified. Verified excludes those.


