
Also getting a perfect score on AIME (math) is pretty cool.

Tongue in cheek: if we progress linearly from here, software engineering as defined by SWE-bench is solved in 23 months.
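For concreteness, here's that extrapolation as a tiny Python sketch. The current score and per-month gain below are made-up illustrations, not numbers from any announcement; they're just chosen so the gap closes in roughly 23 months.

    # Hypothetical linear extrapolation: months until a linear trend hits the ceiling.
    def months_to_saturation(current: float, gain_per_month: float, ceiling: float = 100.0) -> float:
        """Remaining gap divided by the assumed per-month gain."""
        return (ceiling - current) / gain_per_month

    # Example with assumed numbers: 77% today, gaining ~1 point/month -> 23 months.
    print(months_to_saturation(current=77.0, gain_per_month=1.0))  # 23.0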





Just a few months ago people were still talking about exponential progress. The fact that we're already settling for merely linear progress is not a good sign.

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

We are still at a 7-month doubling time on METR task duration. If anything, the rate is increasing if you weight more recent measurements more heavily.
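To make that concrete, here is a minimal sketch of what a 7-month doubling time implies. The doubling time is from the METR post linked above; the 60-minute starting horizon is a made-up example, not a measured value.

    # Exponential time-horizon trend: horizon doubles every `doubling_months`.
    def horizon_after(months: float, start_minutes: float, doubling_months: float = 7.0) -> float:
        """Horizon grows as start * 2^(elapsed / doubling_time)."""
        return start_minutes * 2 ** (months / doubling_months)

    # Example: an assumed 60-minute horizon today is ~646 minutes (~11 hours) after 24 months.
    print(horizon_after(24, start_minutes=60))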


Linear growth on a 0-100 benchmark is quite likely an exponential increase in capability.
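One way to see this is to map scores to log-odds: equal point gains near the top of the scale correspond to ever-larger jumps in log-odds, i.e. ever-larger proportional cuts in the error rate. A minimal sketch with illustrative numbers:

    import math

    # Map a 0-100 score to log-odds; linear score gains near the ceiling
    # produce growing log-odds increments.
    def log_odds(score_percent: float) -> float:
        p = score_percent / 100.0
        return math.log(p / (1 - p))

    for score in (80, 85, 90, 95):
        print(score, round(log_odds(score), 2))
    # 80 -> 1.39, 85 -> 1.73, 90 -> 2.2, 95 -> 2.94: each 5-point step is a bigger jump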

This got me thinking - is there any reasonable metric we could use to measure the intellectual capabilities of the most capable species on Earth at each point in evolutionary time? I wonder what kind of growth function we'd see.

Silly idea - is there an inter-species game we could use to measure Elo?
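Scoring such a game would just be standard Elo; here is a minimal sketch (the K factor and the ratings in the example are arbitrary, and the game itself is obviously hypothetical):

    # Standard Elo: expected score from the rating gap, then adjust by K * surprise.
    def expected(r_a: float, r_b: float) -> float:
        return 1 / (1 + 10 ** ((r_b - r_a) / 400))

    def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
        e_a = expected(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

    # Example: a 1200-rated player beats a 1400-rated one and gains ~24 points.
    print(update(1200, 1400, score_a=1.0))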


Except it is sublinear. Sonnet 4 was 10.2% above Sonnet 3.7 after 3 months.

We should all know that in the software world, the last 10% requires 90% of the effort!

Sublinear as demonstrated on a sigmoid scale is quite fast enough for me, thank you.

The reason they get a perfect score on AIME is that every AIME question had a lot of thought put into it, and care was taken to make sure everything was actually solvable. SWE-bench, and many other AI benchmarks, have a lot of eval noise, where there is no clear right answer, and getting higher than a certain percentage means you are benchmaxxing.

> SWE-bench, and many other AI benchmarks, have lots of eval noise

SWE-bench has lots of known limitations, even with efforts to reduce solution leakage and overfitting.

> where there is no clear right answer

This is both a feature and a bug. If there is no clear answer, how do you determine whether an LLM has progressed? It can't simply be judged on producing "more right answers" with each release.


Do you think a messier math benchmark (in terms of how it is defined) might be harder for these models?

Pretty sure there is a subset of SWE-bench problems that are either ill-posed or not solvable with the intended setup; I think I remember another company excluding a fraction of them for that reason. So maxing out SWE-bench might only mean ~95%.

I'm most interested to see the METR time-horizon results - that is the real test of whether we are "on-trend".


That's why they made SWE-bench Verified. Verified excludes those.


