Just a few months ago people were still talking about exponential progress.
The fact that we’re already settling for just linear progress is not a good sign.
This got me thinking - is there any reasonable metric we could use to measure the intellectual capabilities of the most capable species on Earth that had evolved at each point in time? I wonder what kind of growth function we'd see.
Silly idea - is there an inter-species game we could use to measure Elo?
The reason they get a perfect score on AIME is that every AIME question has a lot of thought put into it, and care was taken to make sure every problem is actually solvable. SWE-bench, and many other AI benchmarks, have lots of eval noise, where there is no clear right answer, and getting above a certain percentage means you are benchmaxxing.
> SWE-bench, and many other AI benchmarks, have lots of eval noise
SWE-bench has lots of known limitations, even with its efforts to reduce solution leakage and overfitting.
> where there is no clear right answer
This is both a feature and a bug. If there is no clear answer, then how do you determine whether an LLM has progressed? It can't simply be judged by producing "more right answers" with each release.
Pretty sure there is a subset of SWE-bench problems that are either ill-posed or not solvable with the intended setup; I think I remember another company excluding a fraction of them for that reason. So maxing out SWE-bench might only be ~95%.
I'm most interested to see the METR time horizon results - that is the real test of whether we are "on-trend".
Tongue in cheek: if we progress linearly from here, software engineering as defined by SWE-bench is solved in 23 months.
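If anyone wants to redo that back-of-the-envelope math themselves, here's a minimal sketch of the linear extrapolation. The starting score, the per-month gain, and the `months_to_saturation` helper are hypothetical placeholders, not figures from the article.

```python
# Minimal sketch of the linear "when is SWE-bench saturated?" extrapolation.
# All numbers below are hypothetical placeholders, not figures from the article.

def months_to_saturation(current_pct: float, gain_pct_per_month: float,
                         ceiling_pct: float = 100.0) -> float:
    """Months until a linearly improving score reaches the ceiling."""
    if gain_pct_per_month <= 0:
        raise ValueError("per-month gain must be positive")
    return (ceiling_pct - current_pct) / gain_pct_per_month

# e.g. a (made-up) 77% score gaining 1 point per month hits 100% in 23 months;
# set ceiling_pct=95.0 if you buy the ~95% practical ceiling mentioned above.
print(months_to_saturation(77.0, 1.0))
```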