Hacker News

Let me know when they can perform that well without a 300-shot. Or that well on unseen ARC-AGI-2.


Two years give or take 6 months.


Give a group of "average humans" two years, give or take six months, and they will also saturate the benchmark; some of them would probably beat the SOTA LLM/RLM.

People do this all the time with games, for example.


Average humans cannot be copy-pasted.


Average companies also don't pay humans to complete a benchmark consisting of a fixed set of problems.


Done (link says 6 samples?)


> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.

I'm talking about transfer learning and generalization. A human who has never seen the problem set can be told the rules of the problem domain and then score 85+% on the rest. o3 at high compute needs 300 examples via SFT to perform similarly. An impressive feat, but clearly not enough to just hand an agent instructions and let it go. Needing 300 examples for human-level performance on this specific task is still impressive compared to the SOTA of two years ago. It will be interesting to see performance on ARC-AGI-2.
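To make the few-shot generalization point concrete, here is a minimal toy sketch (hypothetical task, not an actual ARC-AGI problem) of the ARC task format: a solver is shown a handful of demonstration input/output grids, must infer the underlying rule, and then apply it to an unseen test input. All function names and the task itself are made up for illustration.

```python
# Toy illustration of ARC-style few-shot generalization. Each task provides
# demonstration (input, output) grid pairs; the solver infers the rule from
# the demos alone and applies it to a new test grid.

def infer_color_map(demos):
    """Infer a cell-wise color substitution consistent with all demo pairs."""
    mapping = {}
    for grid_in, grid_out in demos:
        for row_in, row_out in zip(grid_in, grid_out):
            for a, b in zip(row_in, row_out):
                if a in mapping and mapping[a] != b:
                    raise ValueError("rule is not a simple color map")
                mapping[a] = b
    return mapping

def apply_color_map(mapping, grid):
    """Apply the inferred substitution to every cell of a grid."""
    return [[mapping[c] for c in row] for row in grid]

# Two demonstrations suffice to pin down the rule (1 -> 2, 0 -> 0).
demos = [
    ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
]
rule = infer_color_map(demos)
print(apply_color_map(rule, [[1, 0], [0, 1]]))  # -> [[2, 0], [0, 2]]
```

The gap under discussion is between this kind of inference from the task's own few demos (what humans do) and needing hundreds of solved tasks as SFT data before the model can do it.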



