We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.
Overall I REALLY like this paper and effort, but this part sounds like a bit of bullshit. They don't have the ability to implement retries and backoffs to deal with rate limits?
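
For reference, here is roughly what I mean by retries and backoffs: a minimal sketch of exponential backoff with full jitter. Nothing here comes from the paper. `RateLimitError` and `call_with_backoff` are hypothetical names I made up for illustration, and a real API client would raise its own exception type on an HTTP 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    The wait ceiling doubles on each attempt (base_delay * 2**attempt),
    capped at max_delay. The actual wait is drawn uniformly from
    [0, ceiling] ("full jitter") so parallel workers do not retry in sync.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

You would wrap each model request, e.g. `call_with_backoff(lambda: send_request(prompt))`, where `send_request` is whatever function actually hits the API (again hypothetical). The `sleep` parameter is only there so the waiting can be stubbed out in tests.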