We evaluate several frontier models on PaperBench, finding that the best-perform...

attentive · 2025-04-02T21:57:55 1743631075

"We wished to also evaluate Claude 3.7 Sonnet, but were unable to complete the experiments given rate limits with the Anthropic API."

swyx · 2025-04-02T22:49:38 1743634178

overall i REALLY like this paper and effort, but this part sounds like a bit of bullshit. they don't have the ability to implement retries and backoffs to deal with rate limits?
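
For reference, the kind of retry-and-backoff wrapper meant here is roughly the following. It's a minimal sketch, not anything from the paper: `RateLimitError` and `call_with_backoff` are hypothetical names, and a real client would raise its own error type on an HTTP 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises when rate limited."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is only there so the wait can be swapped out in tests; in normal use you'd call `call_with_backoff(lambda: client_request())` with whatever request function you have.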

eightysixfour · 2025-04-02T23:03:55 1743635035

Because they used wall-clock time, not compute time, FLOPs, or watts, to standardize. 24 hours and 36 hours of compute.

They could build a system that gives them equal compute time by ignoring time spent waiting on rate limits and such, but they chose not to.
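
To make that concrete, here's a rough sketch of a clock that charges a run only for time not spent blocked on rate limits. This is an assumption about how one might do it, not what the authors built; `BudgetClock` and `waiting` are made-up names.

```python
import time
from contextlib import contextmanager


class BudgetClock:
    """Tracks elapsed time against a budget, excluding intervals marked as waiting."""

    def __init__(self, budget_seconds, now=time.monotonic):
        self._now = now              # injectable clock, monotonic by default
        self._start = now()
        self._waited = 0.0           # total seconds spent inside waiting() blocks
        self.budget_seconds = budget_seconds

    @contextmanager
    def waiting(self):
        """Wrap any block spent blocked on a rate limit; its duration is not charged."""
        t0 = self._now()
        try:
            yield
        finally:
            self._waited += self._now() - t0

    def active_seconds(self):
        """Wall-clock time since start, minus time recorded as waiting."""
        return (self._now() - self._start) - self._waited

    def exhausted(self):
        return self.active_seconds() >= self.budget_seconds
```

The agent loop would check `clock.exhausted()` instead of comparing against a wall-clock deadline, and wrap each backoff sleep in `with clock.waiting():`. The tradeoff is that runs no longer finish in a fixed number of real hours.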

swyx · 2025-04-03T03:39:52 1743651592

ah. fair answer.

moralestapia · 2025-04-03T01:53:36 1743645216

"Why don't they just break the TOS"

Damned if you do, damned if you don't.