> Right now GPT-5, Claude Opus, Grok 4, and Gemini 2.5 Pro all seem quite good across the board (i.e., they can all basically solve moderately challenging math and coding problems).
I wonder whether that's because they have substantial overlap in training data and algorithms, but more importantly, whether they all use the same benchmarks and optimize for them.
As Goodhart's law puts it: once a metric (or a benchmark score, in this case) becomes a target, it ceases to be a good metric.