Maybe 41.8% is the score of Qwen3-235B-A22B-Thinking-2507, lol. 11% for the non-...

jug · 2025-07-25T14:38:49 1753454329

Makes sense, it's in line with Gemini 2.5 Pro in that case. It aligns with their other results in the post.

christianqchung · 2025-07-25T15:49:42 1753458582

They made it very clear that they were reporting that score for the non-thinking model[0]. I still don't have any guesses as to what happened here; maybe something format-related. I can't see a motivation to blatantly lie about a benchmark score when it would very obviously be publicly corrected.

[0] https://x.com/JustinLin610/status/1947836526853034403