
Maybe 41.8% is the score of Qwen3-235B-A22B-Thinking-2507, lol. 11% for the non-thinking model is pretty high


Makes sense, it's in line with Gemini 2.5 Pro in that case. It aligns with their other results in the post.


They made it very clear that they were reporting that score for the non-thinking model[0]. I still don't have any guesses as to what happened here; maybe something format-related. I can't see a motivation to blatantly lie on a benchmark when that would very obviously be publicly corrected.

[0] https://x.com/JustinLin610/status/1947836526853034403



