The paper (published 4 days ago) has this on page 10, and says that o1-mini failed to solve it correctly:

   Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
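For reference, the "five of them were a bit smaller" clause doesn't change the count, so the expected answer is 44 + 58 + 2 × 44 = 190. A quick check:

    friday = 44
    saturday = 58
    sunday = 2 * friday  # "double the number he did on Friday"; smaller kiwis still count
    print(friday + saturday + sunday)  # -> 190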
I pasted it into ChatGPT and Claude, and all four models I tried gave the correct answer:

4o mini: https://chatgpt.com/share/6709814f-9ff8-800e-8aab-127b6f952d...

4o: https://chatgpt.com/share/6709816c-3768-800e-9eb1-173dfbb5d8...

o1-mini: https://chatgpt.com/share/67098178-4088-800e-ba95-9731a75055...

3.5 sonnet: https://gist.github.com/rahimnathwani/34f93de07eb7510d57ec1e...




Isn't it because this test has since spread across the internet and the LLMs picked up on it, so now they give the correct answer?

Maybe try a new, unique logical question. And not the same question with a few words changed, because that might still closely match data the LLM has already ingested.


  the LLMs picked up on it, so now they give the correct answer
The models don't just 'pick up' information that appears on the internet; they have to be retrained with that new data included in the training set.

I tested the models 4 days after the paper was published.

The models are retrained every few months, and the process takes much more than 4 days.


Wonder if this is like the old-school benchmarks people used to cheat on. It shouldn't be hard to assemble a series of such puzzles and get a read on overall accuracy :)
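E.g. template the question and randomize the numbers. A rough sketch in Python (make_puzzle and ask_model are made-up names; ask_model stands in for whatever model call and answer parsing you'd actually use):

    import random

    def make_puzzle(rng):
        # Keep the structure (and the irrelevant size detail), randomize the numbers.
        fri = rng.randint(20, 90)
        sat = rng.randint(20, 90)
        small = rng.randint(2, 9)
        prompt = (
            f"Oliver picks {fri} kiwis on Friday. Then he picks {sat} kiwis on Saturday. "
            f"On Sunday, he picks double the number of kiwis he did on Friday, but "
            f"{small} of them were a bit smaller than average. How many kiwis does Oliver have?"
        )
        return prompt, fri + sat + 2 * fri  # ground truth; the size clause is a distractor

    def accuracy(ask_model, n=100, seed=0):
        # ask_model(prompt) -> int is a placeholder for an actual model call
        rng = random.Random(seed)
        puzzles = [make_puzzle(rng) for _ in range(n)]
        return sum(ask_model(p) == answer for p, answer in puzzles) / n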


Remember. How. It works. Please, please remember how it works. It is generating an answer anew, every single time. It is amazing how often it produces a correct answer, but not at all surprising that it produces inconsistent and sometimes incorrect answers.
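You can watch it happen: send the exact same prompt several times at nonzero temperature and compare the outputs. A minimal sketch using the OpenAI Python SDK (the model name is just an example, and it assumes OPENAI_API_KEY is set in the environment):

    from openai import OpenAI

    client = OpenAI()
    prompt = ("Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
              "On Sunday, he picks double the number of kiwis he did on Friday, but "
              "five of them were a bit smaller than average. "
              "How many kiwis does Oliver have?")

    answers = []
    for _ in range(5):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,      # nonzero temperature: each run samples tokens anew
        )
        answers.append(resp.choices[0].message.content)

    # The five responses need not agree with each other.
    for a in answers:
        print(a)

At temperature 0 the outputs are far more repeatable, but each answer is still generated token by token, not looked up.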



