The paper (published 4 days ago) has this on page 10, and says that o1-mini failed to solve it correctly:

   Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
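For reference, the "five of them were a bit smaller" clause doesn't change the count, so the expected answer is 44 + 58 + 2 × 44 = 190. A quick check:

    friday = 44
    saturday = 58
    sunday = 2 * friday  # "double the number he did on Friday"; smaller kiwis still count
    print(friday + saturday + sunday)  # -> 190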
I pasted it into ChatGPT and Claude, and all four models I tried gave the correct answer:

4o mini: https://chatgpt.com/share/6709814f-9ff8-800e-8aab-127b6f952d...

4o: https://chatgpt.com/share/6709816c-3768-800e-9eb1-173dfbb5d8...

o1-mini: https://chatgpt.com/share/67098178-4088-800e-ba95-9731a75055...

3.5 sonnet: https://gist.github.com/rahimnathwani/34f93de07eb7510d57ec1e...




Isn't it because this test has since spread across the internet and the LLMs picked up on it, so now they give the correct answer?

Maybe try a new, unique logical question. And not the same question with a few words changed, because that might still closely match data the LLM has already ingested.


  the LLMs picked up on it, so now they give the correct answer
The models don't just 'pick up' information that appears on the internet; they have to be retrained with that new data included in the training set.

I tested the models 4 days after the paper was published.

The models are retrained every few months, and the process takes much more than 4 days.


Wonder if this is like the old-school benchmarks people used to cheat on. It shouldn't be hard to assemble a series of such puzzles and get a read on overall accuracy :)
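E.g. template the question and randomize the numbers. A rough sketch in Python (make_puzzle and ask_model are made-up names; ask_model stands in for whatever model call and answer parsing you'd actually use):

    import random

    def make_puzzle(rng):
        # Keep the structure (and the irrelevant size detail), randomize the numbers.
        fri = rng.randint(20, 90)
        sat = rng.randint(20, 90)
        small = rng.randint(2, 9)
        prompt = (
            f"Oliver picks {fri} kiwis on Friday. Then he picks {sat} kiwis on Saturday. "
            f"On Sunday, he picks double the number of kiwis he did on Friday, but "
            f"{small} of them were a bit smaller than average. How many kiwis does Oliver have?"
        )
        return prompt, fri + sat + 2 * fri  # ground truth; the size clause is a distractor

    def accuracy(ask_model, n=100, seed=0):
        # ask_model(prompt) -> int is a placeholder for an actual model call
        rng = random.Random(seed)
        puzzles = [make_puzzle(rng) for _ in range(n)]
        return sum(ask_model(p) == answer for p, answer in puzzles) / n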


Remember. How. It works. Please, please remember how it works. It is generating an answer anew, every single time. It is amazing how often it produces a correct answer, but not at all surprising that it produces inconsistent and sometimes incorrect answers.
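You can watch it happen: send the exact same prompt several times at nonzero temperature and compare the outputs. A minimal sketch using the OpenAI Python SDK (the model name is just an example, and it assumes OPENAI_API_KEY is set in the environment):

    from openai import OpenAI

    client = OpenAI()
    prompt = ("Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
              "On Sunday, he picks double the number of kiwis he did on Friday, but "
              "five of them were a bit smaller than average. "
              "How many kiwis does Oliver have?")

    answers = []
    for _ in range(5):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,      # nonzero temperature: each run samples tokens anew
        )
        answers.append(resp.choices[0].message.content)

    # The five responses need not agree with each other.
    for a in answers:
        print(a)

At temperature 0 the outputs are far more repeatable, but each answer is still generated token by token, not looked up.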



