
Another thing I tried was getting logic puzzles from the internet and giving them to 3.5 and 4. Both usually pass.

Then I alter them ever so slightly.

Then oftentimes only GPT-4 passes.

From that I reckon 3.5 is mostly regurgitating training data: it can answer things it has seen before. But 4 seems to have some ability to reason - or maybe it's just better at generalising?
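
If you want to run that kind of perturbation test yourself, here's a minimal sketch, assuming the OpenAI Python SDK (v1.x); the puzzle text is a placeholder and the model names are the public API aliases:

    # Perturbation test: ask both models the original puzzle and a
    # slightly altered variant, then compare the answers by eye.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder puzzle; the altered variant changes a detail that a
    # memorised answer would get wrong.
    puzzles = {
        "original": "A farmer must cross a river with a wolf, a goat, and a cabbage...",
        "altered": ("A farmer must cross a river with a wolf, a goat, and a cabbage, "
                    "but this wolf eats cabbages and ignores goats..."),
    }

    def ask(model: str, question: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=0,  # reduce run-to-run variance
        )
        return response.choices[0].message.content

    for variant, text in puzzles.items():
        for model in ("gpt-3.5-turbo", "gpt-4"):
            print(f"--- {model} / {variant} ---")
            print(ask(model, text))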



Failing after the puzzle is altered slightly doesn't necessarily mean they aren't capable of solving it.

That's a human failure mode as well, one that LLMs have adopted. If you really want to know whether they can solve it, don't stop there. Either rewrite the question so it doesn't trigger common priors, or tell the model it's making a wrong assumption.
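
As a sketch of that second option (the nudge wording and placeholder variables are mine, not a tested prompt):

    # Follow-up technique: don't stop at the first failure; point out
    # the wrong assumption and let the model retry in the same conversation.
    from openai import OpenAI

    client = OpenAI()

    altered_puzzle = "..."  # the slightly altered puzzle (placeholder)
    first_answer = "..."    # the model's failed first attempt (placeholder)

    messages = [
        {"role": "user", "content": altered_puzzle},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": (
            "Careful: you're assuming this is the classic version of the "
            "puzzle. Re-read the altered conditions and try again."
        )},
    ]
    retry = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    print(retry.choices[0].message.content)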


I don't doubt that - my point, though, is that maybe 3.5 can only solve things in its training data, while 4 can figure things out.

3.5 seems more rigid: it needs babysitting to solve things, which means it can only solve things I already know how to solve. 4 is more flexible and can solve things by itself.



