
It improves to 25.9 on NYT Connections, up from 24.4 for the previous version of Claude 3.5 Sonnet: https://github.com/lechmazur/nyt-connections/.
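For illustration, here is a minimal sketch of how a Connections-style answer could be graded against the puzzle solution. This is an assumption about the general idea, not the linked repo's actual scoring method: a group counts only if all four words match exactly.

```python
# Illustrative sketch (not the linked repo's actual scoring): grade a
# model's proposed NYT Connections groups against the puzzle solution.
# A group counts as solved only if all four of its words match a group
# in the solution exactly (order- and case-insensitive).

def score_connections(proposed, solution):
    """Return the fraction of groups guessed exactly right.

    proposed, solution: lists of groups, each a list of four words.
    """
    solved = {frozenset(w.lower() for w in g) for g in solution}
    correct = sum(
        1 for g in proposed
        if frozenset(w.lower() for w in g) in solved
    )
    return correct / len(solution)

# Hypothetical puzzle for demonstration.
solution = [
    ["BASS", "FLOUNDER", "SALMON", "TROUT"],
    ["ANT", "DRILL", "ISLAND", "OPAL"],
    ["BUCKET", "GUEST", "TOP", "WISH"],
    ["EXPONENT", "POWER", "RADICAL", "ROOT"],
]
guess = [
    ["BASS", "FLOUNDER", "SALMON", "TROUT"],   # exact match
    ["ANT", "DRILL", "ISLAND", "POWER"],       # one word wrong
    ["BUCKET", "GUEST", "TOP", "WISH"],        # exact match
    ["EXPONENT", "OPAL", "RADICAL", "ROOT"],   # one word wrong
]
print(score_connections(guess, solution))  # 0.5
```

Averaging a score like this over many puzzles gives a single benchmark number comparable across models.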



Perhaps it's just because English is not my native language, but prompt 3 isn't quite clear at the beginning, where it says "group of four. Words (...)". It doesn't explain what the group of four must consist of. If I change the prompt to say "group of four words", Claude 3.5 manages to answer; without that change, Claude says the prompt isn't clear enough and won't answer.


What a neat benchmark! I'm blown away that o1 absolutely crushes everyone else in this. I guess the chain of thought really hashes out those associations.


Isn't it possible that o1 was also trained on this data (or something super similar) directly? The score seems disproportionately high.


They definitely considered it. Early articles from The Information talked about how high Strawberry's performance on it was.



