
It improves to 25.9 on NYT Connections, up from 24.4 for the previous version of Claude 3.5 Sonnet: https://github.com/lechmazur/nyt-connections/.
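For illustration, here is a minimal sketch of how a Connections-style answer could be graded against the puzzle solution. This is an assumption about the general idea, not the linked repo's actual scoring method: a group counts only if all four words match exactly.

```python
# Illustrative sketch (not the linked repo's actual scoring): grade a
# model's proposed NYT Connections groups against the puzzle solution.
# A group counts as solved only if all four of its words match a group
# in the solution exactly (order- and case-insensitive).

def score_connections(proposed, solution):
    """Return the fraction of groups guessed exactly right.

    proposed, solution: lists of groups, each a list of four words.
    """
    solved = {frozenset(w.lower() for w in g) for g in solution}
    correct = sum(
        1 for g in proposed
        if frozenset(w.lower() for w in g) in solved
    )
    return correct / len(solution)

# Hypothetical puzzle for demonstration.
solution = [
    ["BASS", "FLOUNDER", "SALMON", "TROUT"],
    ["ANT", "DRILL", "ISLAND", "OPAL"],
    ["BUCKET", "GUEST", "TOP", "WISH"],
    ["EXPONENT", "POWER", "RADICAL", "ROOT"],
]
guess = [
    ["BASS", "FLOUNDER", "SALMON", "TROUT"],   # exact match
    ["ANT", "DRILL", "ISLAND", "POWER"],       # one word wrong
    ["BUCKET", "GUEST", "TOP", "WISH"],        # exact match
    ["EXPONENT", "OPAL", "RADICAL", "ROOT"],   # one word wrong
]
print(score_connections(guess, solution))  # 0.5
```

Averaging a score like this over many puzzles gives a single benchmark number comparable across models.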



Perhaps it's just because English is not my native language, but prompt 3 isn't quite clear at the beginning, where it says "group of four. Words (...)". It doesn't explain what the group of four must consist of. If I change the prompt to say "group of four words", Claude 3.5 manages to answer; without that change, Claude says the prompt isn't clear enough and won't answer.


What a neat benchmark! I'm blown away that o1 absolutely crushes everyone else in this. I guess the chain of thought really hashes out those associations.


Isn't it possible that o1 was also trained on this data (or something super similar) directly? The score seems disproportionately high.


They definitely considered it. Early articles from The Information talked about how high Strawberry's performance on it was.



