This intensifies my theory that some of the older OAI models are far more capable than advertised but in ways that are difficult to productize.
How unlikely is it that, in the training of these models, you occasionally run into an arrangement of data & hyperparameters that dramatically exceeds the capabilities of the others, even if those others have substantially more parameters & data to work with?
It could also be as simple as OAI experimenting on different datasets. Perhaps chess games were included in some GPT-3.5 training runs in order to see if training on chess would improve other tasks. Perhaps it was determined afterwards that yes, LLMs can play chess, but no, let's not spend time/compute on this.
Would be a shame, because chess is an excellent metric for testing logical thought and internal modeling. An LLM that can pick up a unique chess game halfway through and play it well to completion is clearly doing more than "predicting the next token based on the previous one".
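For what it's worth, that kind of probe is easy to sketch. Below is a minimal Python example using the python-chess library: replay a mid-game prefix, then ask the model to continue and count how many consecutive legal moves it produces. `query_model` is a hypothetical stand-in for however you call the LLM, and the legal-move streak is only a crude proxy for "internal modeling", not a real strength test.

```python
# Sketch: probing whether a model tracks board state from mid-game.
# Assumes a hypothetical query_model(prompt) -> str that returns the
# model's next move in SAN (e.g. "Nf3"); swap in your own API call.
import chess

def probe_midgame(san_moves, query_model, max_plies=20):
    """Replay a game prefix, then ask the model to continue it.

    Returns the number of consecutive legal moves the model produced,
    a rough proxy for how well it models the current position.
    """
    board = chess.Board()
    for san in san_moves:               # replay the known prefix
        board.push_san(san)

    prompt = " ".join(san_moves)
    legal_streak = 0
    for _ in range(max_plies):
        move = query_model(prompt).strip()
        try:
            board.push_san(move)        # raises ValueError if illegal/unparseable
        except ValueError:
            break                       # an illegal move ends the streak
        legal_streak += 1
        prompt += " " + move
        if board.is_game_over():
            break
    return legal_streak
```

Starting the probe from many different unique mid-game positions (rather than from move one) is the interesting part: it makes memorized opening lines much less useful as an explanation for good play.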
Clearly the performance of the "instruct" LLM is due to some odd bug or other issue. I do not believe it is fundamentally better at chess than the others, even if it was specifically trained on far more chess data, which is unlikely. A lack of correlation is not causation either.