This intensifies my theory that some of the older OAI models are far more capable than advertised but in ways that are difficult to productize.
How unlikely is it that, in the training of these models, you occasionally run into an arrangement of data & hyperparameters that dramatically exceeds the capabilities of the others, even if those others have substantially more parameters & data to work with?
It could also be as simple as OAI experimenting on different datasets. Perhaps chess games were included in some GPT-3.5 training runs in order to see if training on chess would improve other tasks. Perhaps it was determined afterwards that yes, LLMs can play chess, but no, let's not spend time/compute on this.
Would be a shame, because chess is an excellent metric for testing logical thought and internal modeling. An LLM that can pick up a unique chess game halfway through and play it well to completion is clearly doing more than "predicting the next token based on the previous one".
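For what it's worth, that kind of probe is easy to sketch. Below is a minimal Python example using the python-chess library: replay a mid-game prefix, then ask the model to continue and count how many consecutive legal moves it produces. `query_model` is a hypothetical stand-in for however you call the LLM, and the legal-move streak is only a crude proxy for "internal modeling", not a real strength test.

```python
# Sketch: probing whether a model tracks board state from mid-game.
# Assumes a hypothetical query_model(prompt) -> str that returns the
# model's next move in SAN (e.g. "Nf3"); swap in your own API call.
import chess

def probe_midgame(san_moves, query_model, max_plies=20):
    """Replay a game prefix, then ask the model to continue it.

    Returns the number of consecutive legal moves the model produced,
    a rough proxy for how well it models the current position.
    """
    board = chess.Board()
    for san in san_moves:               # replay the known prefix
        board.push_san(san)

    prompt = " ".join(san_moves)
    legal_streak = 0
    for _ in range(max_plies):
        move = query_model(prompt).strip()
        try:
            board.push_san(move)        # raises ValueError if illegal/unparseable
        except ValueError:
            break                       # an illegal move ends the streak
        legal_streak += 1
        prompt += " " + move
        if board.is_game_over():
            break
    return legal_streak
```

Starting the probe from many different unique mid-game positions (rather than from move one) is the interesting part: it makes memorized opening lines much less useful as an explanation for good play.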
Clearly the performance of the "instruct" LLM is due to some odd bug or other issue. I do not believe it is fundamentally better at chess than the others, even if it was specifically trained on far more chess data, which is unlikely. A lack of correlation is not causation either.