Nice experiment, even though we know that LLMs distill an internal world model representation of whatever they are trained on.
The experiment could be improved a little by using a more descriptive notation than PGN. PGN's strength is its shorthand, because it is written by humans while playing the game. That is far from a strength in LLM training data: ML algorithms, LLMs included, train better on more descriptive and accurate data, and verbosity is not a problem at all. There is also FEN notation, which encodes the entire board position at each move.
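For illustration, here is a minimal sketch of re-encoding PGN movetext as per-move FEN strings, assuming the python-chess library (the game fragment is arbitrary):

    import io
    import chess.pgn

    # Minimal sketch: turn terse PGN tokens into full board states (FEN).
    pgn_text = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
    game = chess.pgn.read_game(io.StringIO(pgn_text))

    board = game.board()
    for move in game.mainline_moves():
        san = board.san(move)          # the terse PGN token, e.g. "Nf3"
        board.push(move)
        print(san, "->", board.fen())  # the entire board state after the move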
One could easily imagine many other ways to describe a game: encoding the vertical and horizontal lines, listing exactly which squares each piece covers and of what color, which pieces are able to move, and generating a whole page describing the board situation after every move.
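As a hedged sketch of that kind of verbose encoding (again assuming python-chess; the exact fields are arbitrary), one could emit, for each occupied square, the piece, the square color, the covered squares, and whether the piece can move:

    import chess

    def describe(board: chess.Board) -> str:
        # One verbose line per piece: symbol, square, square color,
        # covered squares, and whether the piece has a legal move.
        lines = [board.fen()]
        for square, piece in sorted(board.piece_map().items()):
            name = chess.square_name(square)
            shade = "light" if (chess.square_rank(square) + chess.square_file(square)) % 2 else "dark"
            covers = [chess.square_name(s) for s in board.attacks(square)]
            movable = any(m.from_square == square for m in board.legal_moves)
            lines.append(f"{piece.symbol()} on {name} ({shade}) covers {covers}, can move: {movable}")
        return "\n".join(lines)

    print(describe(chess.Board()))  # one whole page per position, as suggested above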
I call this spatial navigation: the LLM learns the ins and outs of its training data and is able to make more informed guesses. Chess is fun and all, but code generation has the potential to be a lot more than just writing functions. Feeding the LLM the AST representation of the code, the tree of workspace files, the public items, and the module hierarchy alongside the code could be a significant improvement.
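As a rough illustration of what pairing code with its AST could look like (a stdlib-only sketch; real tooling would more likely use something like tree-sitter):

    import ast

    source = "def add(a, b):\n    return a + b\n"

    # One possible training sample: the raw code followed by a structured
    # dump of its syntax tree, so the model sees both surface and structure.
    sample = source + "\n# AST:\n" + ast.dump(ast.parse(source), indent=2)
    print(sample)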
> Nice experiment, even though we know that LLMs distill an internal world model representation of whatever they are trained on.
There are still a lot of people who deny that (for example, Bender's "superintelligent octopus" supposedly wouldn't learn a world model, no matter how much text it was trained on), so more evidence is always good.
> There is the FEN notation in which in every move the entire board is encoded.
The entire point of this is to not encode the board state!
>The entire point of this is to not encode the board state!
I am not sure about this. From the article: "The 50M parameter model played at 1300 ELO with 99.8% of its moves being legal within one day of training."
I thought the experiment was about how well the model would perform given that its reward function is to predict text rather than to checkmate. Leela and AlphaZero's reward function is to win the game: checkmate or capture pieces. Also, it goes without saying that Leela and AlphaZero cannot make illegal moves.
The experiment does not need to include the whole board position if withholding it is the point of interest; it could instead encode more information about the squares covered by each side, for example. See this training experiment for Trackmania [1]: there are techniques that the ML algorithm will *never* figure out by itself if the information is not encoded in its training data.
The point still stands: PGN is certainly not a good format if the goal (or one of the goals) of the experiment is to produce a good chess player.
That just shows that it worked in some sense. If it hadn't reached any reasonable ELO, the results would be uninformative: maybe it's impossible to learn chess from PGN, or maybe you just screwed up. He's clear that the point is to interrogate what the model learns:
"This model is only trained to predict the next character in PGN strings (1.e4 e5 2.Nf3 …) and is never explicitly given the state of the board or the rules of chess. Despite this, in order to better predict the next character, it learns to compute the state of the board at any point of the game, and learns a diverse set of rules, including check, checkmate, castling, en passant, promotion, pinned pieces, etc. In addition, to better predict the next character it also learns to estimate latent variables such as the ELO rating of the players in the game."
> Feeding the LLM the AST representation of the code, the tree of workspace files, the public items, and the module hierarchy alongside the code could be a significant improvement.
Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
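For anyone curious what such a map amounts to, here is a hedged stdlib-only sketch of the idea (Aider's real map is built with tree-sitter and is more sophisticated): walk the workspace and outline each file's public top-level items.

    import ast
    import pathlib

    def repo_map(root: str) -> str:
        # Outline every Python file: its path plus its public top-level
        # functions and classes. A crude stand-in for a repository map.
        lines = []
        for path in sorted(pathlib.Path(root).rglob("*.py")):
            lines.append(str(path))
            for node in ast.parse(path.read_text()).body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                    if not node.name.startswith("_"):  # crude "public" filter
                        lines.append(f"    {type(node).__name__}: {node.name}")
        return "\n".join(lines)

    print(repo_map("."))  # prepend this outline to the coding prompt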
More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
>Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
Great stuff.
>More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
As Microsoft's experiments with Phi-1 and Phi-2 show, training data makes a difference. The "Textbooks Are All You Need" motto means exactly that: better-structured, clearer data makes a difference.
> The experiment could be a little better by using a more descriptive form of notation than PGN
The author seems more interested in the model's ability to learn chess at a decent level from such impoverished input, and in what kind of world model it might build, than in helping it play as well as possible.
The fact that it was able to build a decent model of the board position from PGN training samples, without knowing anything about chess (or that it was even playing chess), is super impressive.
It seems simple enough to learn that, for example, "Nf3" means that an "N" is on "f3", especially since predicting well requires you to know what piece is on each square.
However, what is not so simple is having to learn - without knowing a single thing about chess - that "Nf3" also means that:
1) One of the 8 squares that is a knight's move away from f3, and had an "N" on it, now has nothing on it. There's a lot going on there!
2) If "f3" previously had a different piece on it, that piece is now gone (taken) - it should no longer be associated with "f3". Both effects are sketched below.