Nice experiment, even though we know that LLMs distill an internal world model representation of whatever they are trained on.
The experiment could be improved a little by using a more descriptive notation than PGN. PGN's strength is its shorthand, because it is written by humans while playing the game. That is far from a strength in LLM training data: ML algorithms, LLMs included, train better on more descriptive and accurate data, and verbosity is not a problem at all. There is also FEN notation, which encodes the entire board position at each move.
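For illustration, here is a minimal sketch of re-encoding PGN movetext as per-move FEN strings, assuming the python-chess library (the game fragment is arbitrary):

    import io
    import chess.pgn

    # Minimal sketch: turn terse PGN tokens into full board states (FEN).
    pgn_text = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
    game = chess.pgn.read_game(io.StringIO(pgn_text))

    board = game.board()
    for move in game.mainline_moves():
        san = board.san(move)          # the terse PGN token, e.g. "Nf3"
        board.push(move)
        print(san, "->", board.fen())  # the entire board state after the move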
One could easily imagine many other ways to describe a game: encoding the vertical and horizontal lines, listing exactly which squares each piece covers and of what color, which pieces are able to move, and generating a whole page describing the board situation after every move.
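As a hedged sketch of that kind of verbose encoding (again assuming python-chess; the exact fields are arbitrary), one could emit, for each occupied square, the piece, the square color, the covered squares, and whether the piece can move:

    import chess

    def describe(board: chess.Board) -> str:
        # One verbose line per piece: symbol, square, square color,
        # covered squares, and whether the piece has a legal move.
        lines = [board.fen()]
        for square, piece in sorted(board.piece_map().items()):
            name = chess.square_name(square)
            shade = "light" if (chess.square_rank(square) + chess.square_file(square)) % 2 else "dark"
            covers = [chess.square_name(s) for s in board.attacks(square)]
            movable = any(m.from_square == square for m in board.legal_moves)
            lines.append(f"{piece.symbol()} on {name} ({shade}) covers {covers}, can move: {movable}")
        return "\n".join(lines)

    print(describe(chess.Board()))  # one whole page per position, as suggested above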
I call this spatial navigation: the LLM learns the ins and outs of its training data and is able to make more informed guesses. Chess is fun and all, but code generation has the potential to be a lot more than just writing functions. Feeding the LLM the AST representation of the code, the tree of workspace files, the public items, and the module hierarchy alongside the code could be a significant improvement.
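As a rough illustration of what pairing code with its AST could look like (a stdlib-only sketch; real tooling would more likely use something like tree-sitter):

    import ast

    source = "def add(a, b):\n    return a + b\n"

    # One possible training sample: the raw code followed by a structured
    # dump of its syntax tree, so the model sees both surface and structure.
    sample = source + "\n# AST:\n" + ast.dump(ast.parse(source), indent=2)
    print(sample)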
> Nice experiment, even though we know that LLMs distill an internal world model representation of whatever they are trained on.
There are still a lot of people who deny that (for example, Bender's "superintelligent octopus" supposedly wouldn't learn a world model, no matter how much text it was trained on), so more evidence is always good.
> There is the FEN notation in which in every move the entire board is encoded.
The entire point of this is to not encode the board state!
>The entire point of this is to not encode the board state!
I am not sure about this. From the article: "The 50M parameter model played at 1300 ELO with 99.8% of its moves being legal within one day of training."
I thought the experiment was about how well the model would perform given that its reward function is to predict text rather than to checkmate. Leela and AlphaZero's reward function is to win the game: checkmate or capture pieces. Also, it goes without saying that Leela and AlphaZero cannot make illegal moves.
The experiment does not need to include the whole board position if withholding it is the point of interest; it could instead encode more information about the squares covered by each side, for example. See this training experiment for Trackmania [1]: there are techniques that the ML algorithm will *never* figure out by itself if the information is not encoded in its training data.
The point still stands: PGN is certainly not a good format if the goal (or one of the goals) of the experiment is to produce a good chess player.
That just shows that it worked in some sense. If it hadn't reached any reasonable ELO, the results would be uninformative: maybe it's impossible to learn chess from PGN, or maybe you just screwed up. He's clear that the point is to interrogate what the model learns:
"This model is only trained to predict the next character in PGN strings (1.e4 e5 2.Nf3 …) and is never explicitly given the state of the board or the rules of chess. Despite this, in order to better predict the next character, it learns to compute the state of the board at any point of the game, and learns a diverse set of rules, including check, checkmate, castling, en passant, promotion, pinned pieces, etc. In addition, to better predict the next character it also learns to estimate latent variables such as the ELO rating of the players in the game."
> Feeding the LLM the AST representation of the code, the tree of workspace files, the public items, and the module hierarchy alongside the code could be a significant improvement.
Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
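For anyone curious what such a map amounts to, here is a hedged stdlib-only sketch of the idea (Aider's real map is built with tree-sitter and is more sophisticated): walk the workspace and outline each file's public top-level items.

    import ast
    import pathlib

    def repo_map(root: str) -> str:
        # Outline every Python file: its path plus its public top-level
        # functions and classes. A crude stand-in for a repository map.
        lines = []
        for path in sorted(pathlib.Path(root).rglob("*.py")):
            lines.append(str(path))
            for node in ast.parse(path.read_text()).body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                    if not node.name.startswith("_"):  # crude "public" filter
                        lines.append(f"    {type(node).__name__}: {node.name}")
        return "\n".join(lines)

    print(repo_map("."))  # prepend this outline to the coding prompt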
More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
>Aider does this, using tree-sitter to build a “repository map”. This helps the LLM understand the overall code base and how it relates to the specific coding task at hand.
Great stuff.
>More broadly, I agree with your sentiment that there is a lot of value in considering the best ways to structure the data we share with LLMs. Especially in the context of coding.
As Microsoft's experiments with Phi-1 and Phi-2 show, training data makes a difference. The "Textbooks Are All You Need" motto means exactly that: better-structured, clearer data makes a difference.
> The experiment could be a little better by using a more descriptive form of notation than PGN
The author seems more interested in the model's ability to learn chess at a decent level from such impoverished input, and in what kind of world model it might build, than in helping it play as well as possible.
The fact that it was able to build a decent model of the board position from PGN training samples, without knowing anything about chess (or that it was even playing chess), is super impressive.
It seems simple enough to learn that, for example, "Nf3" means that an "N" is on "f3", especially since predicting well requires you to know what piece is on each square.
However, what is not so simple is having to learn - without knowing a single thing about chess - that "Nf3" also means that:
1) One of the 8 squares that is a knight's move away from f3, and had an "N" on it, now has nothing on it. There's a lot going on there!
2) If "f3" previously had a different piece on it, that piece is now gone (taken) - it should no longer be associated with "f3". Both effects are sketched below.