As described in the OP's blog post https://adamkarvonen.github.io/machine_learni...

goatlover · on Jan 7, 2024

What's the strength of play for the GPT architecture? It's impressive that it figures out the rules, but does it play strong chess?

>> As they say, attention may indeed be all you need.

I don't think drawing general conclusions about intelligence from a board game is warranted. We didn't evolve to play chess or Go.

eigenket · on Jan 7, 2024

> What's the strength of play for the GPT architecture?

Pretty shit for a computer. He says his 50m model reached 1800 Elo (by the way, its Elo and not ELO as the article incorrectly has it, it is named after a Hungarian guy called Elo). It seems to be a bit better than Stockfish level 1 and a bit worse than Stockfish level 2 from the bar graph.

Based on what we know I think its not surprising these models can learn to play chess, but they get absolutely smoked by a "real" chess bot like Stockfish or Leela.

golol · on Jan 7, 2024

Afaik his small bot reaches 1300 and gpt-3.5-instruct reaches 1800. We have no idea how much and on what kind of PGNs the Openai model was trained. I heard a rumor that they specifically trained on games up to 1800 before but no idea.

jhrmnn · on Jan 7, 2024

They also say “I left one training for a few more days and it reached 1500 ELO.” I find it quite likely the observed performance is largely limited by the spent compute.

foota · on Jan 7, 2024

I can't see it being superhuman, that's for sure. Chess AI are superhuman because they do vast searches, and I can't see that being replicated by an LLM architecture.

WithinReason · on Jan 7, 2024

The apples-to-apples comparison would be comparing an LLM with Leela with search turned off (only using a single board state)

According to figure 6b [0] removing MCTS reduces Elo by about 40%, scaling 1800 Elo by 5/3 gives us 3000 Elo which would be superhuman but not as good as e.g. LeelaZero.

[0]: https://gwern.net/doc/reinforcement-learning/model/alphago/2...

sscg13 · on Jan 7, 2024

Leela policy is around 2600 elo, or around the level of a strong grandmaster. Note that Go is different from chess since there are no draws, so skill difference is greatly magnified. Elo is always a relative scale (expected score is based on elo difference) so multiplication should not really make sense anyways.

edgyquant · on Jan 7, 2024

I don’t think 3000 is superhuman though, it’s peak human as iirc magnus had an Elo of 3000 at one point

macrolime · on Jan 7, 2024

Any particular reason why that shouldn't work well with fine-tuning of an LLM using reinforcement learning?

adrianN · on Jan 7, 2024

Chess AI used to dominate by computational power but to my knowledge that is no longer true and the engines beat all but the very strongest players even when run on phone CPUs.

baq · on Jan 7, 2024

Phone cpus have gotten quite fast in the past decade, too.

adrianN · on Jan 7, 2024

Deep Blue analyzed some 200 million positions per second. Modern engines analyze a three to four orders of magnitude fewer nodes per second, but have much more refined pruning of the search space.

baq · on Jan 8, 2024

Point taken.

Positions analyzed per $, per W and per watt-dollar are surely much, much higher, though ;)

foota · on Jan 8, 2024

20 thousand positions per second is still a lot compared to a human though.

jksk61 · on Jan 7, 2024

> What's the strength of play for the GPT architecture? It's impressive that it figures out the rules, but does it play strong chess?

sometimes it is not a matter of "is it better? is it larger? is it more efficient?", but just a question.

mountains are mountains, men are men.

bananapub · on Jan 7, 2024

> can intuit the rules of the game from those examples,

I am pretty sure a bunch of matrix multiplications can't intuit anything.

naively, it doesn't seem very surprising that enormous amounts of self play cause the internal structure to reflect the inputs and outputs?

nerdponx · on Jan 7, 2024

It's not self-play. It's literally just reading sequences of moves. And it doesn't even know that they're moves, or that it's supposed to be learning a game. It's just learning to predict the next token given a sequence of previous tokens.

What's kind of amazing is that, in doing so, it actually learns to play chess! That is, the model weights naturally organize into something resembling an understanding of chess, just by trying to minimize error on next-token prediction.

It makes sense, but it's still kind of astonishing that it actually works.

golol · on Jan 7, 2024

> I am pretty sure a bunch of matrix multiplications can't intuit anything.

I don't understand how people can say things like this when universal approximation is an easy thing to prove. You could reproduce Magnus Carlsen's exact chess-playing stochastic process with a bunch of matrix multiplications and nonlinear activations, up to arbitrarily small error.

ben_w · on Jan 7, 2024

I read such statements as being claims that "intuition" is part of consciousness etc.

It's still too strong a claim given that matrix multiplication also describes quantum mechanics and by extension chemistry and by extension biology and by extension our own brains… but I frequently encounter examples of mistaking two related concepts for synonyms, and I assume in this case it is meant to be a weaker claim about LLMs not being conscious.

Me, I think the word "intuition" is fine, just like I'd say that a tree falling in a forest with no one to hear it does produce a sound because sound is the vibration of the air instead of the qualia.

golol · on Jan 7, 2024

Funnily, for me intuition is the part of intelligence which I can more easily imagine as being done by a neural network. When my intuition says this person is not to trust I can easily imagine that being something like a simple hyperplane classification in situation space.

It's the active, iterative thinking and planning that is more critical for AGI and, while obviousky theoretically possible, much harder to imagine a neural network performing.

edgyquant · on Jan 7, 2024

No, matrix multiplication is the system humans use to make predictions about those things but it doesn’t describe their fundamental structure and there’s no reason to imply they do.

ben_w · on Jan 7, 2024

> describe their fundamental structure

That is literally, literally, what it does.

One may argue that it does so wrongly, but that's a different claim entirely.

> there’s no reason to imply they do

The predictions matching reality to the best of our collective abilities to test them is such a reason.

The saying that "all models are wrong but some are useful" is a reason against that.

FreakLegion · on Jan 7, 2024

This simply isn't true. There are big caveats to the idea that neural networks are universal function approximators (as there are to the idea that they're universal Turing machines, which also somehow became common knowledge in our post-ChatGPT world). The function has to be continuous, we're talking about functions rather than algorithms, an approximator being possible and us knowing how to construct it are very different things, and so on.

golol · on Jan 7, 2024

>The function has to be continuouss.

That's not a problem. You can show that neural network induced functions are dense in a bunch of function spaces, just like continuous functions. Regularity is not a critical concern anyways.

>functions vs algorithms

Repeatedly applying arbitrary functions to a memory (like in a transformer) yields you arbitrary dynamical systems, so we can do algorithms too.

> an approximator being possible and us knowing how to construct it are very different things,

This is of course the critical point, but not so relevant when asking whether something is theoretically possible. The way I see it this was the big question for deep learning and over the last decade the evidence has just continually grown that SGD is VERY good at finding weights that do in fact generalize quite well and that don't just approximate a function from step-functions the way you imagine an approximation theorem to construct it, but instead efficiently find features in the intermediate layers and use them for multiple purposes, etc. My intuition is that the gradient in high dimension doesn't just decrease the loss a bit in the way we imagine it for a low dimensional plot, but in those high dimensions really finds directions that are immensely efficient at decreasing loss. This is how transformers can become so extremely good at memorization.

jddj · on Jan 7, 2024

We really need a list of verbs we're allowed to use when talking about computers and verbs that belong in the magic human/animal-only section

sandgiant · on Jan 7, 2024

You are probably joking, but I think it's actually very important to look at the language we use around LLMs, in order not to get stuck in assumptions and sociological bias associated with a vocabulary usually reserved for "magical" beings, as it were.

This goes both ways by the way. I could be convinced that LLMs can achieve something the likes of intuition, but I strongly believe that it is a very different kind of intuition than we normally associate with humans/animals. Usins the same label is thus potentially confusing, and (human pride aside) might even prevent us from appreciating the full scope of what LLMs are capable of.

jddj · on Jan 7, 2024

I think the issue is that we're suddenly trying to pin down something that was previously fine being loosely understood, but without any new information.

If someone came to the table with "intuition is the process of a system inferring a likely outcome from given inputs by the process X - not to be confused with matmultuition which is process Y", that might be a reasonable proposal.

jstummbillig · on Jan 7, 2024

> naively, it doesn't seem very surprising that enormous amounts of self play cause the internal structure to reflect the inputs and outputs?

Right. Wait, are you talking about AI or humans?

empath-nirvana · on Jan 7, 2024

Can a bunch of neurons firing based on chemical and electrical triggers intuit anything? It has to be the case that any intelligent process must be the emergent result of non-intelligent processes, because intelligence is not an inherent property of anything.

jhrmnn · on Jan 7, 2024

What does „intuit“ mean to you then?

wredue · on Jan 7, 2024

I think that “intuit the rules” is just projecting.

More likely, the 16 million games just has most of the piece move combinations. It does not know a knight moves in an L. It knows from each square where a knight can move based on 16 million games.

edgyquant · on Jan 7, 2024

No this isn’t likely. Chess has trillions of possible games[1] that could be played and if it all it took was such a small number of games to hit most piece combinations chess would be solved. It has to have learned some fundamental aspects of the game to achieve the rating stated ITT

1. https://en.m.wikipedia.org/wiki/Shannon_number#:~:text=After....

wredue · on Jan 7, 2024

It doesn’t take the consumption of all trillions of possible game states to see a majority of possible ways a piece can move from one square to another.

Maybe I misread something as I only skimmed, but the pretty weak Elo would most definitely suggest a failure of intuiting rules.

rmorey · on Jan 7, 2024

no, a weak elo just indicates poor play. he also quantifies what percent of moves the model makes which are legal, and it’s ~99%, meaning it must have learned the rules

wredue · on Jan 7, 2024

My kids also make 99% legal moves and don’t know much more than how the pieces move.

You’re really wishing a lot more in to AI than is actually there.

rmorey · on Jan 7, 2024

That's entirely my point: both your kids and ChessGPT know the rules, but still don't play very strongly. You say they "don’t know much more than how the pieces move" but that's exactly what the rules are, how the pieces are allowed to move, given the sequence of moves that have come before (i.e the state of the board.) I'm saying ChessGPT is a poor player, and didn't learn much high level play. But it definitely learned the rules!

Terretta · on Jan 7, 2024

So if not artificial, what form of intelligence have the kids reached and is it any more or less impressive?

Certainly a lot of folks wish more into kids than is really there.

btown · on Jan 7, 2024

On a board with a finite number of squares, is this truly different?

The representation of the ruleset may not be the optimal Kolmogorov complexity - but for an experienced human player who can glance at a board and know what is and isn’t legal, who is to say that their mental representation of the rules is optimizing for Kolmogorov complexity either?

baq · on Jan 7, 2024

You assert something that is a hypothesis for further research in the area. Alternative is that it in fact knows that knights move in an L-shaped fashion. The article is about testing hypotheses like that, except this particular one seems quite hard.

HarHarVeryFunny · on Jan 7, 2024

It'd seem surprising to me if it had really learnt the generalization that knights move in an L-shaped fashion, especially since it's model of the board position seems to be more probabilistic than exact. We don't even know if it's representation of the board is spatial or not (e.g. that columns a & b are adjacent, or that rows 1 & 3 are two rows apart).

We also don't know what internal representations of the state of play it's using other than what the author has discovered via probes... Maybe it has other representations effectively representing where pieces are (or what they may do next) other than just the board position.

I'm guessing that it's just using all of it's learned representations to recognize patterns where, for example, Nf3 and Nh3 are both statistically likely, and has no spatial understanding of the relationship of these moves.

I guess one way to explore this would be to generate a controlled training set where each knight only ever makes a different subset of it's legal (up to) 8 moves depending on which square it is on. Will the model learn a generalization that all L-shaped moves are possible from any square, or will it memorize the different subset of moves that "are possible" from each individual square?

pama · on Jan 7, 2024

A minor detail here is that the analysis in the blog shows that the linear model built/trained on the the activations of an internal layer has a representation of the board that is probabilistic. Of course the full model is also probabilistic by design, though it probably has a better internal understanding of the state of the board than the linear projection used to visualize/interpret the internals of the model. There is no real meaning in the word "spatial" representation beyond the particular connectivity of the graph of the locations, which seems to be well understood by the model as 98% of the moves are valid, and this includes sampling with whatever probabilistic algorithm of choice that may not always return the best move of the model.

A different way to test the internal state of the model would be to score all possible valid and invalid moves at every position and see how the probabilities of these moves would change as a function of the player's ELO rating. One would expect that invalid moves would always score poorly independent of ELO, whereas valid moves would score monotonically with how good they are (as assessed by Stockfish) and that the player's ELO would stretch that monotonic function to separate the best moves from the weakest moves for a strong player.

HarHarVeryFunny · on Jan 7, 2024

> There is no real meaning in the word "spatial" representation beyond the particular connectivity of the graph of the locations

I don't think it makes sense to talk of the model (potentially) knowing that knights make L-shaped moves (i.e. 2 squares left or right, plus 1 square up or down, or vice versa) unless it is able to add/subtract row/column numbers to be able to determine the squares it can move to on the basis of this (hypothetical) L-shaped move knowledge.

Being able to do row/column math is essentially what I mean by spatial representation - that it knows the spatial relationships between rows ("1"-"8") and columns ("a"-"h"), such that if it had a knight on e1 it could then use this L-shaped move knowledge to do coordinate math like e1 + (1,2) = f3.

I rather doubt this is the case. I expect the board representation is just a map from square name (not coordinates) to piece on that square, and that generated moves likely are limited to those it saw the piece being moved make when it had been on the same square during training - i.e. it's not calculating possible, say, knight destinations base on an L-shaped move generalization, but rather "recalling" a move it had seen during training when (among other things) it had a knight on a given square.

Somewhat useless speculation perhaps, but would seem simple and sufficient, and an easy hypothesis to test.

baq · on Jan 8, 2024

IMHO this would be a publishable result.

empath-nirvana · on Jan 7, 2024

I think one thing the conversation around LLMs has shown is how poorly defined words like "know" are.