Something weird is happening with LLMs and Chess (dynomight.net)
140 points by gregorymichael 38 days ago | 57 comments



Is gpt-3.5-turbo-instruct function calling a chess-playing model instead of generating through the base LLM?

This is not "cheating" in my opinion... in general better for LLMs to know when to call certain functions, etc.


I don't know... It's like claiming that Samsung "enhanced their phone camera abilities" when they replaced zoomed-in moon shots with hi-res images of the moon.

https://www.samsungmobilepress.com/feature-stories/how-samsu...


I think that's meaningfully different. If you ask for chess advice, and get chess advice, then your request was fulfilled. If you ask for your photo to be optimized, and they give you a different photo, they haven't fulfilled your request. If GPT was giving Go moves instead of Chess moves, then it might be a better comparison, or maybe generating random moves. The nature of the user's intent is just too different.


>it's like claiming that Samsung "enhanced their phone camera abilities" when they replaced zoomed-in moon shots with hi-res images of the moon

to be fair, the human visual system does the same


It's cheating to the extent that it misrepresents the strength and reasoning ability of the model, to the extent that anyone is going to look at its chess-playing results and incorrectly infer that this says anything about how good the model is.

The takeaway here is that if you are evaluating different models for your own use case, the only indication of how useful each may be is to test it on your actual use case, and ignore all benchmarks or anything else you may have heard about it.


It represents the reasoning ability of the model to correctly choose and use a tool... Which seems more useful than a model that can do chess by itself but when you need it to do something else, it keeps playing chess.


Where it’ll surprise people is if they don’t realize it’s using an external tool and expect it to be able to find solutions of similar complexity to non-chess problems, or if they don’t realize this was probably a special case added to the program and that this doesn’t mean it’s, like, learned how to go find and use the right tool for a given problem in a general case.

I agree that this is a good way to enhance the utility of these things, though.


It doesn't take much to recognize a sequence of chess moves. A regex could do that.
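
For instance, a rough (and deliberately over-matching) pattern for standard algebraic notation, just as a sketch:

    import re

    # Rough pattern for one SAN move: optional piece letter, optional
    # disambiguation, optional capture, destination square, optional
    # promotion, optional check/mate marker, plus castling.
    SAN_MOVE = re.compile(
        r"(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?"
    )

    text = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
    print(SAN_MOVE.findall(text))  # ['e4', 'e5', 'Nf3', 'Nc6', 'Bb5', 'a6']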

If what you want is intelligence and reasoning, there is no tool for that - LLMs are as good as it gets for now.

At the end of the day it either works on your use case, or it doesn't. Perhaps it doesn't work out of the box but you can code an agent using tools and duct tape.


Do you really think it's feasible to maintain and execute a set of regexes for every known problem every time you need to reason about something? Welcome to the 1970s AI winter...


No I don't - I'm saying that tool use is no panacea, and availability of a chess tool isn't going to help if what YOU need is a smarter model.


Sure, but how do you train a smarter model that can use tools, without first having a less smart model that can use tools? This is just part of the progress. I don't think anyone claims this is the endgame.


I really don't understand what point you are trying to make.

Your original comment about a model that might "keep playing chess" when you want it to do something else makes no sense. This isn't how LLMs work - they don't have a mind of their own, but rather just "go with the flow" and continue whatever prompt you give them.

Tool use is really no different than normal prompting. Tools are internally configured as part of the hidden system prompt. You're basically just telling the model to use a specific tool in specific circumstances, and the model will have been trained to follow instructions, so it does so. This is just the model generating the most expected continuation as normal.
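
For reference, declaring a tool is just part of the request; a minimal sketch with the OpenAI Python SDK (the chess tool itself is made up here, not something OpenAI is known to ship):

    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "best_chess_move",  # hypothetical tool
            "description": "Return a strong move for the given position.",
            "parameters": {
                "type": "object",
                "properties": {"fen": {"type": "string"}},
                "required": ["fen"],
            },
        },
    }]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6, white to move?"}],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)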


"Is gpt-3.5-turbo-instruct function calling a chess-playing model instead of generating through the base LLM?"

I'm absolutely certain it is not. gpt-3.5-turbo-instruct is one of OpenAI's least important models (by today's standards) - it exists purely to give people who built software on top of the older completion models something to port their code to (if it doesn't work with instruction-tuned models).

I would be stunned if OpenAI had any special-case mechanisms for that model that called out to other systems.

When they have custom mechanisms - like Code Interpreter mode - they tell you about them.

I think it's much more likely that something about instruction tuning / chat interferes with the model's ability to really benefit from its training data when it comes to chess moves.


It should be easy to test for. An LLM playing chess itself tries to predict the most likely continuation of a partial game it is given, which includes (it has been shown) internally estimating the strength of the players to predict equally strong or weak moves.

If the LLM is just a pass-through to a chess engine, then it is more likely to play at the same strength all the time.

It's not clear in the linked article how many moves the LLM was given before being asked to continue, or if these were all grandmaster games. If the LLM still crushes it when asked to continue a half played poor quality game, then that'd be a good indication it's not an LLM making the moves (since it would be smart enough to match the poor quality of play).


This is the point!

LLMs have this unique capability. Yet, every AI company seems hell bent on making them... not have that.

I want the essence of this unique aspect, but better, not this unique aspect diluted with other aspects such as the pure logical perfection of ordinary computer software. I already have that!

The problem with every extant AI company is that they're trying to make finished, integrated products instead of a component.

It's as-if you just wanted a database engine and every database vendor insisted on selling you a shopfront web app that also happens to include a database in there somewhere.


If it read the CLI docs it might just make the right calls (x --ELO:1400).
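
With a UCI engine that's basically two options; a sketch via python-chess (assumes a stockfish binary on PATH):

    import chess
    import chess.engine

    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"UCI_LimitStrength": True, "UCI_Elo": 1400})
    result = engine.play(board, chess.engine.Limit(time=0.1))
    print(result.move)
    engine.quit()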


If that's what it does, then it's "cheating" in the sense that people think they're interacting with an LLM, but they're actually interacting with an LLM + chess engine. This could give the impression that LLM's are able to generalize to a much broader extent than they actually are – while it's actually all just a special-purpose hack. A bit like putting invisible guard rails on some popular difficult test road for self-driving cars – it might lead you to think that it's able to drive that well on other difficult roads.


Calling out to some chess-playing function would be a deviation from the pure LLM paradigm. As a medium-level chess player I have walked through some of the LLM victories (gpt-3.5-turbo-instruct); I find it is not very good at winning by mate - it misses several chances at forced mate. But forced mate is exactly what chess engines are good at - it can be calculated by exhaustive search of the valid moves from a given board position.

So I'm arguing that it doesn't call out - it would have gotten better advice if it did.

But I remain amazed that OP does not report any illegal moves made by any of the LLMs. Assuming the training material includes introductory texts on chess and a lot of chess games in textual notation (e.g. PGN), I would expect at least occasional illegal moves, since the rules are defined in terms of board positions. And board positions are a non-trivial function of the sequence of moves made in a game. Does an LLM silently perform a transformation of the move sequence into a board position? Can LLMs, during training, read and understand the board-position diagrams in chess books?
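
That transformation is trivial to do outside the model; a sketch with python-chess, just to show that the legal-move set depends on the whole move history:

    import chess

    board = chess.Board()
    for san in ["e4", "e5", "Nf3", "Nc6", "Bb5"]:
        board.push_san(san)          # replay the game so far
    print(board.fen())               # current position
    print(sorted(board.san(m) for m in board.legal_moves))  # Black's legal replies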


> But I remain amazed that OP does not report any illegal moves made by any of the LLMs.

They did (but not enough detail to know how much of an impact it had):

> For the open models I manually generated the set of legal moves and then used grammars to constrain the models, so they always generated legal moves. Since OpenAI is lame and doesn’t support full grammars, for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly.
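
In code, that fallback amounts to something like this (a sketch; query_model is a hypothetical stand-in for the completion call):

    import random
    import chess

    def next_move(board: chess.Board, query_model) -> chess.Move:
        for _ in range(10):
            san = query_model(board)         # model's proposed move, e.g. "Nf3"
            try:
                return board.parse_san(san)  # raises ValueError if illegal
            except ValueError:
                continue
        return random.choice(list(board.legal_moves))  # give up: random legal move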


I don't think it is, since OpenAI never mentions that anywhere AFAIK. That would be a really niche feature to include and then drop instead of building on more.


GPT-3.5 has had function-calling capability (for users) since July 2023. [1]

Yes, they have never mentioned whether the 3.5 model already does this in the back end for certain features.

Anyone at OpenAI care to comment? It's not a particularly controversial topic.

[1] https://openai.com/index/function-calling-and-other-api-upda...


That was also my first thought. The discrepancy is just too large to be the mere result of a transformer model fed more chess data.


That seems the most likely scenario to me.

Helping that along is that it's an obvious scenario to optimize, for all kinds of reasons. One of them being that it is a fairly good "middle of the road" test for integrating with such systems; not as trivial as "Let's feed '1 + 1' to a calculator" and nowhere near as complicated as "let's simulate an entire web page and pretend to click on a thing" or something.


Why would they only incorporate a chess engine into (seemingly) exactly one very old, dated model? The author tests o1-mini and gpt-4o. They both fail at chess.


Because they decided it wasn't worth the effort. I can point to any number of similar situations over the many years I've been working on things. Bullet-point features that aren't pulling their weight or are no longer attracting the hype often don't make it across upgrades.

A common myth that people have is that these companies have so much money they can do everything, and then they're mystified by things like bugs in Apple or Microsoft projects that survive for years. But from any given codebase, the space of "things we could do next" is exponential. That defeats any amount of money. If they're considering porting their bespoke chess engine code up to the next model, which absolutely requires non-trivial testing and may require non-trivial work, even for the richest companies in the world it is still an opportunity cost and they may not choose to spend their time there.

I'm not saying this is the situation for sure; I'm saying that this explanation is sufficient that I'm not going "oh my gosh this situation just isn't possible". It's definitely completely possible and believable.


Based on looking at the games at the end of the post, it seems unlikely. Both sides play extremely poorly — gpt-instruct is just slightly less bad — and I don't see any reasonable engine outputting those moves.


If the goal is to produce an LLM-like interface that generates correct output, then sure, it's not cheating... but is it really a data-driven LLM at that point? If the LLM amounts to a chat frontend that calls a host of human-prepared programs or draws from human-prepared databases, etc., it starts to sound a lot more like Wolfram Alpha v2 than an LLM, and strikes me as walking away from AGI rather than toward it.


There is an obvious difference on the OpenAI side. gpt-3.5-turbo-instruct is the only remaining decent model with raw text completion API access (RIP text-davinci-003 and code-davinci-002). All the others are only available in an abstract fashion through the wrapper that is the "system/role" API.

I still use gpt-3.5-turbo-instruct a lot because raw text completion is so much more powerful than the system/role abstraction. With the system/role abstraction you literally cannot present exactly the text you want to the model and have it go. It's always wrapped in OpenAI junk prompt text you can't see or know about (and one that allows OpenAI to cache their static pre-prompts internally to better share resources, versus just allowing users to decide what the model sees).


This intensifies my theory that some of the older OAI models are far more capable than advertised but in ways that are difficult to productize.

How unlikely is it that in training of these models you occasionally run into an arrangement of data & hyperparameters that dramatically exceeds the capabilities of others, even if the others have substantially more parameters & data to work with?


It could also be as simple as OAI experimenting on different datasets. Perhaps Chess games were included in some GPT-3.5 training runs in order to see if training on chess would improve other tasks. Perhaps afterwards it was determined that yes, LLMs can play chess - but no let's not spend time/compute on this.


Would be a shame, because chess is an excellent metric for testing logical thought and internal modeling. An LLM that can pick up a unique chess game halfway through and play it ideally to completion is clearly doing more than "predicting the next token based on the previous one".


> chess is an excellent metric for testing logical thought and internal modeling

Is it, though? Apparently nobody else cared to use it to benchmark LLMs until this article.


People had noticed this exact same discrepancy between 3.5-turbo-instruct and 4 a year ago: https://x.com/GrantSlatton/status/1703913578036904431


Clearly the performance of the "instruct" LLM is due to some odd bug or other issue. I do not believe that it is fundamentally better at chess than the others, even if it was specifically trained on far more chess data, which is unlikely. Lack of correlation is also not causation.


Perhaps the tokenizer is the entire problem? I wonder how the chat-based LLMs would perform if you explained the move with text, rather than chess notation? I can easily imagine that the "instruct" LLM uses a specialized tokenizer designed to process instructions, similar in many ways to chess notation.


The tokenizer does seem like it has a serious drawback for Chess notation.

The model is forced to lock in the file before the rank when using chess notation. It can't consider a move as a whole, and has to model longer-range dependencies to accurately predict the best next move... and once the file is committed, the move it would otherwise prefer in that specific position may no longer be reachable.
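
It's easy to peek at how moves actually get split; a sketch with tiktoken (the exact splits depend on the tokenizer, so treat the output as illustrative):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for move in ["e4", "Nf3", "Qxd5", "O-O-O", "1. e4 e5 2. Nf3"]:
        ids = enc.encode(move)
        print(move, "->", [enc.decode([i]) for i in ids])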


> Theory 2: GPT-3.5-instruct was trained on more chess games.

Alternatively, somebody who prepared the training materials for this specific ANN had some spare time and decided to preprocess them so that, during training, the model was only asked to predict the moves of the winning player - and that individual whimsy was never repeated in the training of any other model.


Having seen bit rot in action, I totally buy this explanation. Some PhD did this in their spare time and then left, and when it didn't work in the gpt-4 training branch, it just got commented out by someone else and then forgotten.


Wouldn't surprise me if the outcome would have been exactly the same if instead of LLM's this was tried with a random selection of human beings :-)


I don't think LLMs are suitable for chess. Witness the result of a chess game this year between ChatGPT (OpenAI) and Gemini (Google AI). The LLMs had a very poor record at picking legal moves, let alone good moves:

"Gemini's mistakes and ChatGPT's misses tell a lot of the story. One AI kept giving the other one opportunities, the other kept refusing those gifts. The good news for ChatGPT is that it made more "Good" or better moves than Gemini made mistakes and blunders. The bad news for Gemini is, well, almost everything that happened. ...

The final illegal move tally was Gemini 32, ChatGPT 6. That makes sense; it would have been crazy if the AI good enough to win was also bad enough to make more illegal moves. But it also means Gemini only went 50% in picking a legal move, while ChatGPT was over 80%."

https://www.chess.com/article/view/chatgpt-gemini-play-chess


A recent discovery: computer chess, which was traditionally based on searching a combinatorial space (i.e. NP-hard), can instead be handled by transformer models that actually play at a high Elo level.

https://arxiv.org/pdf/2402.04494

If you think about using search to play chess, it can go several ways.

Brute-forcing all chess moves (NP hard) doesn’t work because you need almost infinite compute power.

If you use a chess engine with clever heuristics to eliminate bad solutions, you can solve it in finite time.

But if you learn from the best humans performing under different contexts (transformers are really good at capturing context in sequences and predicting the next token from that context — hence their utility in LLMs) you have narrowed your search space even further, to only the set of good moves (by grandmaster standards).


The news that got my attention was https://arxiv.org/html/2402.04494v1 "Grandmaster-Level Chess Without Search" and an engine using this approach exceeding 2900 on Lichess (against humans) and over 2200 against engines.

I actually started a chess channel a couple of hours ago to help humans take advantage of this (nothing there yet: https://youtube.com/@DecisiveEdgeChess ).

I have long taught my students that it's possible to assess positions at a very high level with very, very little calculation, and this news hit me as "finally, evidence enough to intrigue more people that this is possible!" (My interest in chess goes way back. I finished in the money in the U.S. Open and New York Open in the '80s, and one of my longtime friends was IM Mike Valvo, since passed, who was the arbiter for the 1996 match between Garry Kasparov and IBM's Deep Blue, and a commentator for the '97 match alongside Grandmasters Yasser Seirawan and Maurice Ashley.)


You might be interested in GM Matthew Sadler's work; he agrees with you. Here's a link:

https://www.newinchess.com/media/wysiwyg/product_pdf/9073.pd...

He's been arguing that "intuition", i.e., reading a position based on your understanding of the game and not on calculation, is a big deal.


Yep that paper is from the same team that produced the paper I linked to.

But yes the title is much more exciting - “grandmaster level chess without search”.



Anecdotally, I found the same in terms of art and text output in interview conversations with college students at a bar in my local area. The problem does not appear to be localized to chess output. Went something like:

"Wow, these Chatbots are amazing, look at this essay or image it made me!." (shows phone)

"Although, that next ChatGPT seems lobotomized. Don't know how to make it give me stuff as cool as what it made before."


I do wonder how much of this is the prompt. GPT-4 was doing really well against me when it launched, but my prompt was quite different.


I wonder how a transformer (even an existing LLM architecture) would do if it was trained purely on chess moves - no language at all. The limited vocabulary would also be fantastic for training time, as the network would be inherently smaller.
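
The vocabulary really would be tiny; a sketch that enumerates UCI move strings with python-chess (roughly 20k strings, most of which can never occur in a legal game, versus a ~100k-token text vocabulary):

    import chess

    vocab = set()
    for a in chess.SQUARES:
        for b in chess.SQUARES:
            if a == b:
                continue
            uci = chess.square_name(a) + chess.square_name(b)
            vocab.add(uci)
            for promo in "qrbn":
                vocab.add(uci + promo)  # e.g. "e7e8q"

    token_to_id = {tok: i for i, tok in enumerate(sorted(vocab))}
    print(len(token_to_id))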



My comment from substack:

Not criticizing the monocausal theories, but LLMs "do a bunch of stuff with a bunch of data" and if you ask them why they did something in particular, you get a hallucination. To be fair, humans will most often give you a moralized post hoc rationalization if you ask them why they did something in particular, so we're not far from hallucination.

To be more specific, the models change BOTH the "bunch of stuff" (training setup and prompts) and the "bunch of data", and those changes interact in deep and chaotic (as in chaos theory) ways.

All of this really makes me think about how we treat other humans. Training an LLM is a one-way operation; you can't really retrain one part of an LLM (as I understand it). You can do prompt engineering, and you can do some more training, but those interact in deep and chaotic ways.

I think you can replace LLM with human in the previous paragraph and not be too far wrong.


Strongly implied but not explicitly stated in here - all these LLMs were able to consistently generate legal moves? All the way to the end of a game?

Seems noteworthy enough in itself, before we discuss their performance as chess players.


Towards the end of the blog post the author explains that he constrained the generation to only tokens that would be legal. For the OpenAI models he generated up to 10 different outputs until he got one that was legal, or just randomly chose a move if it failed.


> For the OpenAI models he generated up to 10 different outputs until he got one that was legal, or just randomly chose a move if it failed.

I wonder how often they failed to generate a move. That feels like it could be a meaningful difference.


gpt-3.5-turbo-instruct had something like 5 (or fewer) illegal moves in 8205

https://github.com/adamkarvonen/chess_gpt_eval

I expect the rest to be much worse if 4's performance is any indication


And the most notable part of that:

> Most of gpt-4's losses were due to illegal moves

3.5-turbo-instruct definitely has some better chess skills.


Some people are good at playing chess, and others aren’t.


Should've tuned the LLM for chess.



