I don't think all this is needed to prove that LLMs aren't there yet.
Here is a simple trivial one:
"make ssh-keygen output decrypted version of a private key to another file"
I'm pretty sure everyone on the LLM hypetrain will agree that just that prompt should be enough for GPT-4o to give a correct command. After all, it's SSH.
Even the basic fact that ssh-keygen is an in-place tool and does not write data to stdout is not captured strongly enough in the representation for it to be activated with this prompt. Thus, it also overwrites the existing key, and your decrypted_key file will contain "your identification has been saved with the new passphrase", lol.
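For reference, the usual workaround, precisely because `ssh-keygen -p` only edits a key file in place, is to copy the key first and strip the passphrase from the copy. The file names and passphrase below are placeholders:

```shell
# ssh-keygen -p rewrites the key file in place, so work on a copy.
cp ~/.ssh/id_ed25519 ~/.ssh/id_ed25519_decrypted
chmod 600 ~/.ssh/id_ed25519_decrypted
# -P is the old passphrase, -N '' sets an empty (no) passphrase.
ssh-keygen -p -P 'old passphrase' -N '' -f ~/.ssh/id_ed25519_decrypted
```

The original key stays encrypted; only the copy ends up passphrase-free.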
Maybe we should set up a cron job - sorry, chatgpt task - to auto-tweet this in reply to all of the openai employees' hype tweets.
Ah, I see. You phrased it in a misleading way. And once misled, non-reasoning models can't/won't back up once they're down the wrong path.
Slight improvement:
"make ssh-keygen output decrypted version of a private key to another file. Use chain reasoning, think carefully to make sure the answer is correct. Summarize the correct commands at the end."
This improved the odds for me of getting the right answer in the format you were looking for in GPT-4o and Claude.
The thing that makes the puzzle neat is that it's one that a reasonably clever person who literally just learned the rules of chess should be able to solve.
There's no nuance to it whatsoever beyond needing to demonstrate knowledge of the rules of the game.
I think you have completely forgotten what it is like to be a beginner at chess if you think that someone who has just learned the rules of the game would be able to identify that the best move is to underpromote a pawn to force a draw.
It's not about forcing a draw but recognizing that the only reasonable move loses the queen immediately.
Assessing the ending is irrelevant. All one needs to know is that having 1 piece is better than having 0 pieces. Not actually always true, but that's the beauty of this puzzle - you don't need anything other than the most basic logic to correctly solve it, at least the first move.
Is it reasonable to imagine that LLMs should be able to play chess? I feel like we're expending a whole lot of effort trying to distort a screwdriver until it looks like a spanner, and then wondering why it won't grip bolts very well.
Why should a language model be good at chess or similar numerical/analytical tasks?
I think because LLMs are convincingly good at natural language tasks, humans tend to anthropomorphize them. Due to this, it often is assumed that they are good at everything humans are capable of.
It might be a reasonable ask for an LLM to 'remember' the endgame tablebase of solved games - which is less than a GB for all games with five or fewer pieces on the board. This puzzle specifically relies on this knowledge and the knowledge of how the chess pieces move.
Yeah, I don't think they are a useful measuring stick for LLMs.
My amateur opinion is that an "AI system" resembling AGI or ASI or whatever the acronym of the day is will be modular, with different parts addressing different kinds of learning, rather than entirely end to end. One of the main milestones towards achieving this would be the ability to dynamically learn what is left to be learnt (finding gaps), and then potentially have it train itself to learn that, automatically. One of the half-milestones, I suppose, would be for humans to find gaps in the ability first of all.
I attended a talk recently where they presented research that tried to distinguish, effectively, the following two types of LLM failures:
1) inability to generalize/give the output at the "representation layer" itself
2) has the information represented, but is not able to retrieve it for the given reasonable prompt, and requires "context scaling"
Continuing to ask every LLM the same puzzle seems flawed to me, since LLMs may eventually "learn" the answer for this particular puzzle just because it has been seen somewhere, without actually becoming able to answer other chess-related puzzles in a way that at least follows the chess rules.
The board is backwards and black to move. It’s annoying in that chess puzzles should always have black on top and white on bottom, and a caption of whose move it is. It’s clear in the FEN, but the image reverses it with no explanation.
I mean, I disagree that a chess board should only ever be represented from the perspective of white. Or rather, I cannot square being even remotely decent at chess and being unable to figure out whose perspective it is from the labelling of the ranks and files.
It's just a norm and has been for centuries. For composed positions it should also generally be white to move.
You're right that this isn't necessary (particularly when the board is labeled) but by doing something weird you're just going to distract and confuse some chunk of people from the point you're trying to make - exactly as happened here.
Winning is not possible: only the queen is strong enough to win against two bishops, and that fails to the check and loss of the queen from the dark-squared bishop.
So a draw is the most one can get. Underpromoting to a knight (with check, thus avoiding the check by the bishop) is the only way to promote and keep the piece for another move.
I guess in this situation the knight against two bishops keeps the draw.
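The "promote with check" part is easy to sanity-check mechanically. Here is a minimal sketch, with no chess library, that hard-codes the squares from the FEN given in the article (8/6B1/8/8/B7/8/K1pk4/8 b - - 0 1) and tests which promotion piece on c1 would attack the white king on a2:

```python
# Squares hard-coded from the puzzle FEN: white Ka2, Ba4, Bg7; black kd2,
# pawn promoting on c1. Only geometry and blockers are modeled.

def sq(name):                      # "a2" -> (file, rank) as 0-based ints
    return ord(name[0]) - 97, int(name[1]) - 1

WHITE_KING = sq("a2")
# occupied squares after the pawn promotes on c1 (c2 is vacated)
occupied = {sq(s) for s in ("a2", "d2", "a4", "g7", "c1")}

def attacks(piece, frm, to):
    """Does `piece` on `frm` attack `to`, given blockers in `occupied`?"""
    dx, dy = to[0] - frm[0], to[1] - frm[1]
    if piece == "N":
        return sorted((abs(dx), abs(dy))) == [1, 2]
    if piece == "B" and abs(dx) != abs(dy):
        return False
    if piece == "R" and dx != 0 and dy != 0:
        return False
    if piece == "Q" and not (dx == 0 or dy == 0 or abs(dx) == abs(dy)):
        return False
    step = (dx and dx // abs(dx), dy and dy // abs(dy))
    x, y = frm[0] + step[0], frm[1] + step[1]
    while (x, y) != to:                  # walk the ray, stop at blockers
        if (x, y) in occupied:
            return False
        x, y = x + step[0], y + step[1]
    return True

for piece in "QRBN":
    gives_check = attacks(piece, sq("c1"), WHITE_KING)
    print(f"c1={piece}: {'check' if gives_check else 'no check'}")
# Only c1=N attacks a2: a knight on c1 hits a2, while a queen, rook, or
# bishop on c1 has no line to it.
```

This only verifies the first-move geometry, of course; whether the knight then holds the draw is exactly the tablebase question discussed below.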
> I guess in this situation the knight against two bishops keeps the draw.
Yes, though - I think I can say without exaggeration - no human on earth can tell you exactly which positions the knight can hold out against the bishops for the required 50 moves.
So it's a strange problem: a perceptive beginner can find the right move through a process of elimination, but even a super-GM can't be certain it's good enough, or defend it accurately against a computer. I don't see anything about that that makes it a particularly good test of an LLM.
Spring has sprung, grass has risen. I wonder where the birdies are? The bird is on the wing! Isn’t that absurd? I always heard the wing was on the bird!
Not sure what it means that the bird is on the wing.
And to explain the joke, the humour is that the rest of it is phonetic 20th century New York English. E.g. pronouncing "Absurd" more like "absoid" to rhyme with "boid".
New York guy finds Shakespearean English absoid.
The other commenter has this right: now that the joke has been explained on the internet, it will be harvested and LLMs will regurgitate variations on the explanations; then people will believe that the LLMs have become "more intelligent" in general. They have not, they just have more data for this specific test.
If you input your own amateur-level chess games in PGN format and consult o1 about them, it provides a surprisingly decent high-level game analysis. It identifies the opening played, determines the pawn structures, comments on possible middlegame plans, highlights mistakes, and can quite accurately assess whether an endgame is won or lost. Of course, as always, you should review o1's output and verify the statements, but there can be surprisingly good (chess-coach-like) hints, insights, and improvement ideas that stockfish-only analysis cannot offer.
This test will be worthless in a few weeks; it's now going into the training data. Even repeatedly posting it into LLM services (with analytics enabled) could lead to inclusion in the training data.
Interestingly, this test has been in the public domain for the last seven years, since it is part of all possible chess games with 7 or fewer pieces, which is solved and published. It is a huge file, but the five-piece dataset with the FENs is less than a GB. I wonder if it even got included in the training data earlier, or if it will be.
I don't think such datasets are going into AI training. But if this exact question keeps showing up in analytics data, and forum posts, it might end up in training sets.
Just from a personal layman's perspective: I find being able to reason about chess moves a fair way to measure a specific type of reasoning. The fact that LLMs aren't good at this shows to me that they're doing something else, which to me is equally disappointing as it is _interesting_.
I am rated exactly 1600 on chess.com, and though I don't do puzzles much, what I do know is that if you push the white king to b2 then that pawn is losing: take that pawn and then you have a two-bishop endgame, which is really, really hard.
I once had a bishop and knight endgame; I think it became a draw by repetition.
Asking AI to do this is definitely flawed. This isn't reasoning. From what I know of the two-bishop endgame, it's more of a "hey, let's trap the king in a box until you can then snipe the king with your bishop" (like: his king on h1, yours on h3, one of your bishops targeting g1 and the other bishop anywhere on the main diagonal, with no other pieces).
Just search "two-bishop checkmate is hard"; a lot of guides exist just for this purpose. Though in my 1000+ games I only got a two-bishop endgame once or twice; usually it's bishop or knight, which is just as tricky, or, if I recall, the worst: knight and knight.
Replying to your other questions:
It's been a while since I played chess regularly (in a chess club), but:
Two bishops (on different colours) is actually not that difficult. There are some simple heuristics to help you there (an LLM might actually tell you these, haven’t asked ;-))
Bishop+Knight is, in my opinion, slightly more complicated; there are some ‘tricks’ necessary to keep the king from running from one corner to the next.
Knight+knight is - in most situations - a draw (you need three knights to mate).
But there is a reasoning (see my reply above): winning is not possible (only the queen is strong enough against two bishops), so a draw should be the goal. And underpromoting to a knight is the only way to keep the piece for another move while still promoting.
It's actually surprising how many difficult puzzles can be solved by a very small look-ahead and playing the only move that doesn't lose. I've even seen strong GMs solve puzzles like this. It's especially useful when the first move in the puzzle is very clear but there might be 5 or 6 reasonable candidate moves in reply, and it's just a waste of time to compute each variation.
Humans can also be disappointing and interesting. I like to do Lichess puzzles not logged in, which mostly gives puzzles with Lichess puzzle ratings in the 1400-1600 range, with some going down to around 1200 or up to the 1700s. Presumably that is the range the average Lichess player is in.
For those who have not used Lichess, the puzzles it gives (unless you ask for a specific type) do not tell you what the goal is (mate, win material, get a winning endgame, save a bad position, etc) or how many moves it will take.
Here are some puzzles it has recently given me and their current ratings. These all have something in common.
What they have in common is that they are all mate in one. I have seen composed mate in ones that puzzled even high rated players, but they involved something unusual like the mating move was an en passant capture.
None of the above puzzles are tricks like that.
So how are enough people failing them for their ratings to be that high?
I assume Lichess has time-based puzzles, so people can be failing the puzzles because they are trying to optimize how many puzzles they solve rather than making no mistakes. Also, I suspect Lichess puzzle ratings probably do not match Lichess chess ratings (sure, they probably correlate, but I suspect there could easily be a +200 average difference or something like that between them). I can solve puzzles consistently at least 1000 rating points higher on chess.com than my chess.com rapid rating. Also, if you know these puzzles are mate in 1, they are much easier. I'm guessing you didn't know when you initially solved them, but I think it's much harder for other people to judge the difficulty of these puzzles when they know the solution is mate in 1.
Lichess does have some puzzle modes that are timed, such as Puzzle Storm where you try to see how many puzzles you can solve in a fixed time, but their puzzle rating system doesn't take into account puzzles used in the timed modes.
1..c1Q?? 2.Ba3+ loses on the spot, hence it is easily deduced that 1..c1N+! must be the best try. Consulting an online tablebase [1] shows that the resulting position is a "blessed draw": White can still mate, but not in time to evade a 50-move draw claim.
Is there a standard way to read the puzzles if it doesn’t tell you whose move it is? Also which direction is each player moving in? Does pawn move up or is pawn in home position?
For the direction, a quick way to tell from the board diagram in this case is to note that it has the column letters across the bottom and the row numbers down the right side. Row 1 is where the white pieces start and row 8 is where the black pieces start. Since this board has row 8 on the bottom, we know that the board is oriented with black pawns moving up.
For whose turn it is you can look at the FEN for the position, which is given at the bottom of the article:
> 8/6B1/8/8/B7/8/K1pk4/8 b - - 0 1
The second field in that indicates whose move it is: "b" for black and "w" for white.
You can also tell that the pawn is close to promoting from the FEN. The first field is simply a slash-separated list of the contents of each row, starting from the black side of the board (row 8). In the entry for each row, the numbers represent runs of empty squares. The letters represent white pieces (KQRBN), white pawns (P), black pieces (kqrbn), or black pawns (p).
Black pawns start at row 7, and we can see that this one is in row 2.
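The decoding above can be made concrete in a few lines of Python, just string handling, no chess library, applied to the FEN quoted above:

```python
# Decode the FEN fields for the puzzle position quoted above.
fen = "8/6B1/8/8/B7/8/K1pk4/8 b - - 0 1"
placement, side_to_move, *rest = fen.split()

rows = placement.split("/")              # rank 8 first, rank 1 last
board = []
for row in rows:
    squares = []
    for ch in row:
        if ch.isdigit():
            squares.extend("." * int(ch))   # digit = run of empty squares
        else:
            squares.append(ch)              # letter = piece (upper: white)
    board.append(squares)

print("side to move:", "black" if side_to_move == "b" else "white")
for rank, squares in zip(range(8, 0, -1), board):
    print(rank, " ".join(squares))

# The black pawn's rank is 8 minus its row index in the list.
pawn_rank = next(8 - i for i, row in enumerate(board) if "p" in row)
print("black pawn on rank", pawn_rank)      # rank 2: one step from promoting
```

Running it prints "side to move: black", the 8x8 diagram, and confirms the lone black pawn sits on rank 2, one square away from c1.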
Edit:
chat link: https://chatgpt.com/share/67962739-f04c-800a-a56e-0c2fc8c2dd...
Edit 2: Tried it on deepseek
The prompt pasted as is, it gave the same wrong answer: https://imgur.com/jpVcFVP
However, with reasoning enabled, it caught the fact that the original file is overwritten in its chain of thought, and then gave the correct answer. Here is the relevant part of the chain of thought in a pastebin: https://pastebin.com/gG3c64zD
And the correct answer:
I find it quite interesting that this seemingly 2020-era LLM problem is only correctly solved on the latest reasoning model, but cool that it works.