I ask this chess puzzle to every new LLM (gist.github.com)
25 points by thepoet 18 days ago | 50 comments



I don't think all this is needed to prove that LLMs aren't there yet.

Here is a simple trivial one:

"make ssh-keygen output decrypted version of a private key to another file"

I'm pretty sure everyone on the LLM hype train will agree that just that prompt should be enough for GPT-4o to give a correct command. After all, it's SSH.

However, here is the output command:

  ssh-keygen -p -f original_key -P "current_passphrase" -N "" -m PEM -q -C "decrypted key output" > decrypted_key
  chmod 600 decrypted_key
Even the basic fact that ssh-keygen is an in-place tool and does not write data to stdout is not captured strongly enough in the representation for it to be activated with this prompt. Thus, it also overwrites the existing key, and your decrypted_key file will contain "your identification has been saved with the new passphrase", lol.

Maybe we should set up a cron job - sorry, chatgpt task - to auto-tweet this in reply to all of the openai employees' hype tweets.

Edit:

chat link: https://chatgpt.com/share/67962739-f04c-800a-a56e-0c2fc8c2dd...

Edit 2: Tried it on deepseek

With the prompt pasted as-is, it gave the same wrong answer: https://imgur.com/jpVcFVP

However, with reasoning enabled, it caught the fact that the original file is overwritten in its chain of thought, and then gave the correct answer. Here is the relevant part of the chain of thought in a pastebin: https://pastebin.com/gG3c64zD

And the correct answer:

  # ssh-keygen -p rewrites the key file in place, so work on a copy:
  cp encrypted_key temp_key && \
  ssh-keygen -p -f temp_key -m pem -N '' && \
  mv temp_key decrypted_key
I find it quite interesting that this seemingly 2020-era LLM problem is only correctly solved on the latest reasoning model, but cool that it works.


Ah, I see. You phrased it in a misleading way. And once misled, non-reasoning models can't/won't back up once they're down the wrong path.

Slight improvement:

"make ssh-keygen output decrypted version of a private key to another file . Use chain reasoning, think carefully to make sure the answer is correct. Summarize the correct commands at the end."

This improved my odds of getting the right answer, in the format you were looking for, with GPT-4o and Claude.

These things aren't magic oracles, they're tools.


What was misleading? It's a very reasonable prompt that contains all the information required to generate the rest of the answer.

I didn't ask or expect any format. The accurate answer in whatever format is all that is expected.


It is likely to match on the typical command-line pattern of redirecting stdout.

The way I see it: if/when it does so, a non-reasoning model can't (as easily) detect that this is an error, turn back, and go down another path.

The modified prompt improves the odds somewhat by making it easier to detect a problem early on and change course, but it's not a 100% guarantee.


It looks like o1 also gets the right answer after thinking about it for 14 seconds: https://chatgpt.com/share/67962ead-a5f8-800a-bd91-9a145b993e...


The thing that makes the puzzle neat is that it's one that a reasonably clever person who literally just learned the rules of chess should be able to solve.

There's no nuance to it whatsoever beyond needing to demonstrate knowledge of the rules of the game.


I think you have completely forgotten what it is like to be a beginner at chess if you think that someone who has just learned the rules of the game would be able to identify that the best move is to underpromote a pawn to force a draw.


It's not about forcing a draw but recognizing that the only reasonable move loses the queen immediately.

Assessing the ending is irrelevant. All one needs to know is that having 1 piece is better than having 0 pieces. Not actually always true, but that's the beauty of this puzzle: you don't need anything other than the most basic logic to solve it correctly, at least for the first move.


Is it reasonable to imagine that LLMs should be able to play chess? I feel like we're expending a whole lot of effort trying to distort a screwdriver until it looks like a spanner, and then wondering why it won't grip bolts very well.

Why should a language model be good at chess or similar numerical/analytical tasks?

In what way does language resemble chess?


I think because LLMs are convincingly good at natural language tasks, humans tend to anthropomorphize them. Because of this, it is often assumed that they are good at everything humans are capable of.


Okay, but I'm not good at playing chess without seeing the chessboard. In fact I'm pretty awful at that.


It might be a reasonable ask for an LLM to 'remember' the endgame tablebase of solved games, which is less than a GB for all games with five or fewer pieces on the board. This puzzle specifically relies on this knowledge and the knowledge of how the chess pieces move.


LLM: given a sequence of words, what is the most likely next word

Chess engine: given a sequence of moves in a winning game, what is the most likely next move

I don't think LLMs will ever beat purpose-built engines, but it is not inconceivable for them to play better chess than most humans.


Yeah, I don't think they are a useful measuring stick for LLMs.

My amateur opinion is that an "AI system" resembling AGI or ASI or whatever the acronym of the day is will be modular, with different parts addressing different kinds of learning, rather than entirely end-to-end. One of the main milestones towards achieving this would be the ability to dynamically learn what is left to be learnt (finding gaps), and then potentially have it train itself to learn that, automatically. One of the half-milestones, I suppose, would be for humans to find gaps in the ability first of all.

I attended a talk recently where they presented research that tried to effectively distinguish the following two types of LLM failures:

1) inability to generalize/give the output at the "representation layer" itself

2) has the information represented, but is not able to retrieve it for the given reasonable prompt, and requires "context scaling"

Which is a step towards this goal I suppose.


Continuing to ask the same puzzle of every LLM seems flawed to me, since LLMs may eventually "learn" the answer to this particular puzzle just because it has been seen somewhere, without actually becoming able to answer other chess-related puzzles in a way that at least follows the chess rules.


It's fine until the LLMs succeed. Then you need to try another one.


What's the solution? I'm a human and I can't come up with a specific move I would call "the correct move".


The board is backwards and black to move. It’s annoying in that chess puzzles should always have black on top and white on bottom, and a caption of whose move it is. It’s clear in the FEN, but the image reverses it with no explanation.


I mean, I disagree that a chess board should only ever be represented from the perspective of white. Or rather, I cannot square being even remotely decent at chess with being unable to figure out whose perspective it is from the labelling of the ranks and files.


It's just a norm and has been for centuries. For composed positions it should also generally be white to move.

You're right that this isn't necessary (particularly when the board is labeled) but by doing something weird you're just going to distract and confuse some chunk of people from the point you're trying to make - exactly as happened here.


Note that it's black to move and black's pawn moves upwards. All moves but one run into instant counterplay from white.

Here's the board; you can enable the engine to get the answer: https://lichess.org/analysis/standard/8/6B1/8/8/B7/8/K1pk4/8...


Same!


Winning is not possible: only a queen is strong enough to win against two bishops, and promoting to one fails to a check from the dark-squared bishop, losing the queen.

So a draw is the most one can get. Underpromoting to a knight (with check, thus avoiding the check by the bishop) is the only way to promote and keep the piece for another move.

I guess in this situation the knight against two bishops keeps the draw.


> I guess in this situation the knight against two bishops keeps the draw.

Yes, though - I think I can say without exaggeration - no human on earth can tell you exactly which positions the knight can hold out against the bishops for the required 50 moves.

So it's a strange problem: a perceptive beginner can find the right move through a process of elimination, but even a super-GM can't be certain it's good enough, or defend it accurately against a computer. I don't see anything about that that makes it a particularly good test of an LLM.


I didn’t even know you could pick which piece you want to promote to. But I’m also not an average chess player.

So the correct move is to push the pawn and promote to a knight?

Thanks!


My go-to is asking an LLM to explain this poem (this is but one variation)...

   Spring has sprung, the grass iz riz,
   I wonder where da boidies iz?
   Da boid iz on da wing! Ain’t that absoid?
   I always hoid da wing...wuz on da boid!
ChatGPT failed up until o1; o1 itself did very well. deepseek-r1 7b did OK too.


I’m having a hard time with this one as a human.

Spring has sprung, grass has risen. I wonder where the birdies are? The bird is on the wing! Isn’t that absurd? I always heard the wing was on the bird!

Not sure what it means that the bird is on the wing.


It means "in flight".


> Not sure what it means that the bird is on the wing

Is this hard to find out? I mean it's easy to find this https://www.dictionary.com/browse/on-the-wing https://www.allgreatquotes.com/hamlet-quotes-114/

So it's somewhat old-timey English English.

And to explain the joke, the humour is that the rest of it is phonetic 20th century New York English. E.g. pronouncing "Absurd" more like "absoid" to rhyme with "boid".

New York guy finds Shakespearean English absoid.

The other commenter has this right: now that the joke has been explained on the internet, it will be harvested and LLMs will regurgitate variations on the explanations, then people will believe that the LLMs have become "more intelligent" in general. They have not; they just have more data for this specific test.


Apparently it's "boids" because the version that was popular in the UK came from New Jersey!


Claude 3.5 Sonnet did rather well.

I get the impression it often does as well or better than o1 on many tasks, despite not being a reasoning model.


If you input your own amateur-level chess games in PGN format and consult o1 about them, it provides surprisingly decent high-level game analysis. It identifies the opening played, determines the pawn structures, comments on possible middlegame plans, highlights mistakes, and can quite accurately assess whether an endgame is won or lost. Of course, as always, you should review o1's output and verify the statements, but there can be surprisingly good (chess-coach-like) hints, insights, and improvement ideas that Stockfish-only analysis cannot offer.


Maybe run through Stockfish first and then the LLM?
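
A minimal sketch of that pipeline, assuming a local stockfish binary on PATH (it speaks UCI on stdin); the FEN here is just the puzzle position for illustration, and the filtered output would be pasted into the LLM prompt alongside the PGN:

  # Ask the engine for an evaluation and a best move. The sleep keeps stdin
  # open long enough for the timed search to finish before EOF quits it.
  (echo 'position fen 8/6B1/8/8/B7/8/K1pk4/8 b - - 0 1'
   echo 'go movetime 3000'
   sleep 4) | stockfish | grep -E 'score|bestmove'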


This test is worthless in a few weeks, it's now going into the training data. Even repeatedly posting it into LLM services (with analytics enabled) could lead to inclusion in the training data.


Interestingly, this test has been in the public domain for the last seven years, since it is part of the set of all possible chess games with 7 or fewer pieces, which is solved and published. It is a huge dataset, but the five-piece positions with their FENs come to less than a GB. I wonder if it even got included in the training data earlier, or if it will be.


I don't think such datasets are going into AI training. But if this exact question keeps showing up in analytics data, and forum posts, it might end up in training sets.


Just from a personal layman's perspective: I find being able to reason about chess moves a fair way to measure a specific type of reasoning. The fact that LLMs aren't good at this shows me that they're doing something else, which to me is as disappointing as it is _interesting_.


I am exactly 1600 chess.com rating, and though I don't do puzzles much, what I do know is that if you push the white king to b2 then that pawn is losing; take that pawn and then you have a two-bishop endgame, which is really, really hard.

I once had a bishop-and-knight endgame; I think it became a draw by repetition.

Asking AI to do this is definitely flawed. This isn't reasoning. From what I know of the two-bishop endgame, it's more of "hey, let's trap the king in a box until you can snipe him with your bishop" (like, his king could be on h1, yours on h3, one bishop targeting g1 and the other bishop anywhere on the main diagonal, with no other pieces).

But this is very much stalematey, since I am currently pondering how to get to this position without a stalemate! If you move the bishop later, it's stalemate. Like, seriously: https://www.chess.com/forum/view/endgames/two-bishop-checkma...

Just search "two-bishop checkmate is hard"; a lot of guides exist just for this purpose. Though in my 1000+ games I only got a two-bishop endgame once or twice; usually it's bishop or knight, which is just as tricky, or, if I recall, the worst: knight and knight.


Replying to your other questions: it's been a while since I played chess regularly (in a chess club), but:

Two bishops (on different colours) are actually not that difficult. There are some simple heuristics to help you there (an LLM might actually tell you these; haven't asked ;-))

Bishop+knight is, in my opinion, slightly more complicated; there are some 'tricks' necessary to keep the king from running from one corner to the next.

Knight+knight is - in most situations - a draw (you need three knights to mate).


Oh, I didn't know knight+knight is a draw in most situations. Sorry mate!

Also, I am not sure: I thought that we were playing as white, but are we playing as black?


But there is reasoning here (see my reply above): winning is not possible (only the queen is strong enough against two bishops), so a draw should be the goal. And underpromoting to a knight is the only way to keep the piece for another move while still promoting.


It's actually surprising how many difficult puzzles can be solved with a very small look-ahead by playing the only move that doesn't lose. I've even seen strong GMs solve puzzles like this. It's especially useful when the first move in the puzzle is very clear but there might be 5 or 6 reasonable candidate moves in reply, and it's just a waste of time to compute each variation.


Humans can also be disappointing and interesting. I like to do Lichess puzzles while not logged in, which mostly gives puzzles with Lichess puzzle ratings in the 1400-1600 range, with some going down to around 1200 or up to the 1700s. Presumably that is the range the average Lichess player is in.

For those who have not used Lichess, the puzzles it gives (unless you ask for a specific type) do not tell you what the goal is (mate, win material, get a winning endgame, save a bad position, etc) or how many moves it will take.

Here are some puzzles it has recently given me and their current ratings. These all have something in common.

  1492 https://lichess.org/training/KsrR0
  1506 https://lichess.org/training/RwLfy
  1545 https://lichess.org/training/TzZdx
  1557 https://lichess.org/training/IJfT7
  1564 https://lichess.org/training/oOMz4
  1604 https://lichess.org/training/uRRck
  1661 https://lichess.org/training/jBrLX
  1719 https://lichess.org/training/cpKAM
What they have in common is that they are all mate in one. I have seen composed mate in ones that puzzled even high rated players, but they involved something unusual like the mating move was an en passant capture.

None of the above puzzles are tricks like that.

So how are enough people failing them for their ratings to be that high?


I assume Lichess has time-based puzzle modes, so people can be failing the puzzles because they are trying to optimize how many puzzles they solve rather than making no mistakes. Also, I suspect Lichess puzzle ratings probably do not match Lichess chess ratings (sure, they probably correlate, but I suspect there could easily be a +200 average difference or something like that between them). I can consistently solve puzzles rated at least 1000 points higher on chess.com than my chess.com rapid rating. Also, if you know these puzzles are mate in 1 then they are much easier. I'm guessing you didn't know when you initially solved them, but I think it's much harder for other people to judge the difficulty of these puzzles when they already know the solution is mate in 1.


Lichess does have some puzzle modes that are timed, such as Puzzle Storm where you try to see how many puzzles you can solve in a fixed time, but their puzzle rating system doesn't take into account puzzles used in the timed modes.


1..c1Q?? 2.Ba3+ loses on the spot, hence it is easily deduced that 1..c1N+! must be the best try. Consulting an online tablebase [1] shows that the resulting position is a "blessed loss" for Black: White can still mate, but not in time to evade a 50-move draw claim.

[1] https://syzygy-tables.info/?fen=8/6B1/8/8/B7/8/K1pk4/8_b_-_-...
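
For the curious, the same lookup can be scripted; a small sketch assuming the public lichess tablebase endpoint (spaces in the FEN become underscores in the URL):

  # Probe the Syzygy tablebase for the puzzle position (black to move).
  # The JSON reply categorizes each legal move; only the underpromotion
  # should come back as a draw-ish result rather than an outright loss.
  curl -s 'https://tablebase.lichess.ovh/standard?fen=8/6B1/8/8/B7/8/K1pk4/8_b_-_-_0_1'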


A while ago, dynomight went in and tried to explain what's going on, and why LLMs are bad at chess.

See: https://dynomight.net/more-chess/

HN Discussion of that article: https://news.ycombinator.com/item?id=42206817

(Edit: it turns out "it's complicated".)


Is there a standard way to read the puzzles if it doesn't tell you whose move it is? Also, which direction is each player moving in? Does the pawn move up, or is the pawn in its home position?


For the direction, a quick way to tell from the board diagram in this case is to note that it has the column letters and row numbers across the bottom and right side. Row 1 is where the white pieces start and row 8 is where the black pieces start. Since this board has row 8 on the bottom, we know the board is oriented with black pawns moving up.

For whose turn it is you can look at the FEN for the position, which is given at the bottom of the article:

> 8/6B1/8/8/B7/8/K1pk4/8 b - - 0 1

The second field in that indicates whose move it is: "b" for black and "w" for white.

You can also tell that the pawn is close to promoting from the FEN. The first field is simply a slash-separated list of the contents of each row, starting from the black side of the board (row 8). In the entry for each row, the numbers represent runs of empty squares. The letters represent white pieces (KQRBN) or white pawns (P), and black pieces (kqrbn) or black pawns (p).

Black pawns start at row 7, and we can see that this one is in row 2.
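
A quick shell-level illustration of pulling those fields apart (the FEN is the one quoted above):

  fen='8/6B1/8/8/B7/8/K1pk4/8 b - - 0 1'
  # Field 2 is the side to move; field 1 lists rows 8 down to 1,
  # so the lowercase "p" in its 7th entry is the black pawn on row 2.
  echo "$fen" | awk '{print "side to move:", ($2 == "b" ? "black" : "white")}'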


Thank you!



