
> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires “understanding” chess.

Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions (in which neither side is checkmated yet). Such positions can be generated using the ChessPositionRanking project at [1]. Does it still rarely suggest illegal moves in these totally weird positions, which will be completely unlike any it would have seen in training (and in which the legal move choice is often highly restricted)?

While good for testing legality of next moves, these positions are not so useful for distinguishing their quality, since usually one side already has an overwhelming advantage.

[1] https://github.com/tromp/ChessPositionRanking
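
For anyone who wants a quick-and-dirty version of this test without the Haskell tooling, here is a minimal sketch using the python-chess library. Unlike ChessPositionRanking it does not sample uniformly over legal positions (it just rejection-samples random piece placements), and query_model is a hypothetical callback standing in for however you prompt the LLM with a FEN; it's only meant to show the shape of the experiment.

    import random
    import chess

    def random_position(max_extra_pieces=12):
        """Rejection-sample a 'weird' but valid position (not uniform sampling)."""
        while True:
            board = chess.Board(None)                    # start from an empty board
            board.turn = random.choice([chess.WHITE, chess.BLACK])
            squares = random.sample(chess.SQUARES, 2 + max_extra_pieces)
            board.set_piece_at(squares[0], chess.Piece(chess.KING, chess.WHITE))
            board.set_piece_at(squares[1], chess.Piece(chess.KING, chess.BLACK))
            for sq in squares[2:]:
                ptype = random.choice([chess.PAWN, chess.KNIGHT, chess.BISHOP,
                                       chess.ROOK, chess.QUEEN])
                if ptype == chess.PAWN and chess.square_rank(sq) in (0, 7):
                    continue                             # no pawns on the back ranks
                board.set_piece_at(sq, chess.Piece(ptype, random.choice([chess.WHITE, chess.BLACK])))
            if board.is_valid() and not board.is_checkmate() and not board.is_stalemate():
                return board

    def illegal_move_rate(query_model, n=1000):
        """query_model(fen) -> SAN move string; count how often it is illegal."""
        illegal = 0
        for _ in range(n):
            board = random_position()
            try:
                board.parse_san(query_model(board.fen()))
            except ValueError:                           # illegal, ambiguous, or garbage SAN
                illegal += 1
        return illegal / n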

Interesting tidbit I once learned from a chess livestream. Even human super-GMs have a really hard time "scoring" or "solving" extremely weird positions. That is, positions that wouldn't arise from regular, logical opening, middlegame, and endgame play.

It's absolutely amazing to see a super-GM (in that case it was Hikaru) look at a position and basically "play-by-play" it from the beginning, to show people how the players got into that position. It wasn't his game, btw. But later in that same video, when asked, he explained what I wrote in the first paragraph. It works with proper games, but it rarely works with weird random chess puzzles, as he put it. Or, in other words, chess puzzles that come from real games are much better than "randomly generated" ones, and make more sense even to the best of humans.


"Even human super-GMs have a really hard time "scoring" or "solving" extremely weird positions. "

I can sort of confirm that. I never learned all the formal theoretical standard chess strategies except for the basic ones. So when playing against really good players, way above my level, I could sometimes win (or almost win) simply by making unconventional (dumb by normal strategy) moves in the beginning, resulting in a non-standard game where I could apply pressure in a way the opponent was not prepared for (they also underestimated me after the initial dumb moves). For me, the unconventional game was just like a standard game, since I had no routine, but for the experienced player it was far more challenging. Then of course, in the standard situations to which almost every chess game eventually evolves, they destroyed me, simply through experience and routine.


Huh it's funny, in fencing that also works to a certain degree.

You can score points against e.g. national team members who've been 5-0'ing the rest of the pool by doing weird cheap tricks. You won't win though, because after one or two points they will adjust and then wreck you.

And on the flip side, if you're decently rated (B ~ A ish) and are used to just standard fencing, if you run into someone who's U ~ E and does something weird like literally not move their feet, it can take you a couple touches to readjust to someone who doesn't behave normally.

Unlike chess though, in fencing the unconventional stuff only works for a couple points. You can't stretch that into a victory, because after each point everything resets.

Maybe that's why pentathlon (single touch victory) fencing is so weird.


Watching my son compete in a fighting game tournament at a professional level, I can confirm this also exists in that realm. And probably other realms; I think it's more of a general concept of unsettling the better opponent so that you can have a short-term advantage at the beginning.

> So when playing against really good players, way above my level, I could sometimes win (or almost win) simply by making unconventional (dumb by normal strategy) moves in the beginning, resulting in a non-standard game where I could apply pressure in a way the opponent was not prepared for (they also underestimated me after the initial dumb moves).

IIRC Magnus Carlsen is said to do something like this as well - he'll play opening lines that are known to be theoretically suboptimal to take his opponent out of prep, after which he can rely on his own prep/skills to give him better winning chances.


The book Chess for Tigers by Simon Webb explicitly advises this. Against "heffalumps" who will squash you, make the situation very complicated and strange. Against "rabbits", keep the game simple.

In The Art of Learning, Joshua Waitzkin talks about how this was a strategy for him in tournaments as a child as well. While most other players were focusing on opening theory, he focused on end game and understanding how to use the different pieces. Then, by going with unorthodox openings, he could easily bring most players outside of their comfort zone where they started making mistakes.

That expert players are better at recreating real games than 'fake' positions is one of the things Adriaan de Groot (https://en.wikipedia.org/wiki/Adriaan_de_Groot) noticed in his studies of expert chess players. ("Thought and Choice in Chess" is worth reading if you're interested in how chess players think. He anonymized his subjects, but Euwe apparently was one of them.)

Another thing he noticed is that, when asked to set up a game they were shown earlier, the errors expert players made were often insignificant. For example, they would set up the pawn structure on the kingside incorrectly if the game's action was on the other side of the board, move a bishop by a square in a way that didn't make a difference for the game, or even add a piece that wasn't active on the board.

Beginners would make different errors, some of them hugely affecting the position on the board.


Super interesting (although it also makes some sense that experts would focus on "likely" subsets, given that the number of possible chess games is far too high for it to be feasible to learn them all)! That said, I still imagine that even most intermediate chess players would reliably make only _legal_ moves in weird positions, even if low-quality ones.

This is technically true, but the kind of comment that muddies the waters. It's true that GM performance is better in realistic games.

It is false that GMs would have any trouble determining legal moves in randomly generated positions. Indeed, even a 1200 level player on chess.com will find that pretty trivial.


As someone who finds chess problems interesting (I'm bad at them), I'd say they're really a third sort of thing: good chess problems are rarely taken from live play; they're a specific genre which follows its own logic.

Good ones are never randomly generated, however. Also, the skill doesn't fully transfer in either direction between live play and solving chess problems. Reconstructing the prior state of the board definitely doesn't come into it, since there's nothing there to reconstruct.

So yes, everything Hikaru was saying there makes sense to me, but I don't think your last sentence follows from it. Good chess problems come from good chess problem authors (interestingly this included Vladimir Nabokov), they aren't random, but they rarely come from games, and tickle a different part of the brain from live play.


Would love a link to that video!

It’s kind of crazy to assert that the systems understand chess, and then disclose further down the article that sometimes he failed to get a legal move after 10 tries and had to sub in a random move.

A person who understands chess well (Elo 1800, let’s say) will essentially never fail to provide a legal move on the first try.


He is testing several models, some of which cannot reliably output legal moves. That's different from saying all models including the one he thinks understands can't generate a legal move in 10 tries.

3.5-turbo-instruct's illegal move rate is about 5 or fewer in 8,205 moves.


I also wonder what kind of invalid moves they are. There's "you can't move your knight to j9 that's off the board", "there's already a piece there" and "actually that would leave you in check".

I think it's also significantly harder to play chess if you were to hear a sequence of moves over the phone and had to reply with a followup move, with no space or time to think or talk through moves.
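
If someone wants to break the failures down that way, python-chess makes the bucketing fairly easy. A sketch (the category names are mine, and the move is assumed to arrive in UCI form like "g1f3"):

    import chess

    def classify_bad_move(board: chess.Board, uci: str) -> str:
        try:
            move = chess.Move.from_uci(uci)
        except ValueError:
            return "not even a coordinate pair"      # the 'j9' style of nonsense
        if move in board.legal_moves:
            return "legal"
        if move in board.pseudo_legal_moves:
            return "leaves own king in check"        # the piece movement itself is fine
        piece = board.piece_at(move.from_square)
        if piece is None or piece.color != board.turn:
            return "no piece of yours on that square"
        return "piece can't move that way"           # blocked, wrong geometry, own piece on target, ...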


What do you mean by "understand chess"?

I think you don't appreciate how good the level of chess displayed here is. It would take an average adult years of dedicated practice to get to 1800.

The article doesn't say how often the LLM fails to generate legal moves in ten tries, but it can't be often or the level of play would be much much much worse.

As seems often the case, the LLM seems to have a brilliant intuition, but no precise rigid "world model".

Of course words like intuition are anthropomorphic. At best a model for what LLMs are doing. But saying "they don't understand" when they can do _this well_ is absurd.


> But saying "they don't understand" when they can do _this well_ is absurd.

When we talk about understanding a simple axiomatic system, understanding means exactly that the entirety of the axioms are modeled and applied correctly 100% of the time. This is chess, not something squishy like literary criticism. There’s no need to debate semantics at all. One illegal move is a deal breaker.

Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.

We can argue whether an error rate of 1 in a million means that it plays like a grandmaster or a novice, but that’s less interesting. It failed to model a simple system correctly, and a much shorter/simpler program could do that. Doesn’t seem smart if our response to this as an industry is to debate semantics, ignore the issue, and work feverishly to put it to work modeling more complicated / critical systems.


> When we talk about understanding a simple axiomatic system, understanding means exactly that the entirety of the axioms are modeled and applied correctly 100% of the time.

Yes, but then, when we talk about understanding in LLMs, we talk about existence, but not necessarily about determinism.

Remember that, while chess engines are (I guess?) deterministic systems, LLMs are randomized systems. You give the same context and the same prompt multiple times, and each and every time you get a different response.

To me, this, together with the fact that you have at least a 1-in-10 chance of getting a good move (even for strange scenarios), means that understanding _does exist_ inside the LLM. The problem that follows from this is how to force the LLM to reliably choose the right "paths of thought" (sorry for the metaphor).
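
As I understand the article, the harness already does something like this: resample up to ten times, keep the first legal move, and only then fall back to a random legal move. A sketch of that loop (sample_move is a hypothetical callback that prompts the model with the game so far and returns a SAN string; python-chess does the legality check):

    import random
    import chess

    def next_move(board: chess.Board, sample_move, tries: int = 10) -> chess.Move:
        for _ in range(tries):
            san = sample_move(board)         # stochastic: each call is a fresh sample
            try:
                return board.parse_san(san)  # only returns if the move is legal here
            except ValueError:
                continue                     # illegal or unparseable, sample again
        return random.choice(list(board.legal_moves))   # the fallback described in the article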


You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. It's also contradicted by the person I was replying to in the sibling comment, where they argue that Stockfish doesn't understand chess, despite Stockfish of course having the "axioms" modeled and applied correctly 100% of the time.

Here are things people say:

Magnus Carlsen has a better understanding of chess than I do. (Yet we both know the precise rules of the game.)

Grandmasters have a very deep understanding of chess, despite occasionally making moves that are against the rules (https://www.youtube.com/watch?v=m5WVJu154F0).

"If AlphaZero were a conventional engine its developers would be looking at the openings which it lost to Stockfish, because those indicate that there's something Stockfish understands better than AlphaZero." (https://chess.stackexchange.com/questions/23206/the-games-al...)

> Undergraduate CS homework for playing any game with any technique would probably have the stipulation that any illegal move disqualifies the submission completely. Whining that it works most of the time would just earn extra pity/contempt as well as an F on the project.

How exactly is this relevant to the question whether LLMs can be said to have some understanding of chess? Can they consistently apply the rules when game states are given in pgn? No. _Very_ few humans without specialized training could either (without using a board as a tool to keep track of the implicit state). They certainly "know" the rules (even if they can't apply them) in the sense that they will state them correctly if you ask them to.

I am not particularly interested in "the industry". It's obvious that if you want a system to play chess, you use a chess engine, not an LLM. But I am interested in what their chess abilities teach us about how LLMs build world models. E.g.:

https://aclanthology.org/2024.acl-srw.48/


Thanks for your thoughtful comment and refs to chase down.

> You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. ... Here are things people say:

Fine. As others have pointed out, and I hinted at, debating terminology is kind of a dead end. I personally don't expect that "understanding chess" is the same as "understanding Picasso", or that those phrases would mean the same thing applied to people as to AI. I'm also not personally that interested in how performance stacks up compared to humans. Even if it were interesting, the topic of human-equivalent performance would not have static expectations either. For example, human-equivalent error rates in AI are much easier for me to expect and forgive in robotics than they are in axiomatic game-play.

> I am interested in what their chess abilities teach us about how LLMs build world models

Focusing on the single datapoint that TFA is establishing: some LLMs can play some chess with some amount of expertise, and with some amount of errors. With no other information at all, this tells us that they failed to model the rules, or failed in the application of those rules, or both.

Based on that, some questions worth asking: Which of these failure modes is really acceptable, and in which circumstances? Does this failure mode apply to domains other than chess? Does it help if we give it the model directly, say by explaining the rules in the prompt and explicitly stating not to make illegal moves? If it's failing to apply rules but excels as a model-maker, then perhaps it can spit out a model directly from examples, and then I can feed that model into a separate engine that makes correct, deterministic steps that actually honor the model.

Saying that LLMs do or don't understand chess is lazy I guess. My basic point is that the questions above and their implications are so huge and sobering that I'm very uncomfortable with premature congratulations and optimism that seems to be in vogue. Chess performance is ultimately irrelevant of course, as you say, but what sits under the concrete question is more abstract but very serious. Obviously it is dangerous to create tools/processes that work "most of the time", especially when we're inevitably going to be giving them tasks where we can't check or confirm "legal moves".


> I think you don't appreciate how good the level of chess displayed here is. It would take an average adult years of dedicated practice to get to 1800.

Since we already have programs that can do this, that definitely aren’t really thinking and don’t “understand” anything at all, I don’t see the relevance of this part.


It seems you're shifting the discourse here. In the context of LLMs, "to understand" is short for "to have a world model beyond the pure relations between words". In that sense, chess engines do "understand" chess, as they operate on a world model. You can even say that they don't understand anything but chess, which makes them extremely un-intelligent and definitely not capable of understanding as we mean it.

However, since an LLM is a generalist engine, if it understands chess there is no reason for it not to understand millions of other concepts and how they relate to each other. And this is the kind of understanding that humans do.


I hate the use of words like "understand" in these conversations.

The system understands nothing, it's anthropomorphising it to say it does.


I have the same conclusion, but for the opposite reason.

It seems like many people tend to use the word "understand" to mean not only that someone believes that a given move is good, but also that this knowledge comes from a rational evaluation.

Some attribute this to a non-material soul/mind, some to quantum mechanics or something else that seems magic, while others never realized the problem with such a belief in the first place.

I would claim that when someone can instantly recognize good moves in a given situation, it doesn't come from rationality at all, but from some mix of memory and an intuition that has been built by playing the game many times, with only tiny elements of actual rational thought sprinkled in.

This even holds true when these people start to calculate. It is primarily their intuition that prevents them from spending time on all sorts of unlikely moves.

And this intuition, I think, represents most of their real "understanding" of the game. This is quite different from understanding something like a mathematical proof, which is almost exclusively deductive logic.

And since "understand" is so often associated with rational deductive logic, I think the proper term would be to have "good intuition" when playing the game.

And this "good intuition" seems to me precisely the kind of thing that is trained within most neural nets, even LLM's. (Q*, AlphaZero, etc also add the ability to "calculate", meaning traverse the search space efficiently).

If we wanted to measure how good this intuition is compared to human chess intuition, we could limit an engine like AlphaZero to only evaluate the same number of moves per second that good humans would be able to, which might be around 10 or so.

Maybe with this limitation, the engine wouldn't currently be able to beat the best humans, but even if it reaches a rating of 2000-2500 this way, I would say it has a pretty good intuitive understanding.


Trying to appropriate perfectly generalizable terms as "something that only humans do" brings zero value to a conversation. It's essentially a "god of the gaps" argument, and we don't exactly have a great track record of correctly identifying things that are uniquely human.

There's very literally currently a whole wealth of papers proving that LLMs do not understand, cannot reason, and cannot perform basic kinds of reasoning that even a dog can perform. But, ok.

There's a whole wealth of papers proving that LLMs do not understand the concepts they write about. That doesn't mean they don't understand grammar – which (as I've claimed since the GPT-2 days) we should, theoretically, expect them to "understand". And what is chess, but a particularly sophisticated grammar?

There's very literally currently a whole wealth of papers proving the opposite, too, so ¯\_(ツ)_/¯.

The whole point of this exercise is to understand what "understand" even means. Because we really don't have a good definition for this, and until we do, statements like "the system understands nothing" are vacuous.

Pretty sure an Elo 1200 player will only give legal moves. It's really not hard to make legal moves in chess.

Casual players make illegal moves all the time. The problem isn't knowing how the pieces move. It's that it's illegal to leave your own king in check. It's not so common to accidentally move your king into check, though I'm sure it happens, but it's very common to accidentally move a piece that was blocking an attack on your king.

I would tend to agree that there's a big difference between attempting to make a move that's illegal because of the state of a different region of the board, and attempting to make one that's illegal because of the identity of the piece being moved, but if your only category of interest is "illegal moves", you can't see that difference.

Software that knows the rules of the game shouldn't be making either mistake.


Casual players don’t make illegal moves so often that you have to assign them a random move after 10 goes.

I think at this point it’s very clear LLMs aren’t achieving any form of “reasoning” as commonly understood. Among other factors, it can be argued that true reasoning involves symbolic logic and abstractions, and LLMs are next token predictors.

I don't want to say that LLMs can reason, but this kind of argument always feels too shallow to me. It's kind of like saying that bats cannot possibly fly because they have no feathers, or that birds cannot have higher cognitive functions because they have no neocortex. (The latter was an actual longstanding belief in science, disproven only a decade or so ago.)

The "next token prediction" is just the API, it doesn't tell you anything about the complexity of the thing that actually does the prediction. (In think there is some temptation to view LLMs as glorified Markov chains - they aren't. They are just "implementing the same API" as Markov chains).

There is still a limit how much an LLM could reason during prediction of a single token, as there is no recurrence between layers, so information can only be passed "forward". But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration.

I think this structure makes it quite hard to really say how much reasoning is possible.
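
To make the recurrence point concrete: the loop being described is just ordinary sampling, and any information the model wants to carry forward has to be written into the emitted tokens. A sketch, where predict_next stands in for a single feed-forward pass:

    def generate(predict_next, prompt_tokens, max_new=256, eos=None):
        context = list(prompt_tokens)
        for _ in range(max_new):
            token = predict_next(context)   # one forward pass, no recurrence inside it
            context.append(token)           # whatever was emitted comes back as input
            if token == eos:
                break
        return context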


I agree with most of what you said, but “LLM can reason” is an insanely huge claim to make and most of the “evidence” so far is a mixture of corporate propaganda, “vibes”, and the like.

I’ve yet to see anything close to the level of evidence needed to support the claim.


To say any specific LLM can reason is a somewhat significant claim.

To say LLMs as a class are architecturally able to be trained to reason is - in the complete absence of evidence to suggest humans can compute functions outside the Turing computable set - effectively only an argument that they can implement a minimal Turing machine, given that the context is used as IO. Given the size of the rules needed to implement the smallest known Turing machines, it'd take a really tiny model for them to be unable to.

Now, you can then argue that it doesn't "count" if it needs to be fed a huge program step by step via IO, but if it can do something that way, I'd need some really convincing evidence for why the static elements those steps could not progressively be embedded into a model.


No such evidence exists: we can construct such a model manually. I'd need some quite convincing evidence that any given training process is approximately equivalent to that, though.

That's fine. I've made no claim about any given training process. I've addressed the annoying repetitive dismissal via the "but they're next token predictors" argument. The point is that being next token predictors does not cap their theoretical capabilities, so it's a meaningless argument.

The architecture of the model does place limits on how much computation can be performed per token generated, though. Combined with the window size, that's a hard bound on computational complexity that's significantly lower than a Turing machine – unless you do something clever with the program that drives the model.

Hence the requirement for using the context for IO. A Turing machine requires two memory "slots" (the position of the read head, and the current state) + IO and a loop. That doesn't require much cleverness at all.

Then say "no one has demonstrated that LLMs can reason" instead of "LLMs can't reason, they're just token predictors". At least that would be intellectually honest.

By that logic isn't it "intellectually dishonest" to say "dowsing rods don't work" if the only evidence we have is examples of them not working?

Not really. We know enough about how the world to know that dowsing rods have no plausible mechanism of action. We do not know enough about intelligence/reasoning or how brains work to know that LLMs definitely aren't doing anything resembling that.

"LLM can reason" is trivially provable - all you need to do is give it a novel task (e.g. a logical puzzle) that requires reasoning, and observe it solving that puzzle.

How do you intend to show your task is novel?

"Novel" here simply means that the exact sequence of moves that is the solution cannot possibly be in the training set (mutatis mutandis). You can easily write a program that generates these kinds of puzzles at random, and feed them to the model.

It's largely dependent on what we think "reason" means, is it not? That's not a pro argument from me, in my world LLMs are stochastic parrots.

> But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration.

Now consider that you can trivially show that you can get an LLM to "execute" one step of a Turing machine where the context is used as an IO channel, and you will have shown it to be Turing complete.

> I think this structure makes it quite hard to really say how much reasoning is possible.

Given the above, I think any argument that they can't be made to reason is effectively an argument that humans can compute functions outside the Turing computable set, which we haven't the slightest shred of evidence to suggest.


It's kind of ridiculous to say that functions computable by Turing machines are the only ones that can exist (and that trained LLMs are Turing machines).

What evidence do you have for either of these, since I don't recall any proof that "functions computable by Turing machines" equals the set of functions that can exist. And I don't recall pretrained LLMs being proven to be Turing machines.


We don't have hard evidence that no other functions exist that are computable, but we have no examples of any such functions, and no theory for how to even begin to formulate any.

As it stands, Church, Turing, and Kleene have proven that the set of generally recursive functions, the lambda calculus, and the Turing computable set are equivalent, and no attempt to categorize computable functions outside those sets has succeeded since.

If you want your name in the history books, all you need to do is find a single function that humans can compute that is outside the Turing computable set.

As for LLMs, you can trivially test that they can act like a Turing machine if you give them a loop and use the context to provide access to IO: Turn the temperature down, and formulate a prompt to ask one to follow the rules of the simplest known Turing machine. A reminder that the simplest known Turing machine is a 2-state, 3-symbol Turing machine. It's quite hard to find a system that can carry out any kind of complex function that can't act like a Turing machine if you allow it to loop and give it access to IO.
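
The driver needed for that experiment really is tiny. A sketch: the step function could be a plain 6-entry rule table, or a wrapper that prompts an LLM at temperature 0 with the rules and parses back "write X, move L/R, new state Y". The example_rules entries below are placeholders showing the shape of a 2-state/3-symbol table, not Wolfram's published rules.

    from collections import defaultdict

    def run_turing_machine(step, steps=1000, start_state="A"):
        tape = defaultdict(int)              # blank symbol is 0, tape unbounded both ways
        head, state = 0, start_state
        for _ in range(steps):
            write, move, state = step(state, tape[head])
            tape[head] = write
            head += 1 if move == "R" else -1
        return tape

    # Illustrative 2-state / 3-symbol rule table (entries are placeholders):
    example_rules = {("A", 0): (1, "R", "B"), ("A", 1): (2, "L", "A"), ("A", 2): (1, "L", "A"),
                     ("B", 0): (2, "L", "A"), ("B", 1): (2, "R", "B"), ("B", 2): (0, "R", "A")}
    run_turing_machine(lambda s, sym: example_rules[(s, sym)], steps=50)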


> Among other factors, it can be argued that true reasoning involves symbolic logic and abstractions, and LLMs are next token predictors.

I think this is circular?

If an LLM is "merely" predicting the next tokens to put together a description of symbolic reasoning and abstractions... how is that different from really exercising those things?

Can you give me an example of symbolic reasoning that I can't handwave away as just the likely next words given the starting place?

I'm not saying that LLMs have those capabilities; I'm questioning whether there is any utility in distinguishing the "actual" capability from identical outputs.


It is. As it stands, throw a loop around an LLM and let the context act as the tape, and an LLM can obviously be made Turing complete (you can get it to execute all the steps of a minimal Turing machine, so drop the temperature so it's deterministic, and you have a Turing complete system). To argue that they can't be made to reason is effectively to argue that there is some unknown aspect of the brain that allows us to compute functions not in the Turing computable set, which would be an astounding revelation if it could be proven. Until someone comes up with evidence for that, it is more reasonable to assume that it is a question of whether we have yet found a training mechanism that can lead to reasoning, not whether or not LLMs can learn to.

It doesn’t follow that because a system is Turing complete the approach being used will eventually achieve reasoning.

No, but that was also not the claim I made.

The point is that, as the person I replied to brought up, dismissing LLMs as "next token predictors" is meaningless: they can be both next token predictors and Turing complete, and unless reasoning requires functions outside the Turing computable set (we know of no way of constructing such functions, or of any way for them to exist), calling them "next token predictors" says nothing about their capabilities.


Mathematical reasoning is the most obvious area where it breaks down. This paper does an excellent job of proving this point with some elegant examples: https://arxiv.org/pdf/2410.05229

Sure, but people fail at mathematical reasoning. That doesn't mean people are incapable of reasoning.

I'm not saying LLMs are perfect reasoners, I'm questioning the value of asserting that they cannot reason with some kind of "it's just text that looks like reasoning" argument.


The idea is that the average person would, sure. A mathematically oriented person would fare far better.

Throw all the math problems you want at a LLM for training; it will still fail if you step outside of the familiar.


> it will still fail if you step outside of the familiar.

To which I say:

ᛋᛟ᛬ᛞᛟ᛬ᚻᚢᛗᚪᚾᛋ


ᛒᚢᛏ ᚻᚢᛗᚪᚾ ᚻᚢᛒᚱᛁᛋ ᛈᚱᛖᚹᛖᚾᛏ ᚦᛖᛗ ᚠᚱᛟᛗ ᚱᛖᚪᛚᛁᛉᛁᚾᚷ ᚦᚻᚪᛏ

ᛁᚾᛞᛖᛖᛞ᛬ᛁᛏ᛬ᛁᛋ᛬ᚻᚢᛒᚱᛁᛋ

ᛁ᛬ᚻᚪᚹᛖ᛬ᛟᚠᛏᛖᚾ᛬ᛋᛖᛖᚾ᛬ᛁᚾ᛬ᛞᛁᛋᚲᚢᛋᛋᛁᛟᚾᛋ᛬ᛋᚢᚲ᛬ᚪᛋ᛬ᚦᛁᛋ᛬ᚲᛚᚪᛁᛗᛋ᛬ᚦᚪᛏ᛬ᚻᚢᛗᚪᚾ᛬ᛗᛁᚾᛞᛋ᛬ᚲᚪᚾ᛬ᛞᛟ᛬ᛁᛗᛈᛟᛋᛋᛁᛒᛚᛖ᛬ᚦᛁᛝᛋ᛬ᛋᚢᚲ᛬ᚪᛋ᛬ᚷᛖᚾᛖᚱᚪᛚᛚᚣ᛬ᛋᛟᛚᚹᛖ᛬ᚦᛖ᛬ᚻᚪᛚᛏᛁᛝ᛬ᛈᚱᛟᛒᛚᛖᛗ

edit: Snap, you said the same in your other comment :)


Switching back to latin letters...

It seems to me that the idea of the Universal Turing Machine is quite misleading for a lot of people, such as David Deutsch.

My impression is that the amount of compute needed to solve most problems that can really only be solved by Turing Machines is always going to remain inaccessible (unless they're trivially small).

But at the same time, the universe seems to obey a principle of locality (as long as we only consider the Quantum Wave Function, and don't postulate that it collapses).

Also, the quantum fields are subject to some simple (relative to LLMs) geometric symmetries, such as invariance under the U(1)xSU(2)xSU(3) group.

As it turns out, similar group symmetries can be found in all sorts of places in the real world.

Also it seems to me that at some level, both ANN's and biological brains set up a similar system to this physical reality, which may explain why brains develop this way and why both kinds are so good at simulating at least some aspects of the physical world, such as translation, rotation, some types of deformation, gravity, sound, light etc.

And when biological brains that initially developed to predict the physical world are then used to create language, that language is bound to use the same type of machinery. And this may be why LLMs do language so well with a similar architecture.


There are no problems that can be solved only by Turing Machines as any Turing complete system can simulate any other Turing complete system.

The point of UTM's is not to ever use them, but that they're a shortcut to demonstrating Turing completeness because of their simplicity. Once you've proven Turing completeness, you've proven that your system can compute all Turing computable functions and simulate any other Turing complete system, and we don't know of any computable functions outside this set.


When I wrote Turing Machine, I meant it as shorthand for Turing complete system.

My point is that any such system is extremely limited due to how slow it becomes at scale (when running algorithms/programs that require full Turing completeness), due to its "single threaded" nature. Such algorithms simply are not very parallelizable.

This means a Turing Complete system becomes nearly useless for things like AI. The same is the case inside a human brain, where signals can only travel at around the speed of sound.

Tensor / neuron based systems sacrifice Turing Completeness to gain (many) orders of magnitude more compute speed.

I know that GPUs CAN in principle emulate a Turing Complete system, but they're even slower at it than CPUs, so that's irrelevant. The same goes for human brains.

People like Deutsch are so in love with the theoretical universality of Turing Completeness that they seem to ignore that a Turing Complete system might take longer to formulate a meaningful thought than the lifetime of a human. And possibly longer than the lifetime of the Sun, for complex ideas.

The fact that so much can be done by systems that are NOT Turing Complete may seem strange. But I argue that since the laws of physics are local (with laws described by tensors), it should not be such a surprise that computers that perform tensor computations are pretty good at simulating physical reality.


> My point is that any such system is extremely limited due to how slow it becomes at scale (when running algorithms/programs that require full Turing completeness), due to its "single threaded" nature. Such algorithms simply are not very parallelizable.

My point is that this isn't true. Every computer you've ever used is a Turing complete system, and there is no need for such a system to be single-threaded, as a multi-threaded system can simulate a single-threaded system and vice versa.

> I know that GPUs CAN in principle emulate a Turing Complete system, but they're even slower at it than CPUs, so that's irrelevant. The same goes for human brains.

Any system that can emulate any other Turing complete system is Turing complete, so they are Turing complete.

You seem to confuse Turing completeness with a specific way of executing something. Turing completeness is about the theoretical limits on which set of functions a system can execute, not how they execute them.


> My point is that this isn't true. Every computer you've ever used is a Turing complete system, and there is no need for such a system to be single-threaded, as a multi-threaded system can simulate a single-threaded system and vice versa.

Not all algorithms can be distributed effectively across multiple threads.

A computer can have 1000 cpu cores, and only be able to use a single one when running such algorithms.

Some other algorithms may be distributed through branch prediction, by trying to run future processing steps ahead of time for each possible outcome of the current processing step. In fact, modern CPUs already do this a lot to speed up processing.

But even branch prediction hits something like a logarithmic wall of diminishing returns.

While you are right that multi core CPU's (or whole data centers) can run such algorithms, that doesn't mean they can run them quickly, hence my claim:

>> My point is that any such system is extremely limited due to how slow

Algorithms that can only utilize a single core seem to be stuck at the GFLOPS scale, regardless of what hardware they run on.

Even if only a small percentage (like 5%) of the code in a program is inherently limited to being single threaded (At best, you will achieve TFlops numbers), this imposes a fatal limitation on computational problems that require very large amounts of computing power. (For instance at the ExaFlop scale or higher.)

THIS is the flaw of the Turing Completeness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow.

So if you want to do calculations that require, say, 1 ExaFlop (about the raw compute of the human brain) to be fast enough for a given purpose, you need to make almost all compute steps fully parallelizable.

Now that you've restricted your algorithm to no longer require all features of a Turing Complete system, you CAN still run it on Turing Complete CPU's. You're just not MAKING USE of their Turing Completeness. That's just very expensive.

At this point, you may as well build dedicated hardware that do not have all the optimizations that CPU have for single threaded computation, like GPU's or TPU's, and lower your compute cost by 10x, 100x or more (which could be the difference between $500 billion and $billion).

At this point, you've left the Turing Completeness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware.

One way to describe this, is that from the space of all possible algorithms that can run on a Turing Complete system, you've selected a small sub-space of algorithms that can be parallelized.

By doing this trade, you've severely limited what algorithms you can run, in exchange for the ability to get a speedup of 6 orders of magnitude, or more in many cases.

And in order to keep this speedup, you also have to accept other hardware based limitations, such as staying within the amount of GPU memory available, etc.

Sure, you COULD train GPT-5 or Grok-3 on a C-64 with infinite casette tape storage, but it would take virtually forever for the training to finish. So that fact has no practical utility.

I DO realize that the concept of the equivalence of all Turing Complete systems is very beautiful. But this CAN be a distraction, and lead to intuitions that seem to me to be completely wrong.

Like Deutsch's idea that the ideas in a human brain are fundamentally linked to the brain's Turing Completeness. While in reality, it takes years of practice for a child to learn how to be Turing Complete, and even then the child's brain will struggle to do a floating point calculation every 5 minutes.

Meanwhile, joint systems of algorithms and the hardware they run on can do very impressive calculations when ditching some of the requirements of Turing Completeness.


> Not all algorithms can be distributed effectively across multiple threads.

Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system.

> THIS is the flaw of the Turing Completeness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow.

The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port.

It's not about whether the algorithms require a Turing complete system, but that Turing completeness proves the equivalence of the upper limit of which set of functions the architecture can compute, and that pretty much any meaningful architecture you will come up with is still Turing complete.

> At this point, you've left the Turing Completeness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware.

If a system can take 3 bits of input and use it to look up 5 bits of output in a table of 30 bits of data, and it is driven by a driver that uses 1 bit of the input as the current/new state, 1 bit for the direction to move the tape, and 3 bits for the symbol to read/write, and that driver processes the left/right/read/write tape operations and loops back, you have Turing complete system (Wolfram's 2-state 3-symbol Turing machine).

So no, you have not left Turing completeness behind, as any function that can map 3 bits of input to 5 bits of output becomes a Turing complete system if you can put a loop and IO mechanism around it.

Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial.


> Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system.

I know. That is part of my claim that this "talking about Turing completeness" is a distraction. Specifically because it ignores efficiency/speed.

> Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial.

And again, I KNOW that it's utterly trivial to create a Turing Complete system. I ALSO know that a Turing Complete system can perform ANY computation (it pretty much defines what a computation is), given enough time and memory/storage.

But if such a calculation takes 10^6 times longer than necessary, it's also utterly useless to approach it in this way.

Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.

> The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port.

This model is intrinsically single threaded, so the global branches/dependencies requirement is trivially satisfied.

Generally, though, if you want to be able to distribute a computation, you have to pretty much never allow the results of a computation of any arbitrary compute thread to affect the next computation on any of the other threads.

NOBODY would be able to train LLM's that are anything like the ones we see today, if they were not willing to make that sacrifice.

Also, downstream from this is the hardware optimizations that are needed to even run these algorithms. While you _could_ train any of the current LLM's on large CPU clusters, a direct port would require perhaps 1000x more hardware, electricity, etc than running it on GPU's or TPU's.

Not only that, but if the networks being trained (+ some amount of training data) couldn't fit into the fast GPU/TPU memory during training, but instead had to be swapped in and out of system memory or disk, then that would also cause orders of magnitude of slowdown, even if using GPU/TPU's for the training.

In other words, what we're seeing is a trend towards ever increasing coupling between algorithms being run and the hardware they run on.

When I say that thinking in terms of Turing Completeness is a distraction, it doesn't mean it's wrong.

It's just irrelevant.


> NOBODY would be able to train LLM's that are anything like the ones we see today, if they were not willing to make that sacrifice.

Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice, is the point. Because Turing completeness does not mean all, or most, or even any of your computations need to be carried out like in a UTM. It only means it needs the theoretical ability. They can take any set of shortcuts you want.


> Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice, is the point.

I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant.

> It only means it needs the theoretical ability.

This is absolutely incorporated in my previous post. Which is why I wrote:

>> Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.

> It only means it needs the theoretical ability. They can take any set of shortcuts you want.

I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts.

Only by staying within the subset of all possible algorithms that CAN be properly parallelized (and having the proper hardware to run it) can you perform the number of calculations needed to train something like an LLM.

> Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice,

Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM.

On the other hand, if you make a cluster to train LLM's that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single cuda run activated, for instance), it can still be just as good at it's dedicated task (training LLM)s.

Another example would be the brain of a new-born baby. I'm pretty sure such a brain is NOT Turing Complete in any way. It has a very short list of training algorithms that are constanly running as it's developing.

But it can't even run Hello World.

For it to really be Turing Complete, it needs to be able to follow instructions accurately (no hallucinations, etc.) and also have access to infinite storage/tape (or it will be a Finite State Machine). Again, it still doesn't matter if it's Turing Complete in this context.


> I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant.

Why do you think it is irrelevant? It is what allows us to say with near certainty that dismissing the potential of LLMs to be made to reason is unscientific and irrational.

> I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts.

But again, we've not sacrificed the ability to run those.

> Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM.

Maybe or maybe not, because today inference is expensive, but we already are running plenty of algorithms that require many runs, and steadily increasing as inference speed relative to network size is improving.

> On the other hand, if you make a cluster to train LLM's that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single cuda run activated, for instance), it can still be just as good at it's dedicated task (training LLM)s.

And? The specific code used to run training has no relevance to the limitations of the model architecture.


See my other response above, I think I've identified what part of my argument was unclear.

The update may still have claims in it that you disagree with, but those are specific and (at some point in the future) probably testable.


First I would like to thank you for being patient with me. After some contemplation, I think I've identified what aspect of my argument hasn't been properly justified, which causes this kind of discussion.

Let's first define C as the set of all algorithms that are computable by any Turing complete system.

The main attractive feature of Turing Completeness is specifically this universality. You can take an algorithm running on one Turing Complete system and port it to another, with some amount of translation work (often just a compiler)

Now let's define the subset of all algorithms in C that we are not able to properly parallelize; let's label it U. (U is a subset of C.)

The complementary subset of C, that CAN be parallelized properly we label P (P is also a subset of C).

Now define algorithms that require a lot of compute (>= 10^20 steps or so) as L. L is also a subset of C.

The complementary ("small" computations) can be labelled S (< 10^20 steps, though the exact cutoff is a bit arbitrary).

Now we define the intersections of S, L, U, P:

(Edit: changed union to intersect)

S_P (intersect of S and P)

L_P (intersect of L and P)

S_U (intersect of S and U)

L_U (intersect of L and U)

For S_P and S_U, the advantages of Turing Completeness remains. L_U is going to be hard to compute on any Turing Complete system.

(Edit: The mistake in my earlier argument was to focus on L_U. L_U is irrelevant for the relevance of the universality of Turing Complete systems, since no such system can run such calculations in a reasonable amount of time, anyway. To run algorithms in the L_U domain would require either some fundamental breakthrough in "single threaded" performance, Quantum Computing or some kind of magic/soul, etc)

This leaves us with L_P: computations/algorithms that CAN be parallelized, at least in principle. I will only discuss these from here on.

My fundamental claim is as follows:

While algorithms/computations that belong to the L_P set ARE in theory computable on any Turing Complete system, how long it takes to compute them can vary so much between different Turing Complete systems that this "universality" stops having practical relevance.

For instance, let's say two computers K1 and K2 can run one such algorithm (lp_0, for instance) at the same speed. But on other algorithms (lp_1 and lp_2) the difference between how fast those system can run the computation can vary by a large factor (for instance 10^6), often in both directions.

Let's say lp_1 is 10^6 times faster on K1 than on K2, while lp_2 is 10^6 faster on K2 than on K1.

(Edit: note that lp_2 will take (10^6)^2 = 10^12 times longer than lp_1 on K1)

While both these algorithms are in theory computable on both K1 and K2, this is now of very little practical importance. You always want to run lp_1 on K1 and lp_2 on K2.

Note that I never say that either K1 or K2 are (Edit) not Turing complete. But the main attraction of Turing Completeness is now of no practical utility, namely the ability to move an algorithm from one Turing Complete system to another.

Which also means that what you really care about is not Turing Completeness at all. You care about the ability to calculate lp_1 and lp_2 within a reasonable timeframe, days to years, not decades to beyond eons.

And this is why I'm claiming that this is a paradigm shift. The Turing Completeness ideas were never wrong, they just stopped being useful in the kind of computations that are getting most of the attention now.

Instead, we're moving into a reality where computers are to an ever greater degree specialized for a single purpose, while the role of general purpose computers is fading.

And THIS is where I think my criticism of Deutsch is still accurate. Or rather, if we assume that the human brain belongs to the L_P set and strongly depends on its hardware for doing what it is doing, this creates a strong constraint on the types of hardware that the human mind could conceivably be uploaded to.

And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe.

And I think where Deutsch goes wrong, is that he thinks that our conscious thoughts are where our real cognition is going on. My intuition is that while our internal monologue operates at around 1-100 operations per second, our subconscious is requiring in the range of gigaflops to exaflops for our minds to be able to operate in real time.


So, the reason I think the argument on Turing completeness matters here is this: if we accept that an LLM and a brain are both Turing complete, then, while you're right that Turing complete systems can differ so much in performance characteristics that some are entirely impractical (Turing's original UTM is an example of a Turing complete system that is too slow for practical use), the brain is then both an existence proof for the ability of Turing machines to be made to reason, and it gives us an upper limit in terms of volume and power needed to achieve human-level reasoning.

It may take us a long time to get there (it's possible we never will), and it may take significant architectural improvements (so it's not a given that current LLM architectures can compete on performance), but if both are Turing complete (and not more), then accepting that humans can reason while dismissing the possibility of LLMs ever doing so doesn't hold up.

It means those who dismiss LLM's as "just" next token predictors assuming that this says something about the possibility of reasoning don't have a leg to stand on. And this is why the Turing completeness argument matters to me. I regularly get in heated arguments with people who get very upset at the notion that LLMs can possibly ever reason - this isn't a hypothetical.

> And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe.

If you mean "act as" in the sense of following operations of a Turing-style UTM with tape, then sure, that will be impractically slow for pretty much everything. Our ability to do so (and the total lack of evidence that we can do anything which exceeds Turing completeness) just serves as a simple way of proving we are Turing complete. In practice, we do most things in far faster ways than simulating a UTM. But I agree with you that I don't think humans can compete with computers in the long run irrespective of the type of problem.


Ok, so it looks like you think you've been arguing against someone who doubts that LLMs (and similar NNs) can match the capabilities of humans. In fact, I'm probably on the other side from you compared to them.

Now let's first look at how LLM's operate in practice:

Current LLM's will generally run on some compute cluster, often with some VM layer (and sometimes maybe barebone), followed by an OS on each node, and then Torch/TensorFlow etc to synchronize them.

It doesn't affect the general argument if we treat the whole inference system (the training system is similar) as one large Turing Complete system.

Since the LLM's have from billions to trillions of weights, I'm going to assume that for each token produced by the LLM it will perform 10^12 FP calculations.

Now, let's assume we want to run the LLM itself as a Turing Machine. Kind of like a virtual machine INSIDE the compute cluster. A single floating point multiplication may require in the order of 10^3 tokens.

In other words, by putting 10^15 floating point operations in, we can get 1 floating point operation out.

Now this LLM COULD run any other LLM inside it (if we chose to ignore memory limitations). But it would take at minimum in the order of 10^15 times longer to run than the first LLM.

My model of the brain is similar. We have a substrate (the physical brain) that runs a lot of computation, one tiny part of that is the ability that trained adults can get to perform any calculation (making us Turing Complete).

But compared to the number of raw calculations required by the substrate, our capability to perform universal computation is maybe 1 : 10^15, like the LLM above.

Now, I COULD be wrong in this. Maybe there is some way for LLM's to achieve full access to the underlying hardware for generic computation (if only the kinds of computations other computers can perform). But it doesn't seem that way for me, neither for current generation LLM's nor human brains.

Also, I don't think it matters. Why would we build an LLM to do the calculations when it's much more efficient to build hardware specifically to perform such computations, without the hassle of running it inside an LLM?

The exact computer that we run the LLM (above) on would be able to load other LLM's directly instead of using an intermediary LLM as a VM, right?

It's still not clear to me where this is not obvious....

My speculation, though, is that there is an element of sunk cost fallacy involved. Specifically for people my (and I believe) your age that had a lot of our ideas about these topics formed in the 90s and maybe 80s/70s.

Go back 25+ years, and I would agree to almost everything you write. At the time computers mostly did single threaded processing, and naïve extrapolation might indicate that the computers of 2030-2040 would reach human level computation ability in a single thread.

In such a paradigm, every computer of approximately comparable total power would be able to run the same algorithms.

But that stopped being the case around 10 years ago, and the trend continues in the direction of purpose-specific hardware taking over from general-purpose machines.

Edit: To be specific, the sunk cost fallacy enters here because people have had a lot of clever ideas that depend on the principle of Turing completeness, like the ability to easily upload minds to computers, or thinking of our mind as a bare-metal computer (not like an LLM, but more like KVM or a blank slate) into which we can plug any kind of culture, memes, etc.


People can communicate each step, and review each step as that communication is happening.

LLMs must be prompted for everything and don’t act on their own.

The value in the assertion is in preventing laymen from seeing a statistical guessing machine be correct and assuming that it always will be.

It’s dangerous to put so much faith in what in reality is a very good guessing machine. You can ask it to retrace its steps, but it’s just guessing at what its steps were, since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.


> since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.

Can you elaborate on the difference? Are you bringing sentience into it? It kind of sounds like it from "don't act on their own". But reasoning and sentience are wildly different things.

> It’s dangerous to put so much faith in what in reality is a very good guessing machine

Yes, exactly. That's why I think it is good we are supplementing fallible humans with fallible LLMs; we already have the processes in place to assume that not every actor is infallible.


So true. People who argue that we should not trust/use LLMs because they sometimes get it wrong are holding them to a higher standard than people -- we make mistakes too!

Do we blindly trust or believe every single thing we hear from another person? Of course not. But hearing what they have to say can still be fruitful, and it is not like we have an oracle at our disposal who always speaks the absolute truth, either. We make do with what we have, and LLMs are another tool we can use.


> Can you elaborate on the difference?

They’ll fail in different ways than something that actually thinks (and doesn’t have some kind of major disease of the brain going on), and often smack in the middle of appearing to think.


> People can communicate each step, and review each step as that communication is happening.

Can, but don't by default. Just as LLMs can be asked for chain of thought, but the default for most users is just chat.

This behaviour of humans is why we software developers have daily standup meetings, version control, and code review.

> LLMs must be prompted for everything and don’t act on their own

And this is why we humans have task boards like JIRA, and quarterly goals set by management.


LLMs "don't act on their own" because we only reanimate them when we want something from them. Nothing stops you from wiring up an LLM to keep generating, and feeding it sensory inputs to keep it processing. In other words, that's a limitation of the harness we put them in, not of LLMs.

As for people communicating each step, we have plenty of experiments showing that it's pretty hard to get people to reliably report what they actually do as opposed to a rationalization of what they've actually done (e.g. split brain experiments have shown both your brain halves will happily lie about having decided to do things they haven't done if you give them reason to think they've done something)

You can categorically not trust people's reasoning about "why" they've made a decision to reflect what actually happened in their brain to make them do it.


A human brain in a vat doesn't act on its own, either.

Maybe I am not understanding the paper correctly, but it seems they tested "state of the art models" which is almost entirely composed of open source <27B parameter models. Mostly 8B and 3B models. This is kind of like giving algebra problems to 7 year olds to "test human algebra ability."

If you are holding up a 3B parameter model as an example of "LLM's can't reason" I'm not sure if the authors are confused or out of touch.

I mean, they do test 4o and o1-preview, but their performance is notably absent from the paper's conclusion.


It’s difficult to reproducibly test openai models, since they can change from under you and you don’t have control over every hyperparameter.

It would’ve been nice to see one of the larger llama models though.


The results are there, just hidden away in the appendix. Those models don't actually suffer drops on 4 of the 5 modified benchmarks. The one benchmark that does see drops not explained by margin of error is the one that adds "seemingly relevant but ultimately irrelevant information" to problems.

Those results are absent from the conclusion because the conclusion falls apart otherwise.


There isn’t much utility, but tbf the outputs aren’t identical.

One danger is the human assumption that, since something appears to have that capability in some settings, it will have that capability in all settings.

That's a recipe for exploding bias, as we’ve seen with classic statistical crime-detection systems.


Inferring patterns in unfamiliar problems.

Take a common word problem in a 5th grade math text book. Now, change as many words as possible; instead of two trains, make it two different animals; change the location to a rarely discussed town; etc. Even better, invent words/names to identify things.

Someone who has done a word problem like that will very likely recognize the logic, even if the setting is completely different.

Word tokenization alone should fail miserably.
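
As a throwaway illustration of the kind of surface-level rewrite being described (the template and all names here are invented):

    import random

    TEMPLATE = ("Two {animals} leave {town} at the same time, one moving at "
                "{v1} km/h and the other at {v2} km/h in the opposite direction. "
                "How far apart are they after {t} hours?")

    def perturbed_problem():
        # Same underlying logic as the classic trains problem, unfamiliar surface.
        return TEMPLATE.format(
            animals=random.choice(["zorbles", "wombats", "snarklings"]),
            town=random.choice(["Quixton", "Vleritz", "Smallbrook"]),
            v1=random.randint(3, 9), v2=random.randint(3, 9),
            t=random.randint(2, 6))

    print(perturbed_problem())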


I have noted over my life that a lot of problems end up being variations on solved problems from another, more familiar domain, but frustratingly they take a long time to solve before you realize this was just like that thing you had already solved. Nevertheless, I do feel like humans benefit from identifying meta-patterns, but as the chess example shows, even we might be weak in unfamiliar areas.

Learn how to solve one problem and apply the approach, logic, and patterns to different problems. In German that's called "Transferleistung" (roughly "transfer performance"), and it's a big thing at advanced schools. Or, at least, my teacher friends never stop talking about it.

We get better at it over time, as probably most of us can attest.


A lot of LLMs do weird things on the question "A farmer needs to get a bag of grain across a river. He has a boat that can transport himself and the grain. How does he do this?"

(they often pattern-match on the farmer/grain/sheep/fox puzzle and start inventing pointless trips ("the farmer returns alone. Then, he crosses again.") in a way that a human wouldn't)


What proof do you have that human reasoning involves "symbolic logic and abstractions"? In daily life, that is, not in a math exam. We know that people are actually quite bad at reasoning [1][2]. And it definitely doesn't seem right to define "reasoning" as only the sort that involves formal logic.

[1] https://en.wikipedia.org/wiki/List_of_fallacies

[2] https://en.wikipedia.org/wiki/List_of_cognitive_biases


Some very intelligent people, including Gödel and Penrose, seem to think that humans have some kind of ability to arrive directly at correct propositions in ways that bypass the incompleteness theorem. Penrose seems to think this could be due to quantum mechanics; Gödel may have thought it came from something divine.

While I think they're both wrong, a lot of people seem to think they can do abstract reasoning for symbols or symbol-like structures without having to use formal logic for every step.

Personally, I think such beliefs about concepts like consciousness, free will, qualia and emotions emerge from how the human brain includes a simplified version of itself when setting up a world model. In fact, I think many such elements are pretty much hard coded (by our genes) into the machinery that human brains use to generate such world models.

Indeed, if this is true, concepts like consciousness, free will, various qualia and emotions can in fact be considered "symbols" within this world model. While the full reality of what happens in the brain when we exercise what we represent by "free will" may be very complex, the world model may assign a boolean to each action we (and others) perform, where the action is either grouped into "voluntary action" or "involuntary action".

This may not always be accurate, but it saves a lot of memory and compute costs for the brain when it tries to optimize for the future. This optimization can (and usually is) called "reasoning", even if the symbols have only an approximated correspondence with physical reality.

For instance, if in our world model somebody does something against us and we deem that it was done exercising "free will", we will be much more likely to punish them than if we categorize the action as "forced".

And on top of these basic concepts within our world model, we tend to add a lot more, also in symbol form, to enable us to use symbolic reasoning to support our interactions with the world.


> While I think they're both wrong, a lot of people seem to think they can do abstract reasoning for symbols or symbol-like structures without having to use formal logic for every step.

Huh.

I don't know about the incompleteness theorem, but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything, they only painstakingly emulate it when forced to.

If anything, "next token prediction" seems much closer to how human thinking works than anything even remotely formal or symbolic that was proposed before.

As for hardcoding things in world models, one thing that LLMs do conclusively prove is that you can create a coherent system capable of encoding and working with the meaning of concepts without providing anything that looks like explicit "meaning". Meaning is not inherent to a term, or to the concept expressed by that term - it exists in the relationships between the concept and all other concepts.


> I don't know about the incompleteness theorem, but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything, they only painstakingly emulate it when forced to.

Indeed, this is one reason why I assert that Wittgenstein was wrong about the nature of human thought when writing:

"""If there were a verb meaning "to believe falsely," it would not have any significant first person, present indicative."""

Sure, it's logically incoherent for us to have such a word, but there seem to be several different ways for us to hold contradictory and incoherent beliefs within our minds.


> ... but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything ...

Yes. But some place too much confidence in how "rational" their intuition is, including some of the most intelligent minds the world has seen.

Specifically, many operate as if their intuition (that they treat as completely rational) has some kind of supernatural/magic/divine origin, including many who (imo) SHOULD know better.

Whereas I think (like you do) that this intuition has more in common with LLMs and other NN architectures than with pure logic, or even the scientific method.


> Some very intelligent people, including Gödel and Penrose, seem to think that humans have some kind of ability to arrive directly at correct propositions in ways that bypass the incompleteness theorem. Penrose seems to think this could be due to quantum mechanics; Gödel may have thought it came from something divine.

Did Gödel really say this? It sounds like quite a stretch of the incompleteness theorem.

It's like saying that because the halting problem is undecidable, yet humans can debug programs, human brains must have some supernatural power.


Gödel mostly cared about mathematics. And he seems to have believed that human intuition could "know" propositions to be true, even if they could not be proven logically[1].

It appears that he was religious and probably believed in an immaterial and maybe even divine soul [2]. If so, that may explain why he believed that human intuition could be unburdened by the incompleteness theorem.

[1] https://philsci-archive.pitt.edu/9154/1/Nesher_Godel_on_Trut...

[2] https://en.wikipedia.org/wiki/G%C3%B6del%27s_ontological_pro...


Does anyone have a hard proof that language doesn’t somehow encode reasoning in a deeper way than we commonly think?

I constantly hear people saying “they’re not intelligent, they’re just predicting the next token in a sequence”, and I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”, but I’ve seen enough surprising studies about the nature of free will and such that I no longer put a lot of stock in what seems “obvious” to me about how my brain works.


> I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”

I can't speak to whether LLMs can think, but current evidence indicates humans can perform complex reasoning without the use of language:

> Brain studies show that language is not essential for the cognitive processes that underlie thought.

> For the question of how language relates to systems of thought, the most informative cases are cases of really severe impairments, so-called global aphasia, where individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ...

> You can ask them to solve some math problems or to perform a social reasoning test, and all of the instructions, of course, have to be nonverbal because they can’t understand linguistic information anymore. ...

> There are now dozens of studies that we’ve done looking at all sorts of nonlinguistic inputs and tasks, including many thinking tasks. We find time and again that the language regions are basically silent when people engage in these thinking activities.

https://www.scientificamerican.com/article/you-dont-need-wor...


I'd say that's a separate problem. It's not "is the use of language necessary for reasoning?" which seems to be obviously answered "no", but rather "is the use of language sufficient for reasoning?".

> ..individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ...

The right hemisphere almost certainly uses internal 'language', either consciously or unconsciously, to define objects, actions, and intent; the fact that they passed these tests is evidence of that. The brain damage is simply stopping them from expressing that 'language'. But the existence of language was expressed in the completion of the task.


I think the question we're grappling with is whether token prediction may be more tightly related to symbolic logic than we all expected. Today's LLMs are so uncannily good at faking logic that it's making me ponder logic itself.

I felt the same way about a year ago, I’ve since changed my mind based on personal experience and new research.

Please elaborate.

I work in the LLM search space and echo OC’s sentiment.

The more I work with LLMs the more the magic falls away and I see that they are just very good at guessing text.

It’s very apparent when I want to get them to do a very specific thing. They get inconsistent about it.


Pretty much the same, I work on some fairly specific document retrieval and labeling problems. After some initial excitement I’ve landed on using LLM to help train smaller, more focused, models for specific tasks.

Translation is a task I’ve had good results with, particularly mistral models. Which makes sense as it’s basically just “repeat this series of tokens with modifications”.

The closed models are practically useless from an empirical standpoint as you have no idea if the model you use Monday is the same as Tuesday. “Open” models at least negate this issue.

Likewise, I’ve found LLM code to be of poor quality. I think that has to do with my being a very experienced and skilled programmer. What the LLM produces is at best the top answer in stack overflow-level skill. The top answers on stack overflow are typically not optimal solutions; they are solutions upvoted by novices.

I find LLM code is not only bad, but when I point this out the LLM then “apologizes” and gives better code. My worry is that inexperienced people can’t even spot the problem and so won’t get that better answer.

In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”


There have even been times where an LLM will spit out _the exact same code_ and you have to give it the answer or a hint how to do it better

Yeah. I had the same experience doing code reviews at work. Sometimes people just get stuck on a problem and can't think of alternative approaches until you give them a good hint.

> I’ve found LLM code to be of poor quality

Yes. That was my experience with most human-produced code I ran into professionally, too.

> In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”

Yes, that sometimes works with humans as well. Although you usually need to provide more specific feedback to nudge them in the right track. It gets tiring after a while, doesn't it?


What is the point of your argument?

I keep seeing people say “yeah well I’ve seen humans that can’t do that either.”

What’s the point you’re trying to make?


The point is that the person I responded to criticized LLMs for making the exact sort of mistakes that professional programmers make all the time:

> I’ve found LLM code to be of poor quality. I think that has to do with my being a very experienced and skilled programmer. What the LLM produces is at best the top answer in stack overflow-level skill. The top answers on stack overflow are typically not optimal solutions

Most professional developers are unable to produce code up to the standard of "the top answer in stack overflow" that the commenter was complaining about, with the additional twist that most developers' breadth of knowledge is going to be limited to a very narrow range of APIs/platforms/etc. whereas these LLMs are able to be comparable to decent programmers in just about any API/language/platform, all at once.

I've written code for thirty years and I wish I had the breadth and depth of knowledge of the free version of ChatGPT, even if I can outsmart it in narrow domains. It is already very decent and I haven't even tried more advanced models like o1-preview.

Is it perfect? No. But it is arguably better than most programmers in at least some aspects. Not every programmer out there is Fabrice Bellard.


But LLMs aren’t people. And people do more than just generate code.

The comparison is weird and dehumanizing.

I, personally, have never worked with someone who consistently puts out code that is as bad as LLM generated code either.

> Most professional developers are unable to produce code up to the standard of "the top answer in stack overflow"

How could you possibly know that?

All these types of arguments come from a belief that your fellow human is effectively useless.

It’s sad and weird.


>> > Most professional developers are unable to produce code up to the standard of "the top answer in stack overflow"

> How could you possibly know that?

I worked at four multinationals and saw a bunch of their code. Most of it wasn't "the top answer in stack overflow". Was some of the code written by some of the people better than that? Sure. And a lot of it wasn't, in my opinion.

> All these types of arguments come from a belief that your fellow human is effectively useless.

Not at all. I think the top answers in stack overflow were written by humans, after all.

> It’s sad and weird.

You are entitled to your own opinion, no doubt about it.


> In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”

These are called "code reviews" and we do that amongst human coders too, although they tend to be less Socratic in nature.

I think it has been clear from day one that LLMs don't display superhuman capabilities, and a human expert will always outdo one in tasks related to their particular field. But the breadth of their knowledge is unparalleled. They're the ultimate jacks-of-all-trades, and the astonishing thing is that they're even "average Joe" good at a vast number of tasks, never mind "fresh college graduate" good.

The real question has been: what happens when you scale them up? As of now it appears that they scale decidedly sublinearly, but it was not clear at all two or three years ago, and it was definitely worth a try.


I do contract work in the LLM space which involves me seeing a lot of human prompts, and it's made the magic of human reasoning fall away: humans are shockingly bad at reasoning in the large.

One of the things I find extremely frustrating is that almost no research on LLM reasoning ability benchmarks them against average humans.

Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision.


Aren’t prompts seeking to offload reasoning though? Is that really a fair data point for this?

When people are claiming they can't reason, then yes, benchmarking against average human should be a bare minimum. Arguably they should benchmark against below-average humans too, because the bar where we'd be willing to argue that a human can't reason is very low.

If you're testing to see whether it can replace certain types of work, then it depends on where you would normally set the bar for that type of work. You could offload a whole lot of work with something that can reliably reason at below an average human.


Another one!

What’s the point of your argument?

AI companies: “There’s a new machine that can do reasoning!!!”

Some people: “actually they’re not very good at reasoning”

Some people like you: “well neither are humans so…”

> research on LLM reasoning ability benchmarks them against average humans

Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.

> Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision.

So what? How does that assumption make LLMs better?


The point of my argument is that the vast majority of tasks we carry out do not require good reasoning, because if they did most humans would be incapable of handling them. The point is also that a whole lot of people claim LLMs can't reason, based on setting the bar at a point where a large portion of humanity wouldn't clear it. If you actually benchmarked against average humans, a whole lot of the arguments against reasoning in LLMs would instantly look extremely unreasonable, and borderline offensive.

> Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.

They're currently regularly being benchmarked against expectations most humans can't meet. It'd make the models look a whole lot better.


This is the argument that submarines don't really "swim" as commonly understood, isn't it?

I think so, but the badness of that argument is context-dependent. How about the hypothetical context where 70k+ startups are promising investors that they'll win the 50 meter freestyle in 2028 by entering a fine-tuned USS Los Angeles?

And planes don't fly like birds; they have very different properties, and many things birds can do can't be done by a plane. What they do is totally different.

Effective next-token prediction requires reasoning.

You can also say humans are "just XYZ biological system," but that doesn't mean they don't reason. The same goes for LLMs.


Take a word problem for example. A child will be told the first step is to translate the problem from human language to mathematical notation (symbolic representation), then solve the math (logic).

A human doesn’t use next token prediction to solve word problems.


But the LLM isn't "using next-token prediction" to solve the problem, that's only how it's evaluated.

The "real processing" happens through the various transformer layers (and token-wise nonlinear networks), where it seems as if progressively richer meanings are added to each token. That rich feature set then decodes to the next predicted token, but that decoding step is throwing away a lot of information contained in the latent space.

If language models (per Anthropic's work) can have a direction in latent space correspond to the concept of the Golden Gate Bridge, then I think it's reasonable (albeit far from certain) to say that LLMs are performing some kind of symbolic-ish reasoning.
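
As a loose illustration only (not Anthropic's actual method, which used sparse-autoencoder features on real activations), one crude way to look for such a "concept direction" is a difference of means over hidden-state vectors; everything below is placeholder data standing in for activations you would collect from a model:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 768                                      # hidden size, placeholder
    h_concept = rng.normal(0.5, 1.0, (200, d))   # stand-ins for hidden states on prompts about the concept
    h_other   = rng.normal(0.0, 1.0, (200, d))   # ... and on unrelated prompts

    direction = h_concept.mean(axis=0) - h_other.mean(axis=0)
    direction /= np.linalg.norm(direction)

    # Score a new hidden state by projecting onto the direction.
    h_new = rng.normal(0.5, 1.0, d)
    print(float(h_new @ direction))              # larger means "more of the concept"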


Anthropic had a vested interest in people thinking Claude is reasoning.

However, in coding tasks I’ve been able to find it directly regurgitating Stack overflow answers (like literally a google search turns up the code).

Given that coding is supposed to be Claude’s strength, and it’s clearly just parroting web data, I’m not seeing any sort of “reasoning”.

LLMs may be useful, but they don’t think. They’ve already plateaued, and given the absurd energy requirements I think they will prove to be far less impactful than people think.


The claim that Claude is just regurgitating answers from Stackoverflow is not tenable, if you've spent time interacting with it.

You can give Claude a complex, novel problem, and it will give you a reasonable solution, which it will be able to explain to you and discuss with you.

You're getting hung up on the fact that LLMs are trained on next-token prediction. I could equally dismiss human intelligence: "The human brain is just a biological neural network that is adapted to maximize the chance of creating successful offspring." Sure, but the way it solves that task is clearly intelligent.


I’ve literally spent 100s of hours with it. I’m mystified why so many people use the “you’re holding it wrong” explanation when somebody points out real limitations.

You might consider that other people have also spent hundreds of hours with it, and have seen it correctly solve tasks that cannot be explained by regurgitating something from the training set.

I'm not saying that your observations aren't correct, but this is not a binary. It is entirely possible that the tasks you observe the models on are exactly the kind where they tend to regurgitate. But that doesn't mean that it is all they can do.

Ultimately, the question is whether there is a "there" there at all. Even if 9 times out of 10, the model regurgitates, but that one other time it can actually reason, that means that it is capable of reasoning in principle.


When we've spent time with it and gotten novel code, then if you claim that doesn't happen, it is natural to say "you're holding it wrong". If you're just arguing it doesn't happen often enough to be useful to you, that likely depends on your expectations and on how complex the tasks you need it to carry out are.

In many ways, Claude feels like a miracle to me. I no longer have to stress over semantics or search for patterns that I can recognize and work with but have never actually coded myself in that language. Now I don’t have to waste energy looking up things that I find boring.

The LLM isn't solving the problem. The LLM is just predicting the next word. It's not "using next-token prediction to solve a problem". It has no concept of "problem". All it can do is predict 1 (one) token that follows another provided set. That running this in a loop provides you with bullshit (with bullshit defined here as things someone or something says neither with good nor bad intent, but just with complete disregard for any factual accuracy or lack thereof, and so the information is unreliable for everyone) does not mean it is thinking.

All the human brain does is determine how to fire some motor neurons. No, it does not reason.

No, the human brain does not "understand" language. It just knows how to control the firing of the neurons that control the vocal cords, in order to maximize an endocrine reward function that has evolved to maximize biological fitness.

I can speak about human brains the same way you speak about LLMs. I'm sure you can spot the problem in my conclusions: just because the human brain is "only" firing neurons, it does actually develop an understanding of the world. The same goes for LLMs and next-word prediction.


I agree with you as far as the current state of LLMs, but I also feel like we humans have preconceived notions of “thought” and “reasoning”, and are a bit prideful of them.

We see the LLM sometimes do sort of well at a whole bunch of tasks. But it makes silly mistakes that seem obvious to us. We say, “Ah ha! So it can’t reason after all”.

Say LLMs get a bit better, to the point they can beat chess grandmasters 55% of the time. This is quite good. Low level chess players rarely ever beat grandmasters, after all. But, the LLM spits out illegal moves sometimes and sometimes blunders nonsensically. So we say, “Ah ha! So it can’t reason after all”.

But what would it matter if it can reason? Beating grandmasters 55% of the time would make it among the best chess players in the world.

For now, LLMs just aren’t that good. They are too error-prone, inconsistent, and nonsensical. But they are also sort of weirdly capable at lots of things in strange, inconsistent ways, and assuming they continue to improve, I think they will tend to defy our typical notions of human intelligence.


Beating grandmasters 55% of the time is not good. We've been beating almost 100% of grandmasters since the 90s.

And even for LLMs that are good at chess: I recently read an article by someone who examined them from this angle. They said that if the LLM fails to come up with a legal move within 10 tries, a random move is played instead. And this was common.

Not even a beginner would attempt 10 illegal moves in a row, after their first few games. The state of LLM chess is laughably bad, honestly. You would guess that even if it didn't play well, it'd at least consistently make legal moves. It doesn't even get to that level.


I don't see why this isn't a good model for how human reasoning happens either, certainly as a first-order assumption (at least).

> A human doesn’t use next token prediction to solve word problems.

Of course they do, unless they're particularly conscientious noobs that are able to repeatedly execute the "translate to mathematical notation, then solve the math" algorithm, without going insane. But those people are the exception.

Everyone else either gets bored half-way through reading the problem, or has already done dozens of similar problems before, or both - and jump straight to "next token prediction", aka. searching the problem space "by feels", and checking candidate solutions to sub-problems on the fly.

This kind of methodical approach you mention? We leave that to symbolic math software. The "next token prediction" approach is something we call "experience"/"expertise" and a source of the thing we call "insight".


Indeed. Work on any project that requires humans to carry out largely repetitive steps, and a large part of the problem involves how to put processes around people to work around humans "shutting off" reasoning and going full-on automatic.

E.g. I do contract work on an LLM-related project where one of the systemic changes introduced - in addition to multiple levels of quality checks - is to force people to input a given sentence word for word, followed by a word from a set of 5 or so. Only a minority of the submissions get that sentence correct, including the final word, despite the system refusing to let you submit unless the initial sentence is correct. Seeing the data has been an absolutely shocking indictment of human reasoning.

These are submissions from a pool of people who have passed reasoning tests...

When I've tested the process myself as well, it takes only a handful of steps before the tendency is to "drift off" and start replacing a word here and there and fail to complete even the initial sentence without a correction. I shudder to think how bad the results would be if there wasn't that "jolt" to try to get people back to paying attention.

Keeping humans consistently carrying out a learned process is incredibly hard.


Is that based on a rigorous understanding of how humans think, derived from watching people (children) learn to solve word problems? How do thoughts get formed? Because I remember being given word problems with extra information, and some children trying to shove that information into a math equation despite it not being relevant. The "think things through" portion of ChatGPT o1-preview is hidden from us, so even though o1-preview can solve word problems, we don't know how it internally computes to arrive at that answer. But do we really know how we do it? We can't even explain consciousness in the first place.

This argument reminds me the classic "intelligent design" critique of evolution: "Evolution can't possibly create an eye; it only works by selecting random mutations." Personally, I don't see why a "next token predictor" couldn't develop the capability to reason and form abstractions.

After reading the article I am more convinced it does reasoning. The base model's reasoning capabilities are partly hidden by the chatty derived model's logic.

Would that be enough to prove it? If the LLM was trained only on a set of legal moves, isn't it possible that it functionally learned how each piece is allowed to move without learning how to actually reason about it?

Said differently, in case I phrased that poorly - couldn't the LLM have learned that it only ever saw bishops move diagonally, and therefore only consider those moves, without actually reasoning through the concept of legal and illegal moves?


The problem is that the LLM doesn't learn to play moves from a position; the internet archives contain only game records. It might be building something internally to represent the position, but that won't automatically be activated by an encoded chess position.

The ChessPositionRanking project, with help from the Texel chess engine author, tries to prove random positions (that are not obviously illegal) legal by constructing a game ending in the position. If that fails, it tries to prove the position illegal. This now works for over 99.99% of randomly generated positions, so one can feed the LLM the legal game record found for a random legal position.

Not that I understand the internals of current AI tech, but...

I'd expect that an AI that has seen billions of chess positions, and the moves played in them, can figure out the rules for legal moves without being told?


Statistical 'AI' doesn't 'understand' anything, strictly speaking. It predicts a move with high probability, which could be legal or illegal.

The illegal moves are interesting as they go to "understanding". In children learning to play chess, how often do they try to make illegal moves? When first learning the game, I remember that I'd lose track of all the things going on at once and try to make illegal moves, but eventually the rules became second nature and I stopped trying to make illegal moves. With an Elo of 1800, I'd expect ChatGPT not to make any illegal moves.

How do you define 'understand'?

There is plenty of AI which learns the rules of games like Alpha Zero.

LLMs might not have the architecture to 'learn' rules, but then again they might. If a model optimizes over all the possible moves one chess piece can make (which is not that much to learn), it can easily 'move' from one game state to another using only this kind of dictionary.


Understanding a rules-based system (chess) means to be able to learn non-probabilistic rules (an abstraction over the concrete world). Humans are a mix of symbolic and probabilistic learning, allowing them to get a huge boost in performance by admitting rules. It doesn't mean a human will never make an illegal move, but it means a much smaller probability of illegal move based on less training data. Asymptotically, performance from humans and purely probabilistic systems converge. But that also means that in appropriate situations, humans are hugely more data-efficient.

> in appropriate situations, humans are hugely more data-efficient

After spending some years raising my children I gave up the notion that humans are data efficient. It takes a mind numbing amount of training to get them to learn the most basic skills.


You could compare childhood with the training phase of a model. Still think humans are not data-efficient ?

Yes, that is exactly the point I am making. It takes many repetitions (epochs) to teach them anything.

Compared to the amount of data needed to train an even remotely impressive 'AI' model , that is not even AGI and hallucinates on a regular basis ? On the contrary, it seems to me that humans and their children are hugely efficient.

> On the contrary, it seems to me that humans and their children are hugely efficient.

Does a child remotely know as much as ChatGPT? Is it able to reason remotely as well?


I'd say the kid knows more about the world than ChatGPT, yes. For starters, the kid has representations of concepts such as 'blue color' because eyes... ChatGPT can answer difficult questions for sure, but overall I'd say it's much more specialized and limited than a kid. However, I also think that's mostly comparing apples and oranges, and that one's judgement about that is very personal. So, in the end I don't know.

A baby learns to walk and talk in 1 year. Compared to the number of PhDs and the amount of compute training these models, the baby is so far ahead in efficiency that I marvel way more at its pace.

Neither AlphaZero nor MuZero can learn the rules of chess from an empty chess board and a pile of pieces. There is no objective function so there’s nothing to train upon.

That would be like alien archaeologists of the future finding a chess board and some pieces in a capsule orbiting Mars after the total destruction of Earth and all recorded human thought. The archaeologists could invent their own games to play on the chess board but they’d have no way of ever knowing they were playing chess.


AlphaZero was given the rules of the game, but it figured out how to beat everyone else all by itself!

All by itself, meaning playing against itself...

Interestingly, Bobby Fischer did it in the same way. Maybe AlphaZero also hates chess ? :-)


I think the article briefly touch on that topic at some point:

> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires “understanding” chess. If this doesn’t convince you, I encourage you to write a program that can take strings like 1. e4 d5 2. exd5 Qxd5 3. Nc3 and then say if the last move was legal.

However, I can't say if LLMs fall in the "statistical AI" category.
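
For what it's worth, here's a hedged sketch of such a checker using the python-chess library (which of course embodies the full rules already; the function and the tokenizing are my own):

    import chess

    def last_move_is_legal(movetext: str) -> bool:
        # Replay SAN moves like "1. e4 d5 2. exd5 Qxd5 3. Nc3" and report
        # whether the final move is legal from the preceding position.
        tokens = [t for t in movetext.split() if not t.rstrip('.').isdigit()]
        board = chess.Board()
        for san in tokens[:-1]:
            board.push_san(san)      # raises ValueError if an earlier move is broken
        try:
            board.push_san(tokens[-1])
            return True
        except ValueError:
            return False

    print(last_move_is_legal("1. e4 d5 2. exd5 Qxd5 3. Nc3"))  # True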


Likewise with LLM you don’t know if it is truly in the “chess” branch of the statistical distribution or it is picking up something else entirely, like some arcane overlap of tokens.

So much of the training data (eg common crawl, pile, Reddit) is dogshit, so it generates reheated dogshit.


You generalize this without mentioning that there are LLMs which do not just use random 'dogshit'.

Also, what does a normal human do? They look at how to move one piece at a time, using a very small dictionary / set of basic rules. I do not remember learning by counting every piece and its options from the rulebook; I learned to 'see' how I can move each type of chess piece.

If an LLM uses only these piece moves on a mathematical level, it would be doing the same thing I do.

And yes, there is also absolutely the option for an LLM to learn some kind of metagame.


A system that just outputs the most probable tokens based on the text it was fed, and that was trained on games played by players rated above 1800, would certainly fail to output the right moves in totally unlikely board positions.

Yes, in theory it could. It depends on how it learns: does it learn by memorization or by learning the rules? That depends on the architecture and the amount of 'pressure' you put on it to be more efficient or not.

How well does it play modified versions of chess? E.g., a modified opening board where the back row is all knights, or modified movement where rooks can move like a queen. A human should be able to reason their way through playing a modified game, but I'd expect an LLM, if it's just parroting its training data, to suggest illegal moves or stick to previously legal moves.
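
One cheap way to set up the "back row is all knights" variant for testing (kings kept so the position is still well-formed; the FEN and the sample move are my own, and python-chess is just one convenient rules engine):

    import chess

    FEN = "nnnknnnn/pppppppp/8/8/8/8/PPPPPPPP/NNNKNNNN w - - 0 1"
    board = chess.Board(FEN)

    proposed = "O-O"                 # e.g. a move suggested by a model
    try:
        board.parse_san(proposed)
        print(f"{proposed} is legal here")
    except ValueError:
        print(f"{proposed} is illegal here")   # no rooks, so castling is impossible
    print(board.legal_moves.count(), "legal moves in this start position")

Modified piece movement (rooks moving like queens) is harder to score, since it needs a custom variant implementation rather than just a different starting FEN.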

Its training set would include a lot of randomly generated positions like that, which then get played out by chess engines, wouldn't it? Just from people messing around and posting the results. Not identical ones, but similarly oddball.

> Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions

Suppose it tries to capture en passant. How do you know whether that's legal?
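
One note: for positions specified as a FEN, the en-passant target square is part of the position itself, so a standard rules library can answer this directly. A hedged sketch with python-chess (the example position is my own):

    import chess

    # Black has just played ...d7-d5; the FEN records "d6" as the en-passant square.
    board = chess.Board("rnbqkbnr/ppp2ppp/4p3/3pP3/8/8/PPPP1PPP/RNBQKBNR w KQkq d6 0 3")

    ep_captures = [m for m in board.legal_moves if board.is_en_passant(m)]
    print(board.ep_square is not None)           # True: the square is part of the state
    print([board.san(m) for m in ep_captures])   # ['exd6']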


I feel like you could add “do not capture en passant unless it is the only possible move” to the test without changing what it’s trying to prove—if anything, some small permutation like this might even make it a stronger test of “reasoning capability.” (Personally I’m unconvinced of the utility of this test in the first place, but I think it can be reasonably steelmanned.)

Assigning "understanding" to an undefined entity is an undefined statement.

It isn't even wrong.



