Claude doesn't know why it acted the way it acted, it is only *predicting* why i...

LoganDark · 2026-01-26T16:57:47 1769446667

It's not even predicting why it acted, it's predicting an explanation of why it acted, which is even worse since there's no consistent mental model.

GuB-42 · 2026-01-27T11:37:32 1769513852

It had been shown that LLMs don't know how they work. They asked a LLM to perform computations, and explain how they got to the result. The LLM explanation is typical of how we do it: add number digit by digit, with carry, etc... But by looking inside the neural network, it show that the reality is completely different and much messier. None of it is surprising.

Still, feeding it back its own completely made up self-reflection could be an effective strategy, reasoning models kind of work like this.

FireBeyond · 2026-01-27T18:55:28 1769540128

Right. Last time I checked this was easy to demonstrate with word logic problems:

"Adam has two apples and Ben has four bananas. Cliff has two pieces of cardboard. How many pieces of fruit do they have?" (or slightly more complex, this would probably be easily solved, but you get my drift.)

Change the wordings to some entirely random, i.e. something not likely to be found in the LLM corpus, like walruses and skyscrapers and carbon molecules, and the LLM will give you a suitably nonsensical answer showing that it is incapable of handling even simple substitutions that a middle schooler would recognize.

phpnode · 2026-01-27T12:46:52 1769518012

The explanation becomes part of the context which can lead to more effective results in the next turn, it does work, but it does so in a completely misleading way

wongarsu · 2026-01-27T14:16:10 1769523370

Which should be expected, since the same is true for humans. The "adding numbers digit by digit with carry" works well on paper, but it's not an effective method for doing math in your head, and is certainly not how I calculate 14+17. In fact I can't really tell you how I calculate 14+17 since that's not in the "inner monologue" part of my brain, and I have little introspection in any of the other parts

Still, feeding humans their completely made-up self-reflection back can be an effective strategy

GuB-42 · 2026-01-27T17:57:00 1769536620

The difference is that if you are honest and pragmatic and someone asked you how you added two numbers, you would only say you did long addition if that's what you actually did. If you had no idea what you actually did, you would probably say something like "the answer came to me naturally".

LLMs work differently. Like a human, 14+17=31 may come naturally, but when asked about their though process, LLMs will not self-reflect on their condition, instead they will treat it like "in your training data, when someone is asked how he added number, what follows?", and usually, it is long addition, so that is the answer you will get.

It is the same idea as to why LLMs hallucinate. They will imitate what their dataset has to say, and their dataset doesn't have a lot of "I don't know" answers, and a LLM that learns to answer "I don't know" to every question wouldn't be very useful anyways.

sigmoid10 · 2026-01-27T18:58:31 1769540311

>if you are honest and pragmatic and someone asked you how you added two numbers, you would only say you did long addition if that's what you actually did. If you had no idea what you actually did, you would probably say something like "the answer came to me naturally".

To me that misses the argument of the above comment. The key insight is that neither humans nor LLMs can express what actually happens inside their neural networks, but both have been taught to express e.g. addition using mathematical methods that can easily be verified. But it still doesn't guarantee for either of them not to make any mistakes, it only makes it reasonably possible for others to catch on to those mistakes. Always remember: All (mental) models are wrong. Some models are useful.

estimator7292 · 2026-01-27T15:46:03 1769528763

Life lesson for you: the internal functions of every individual's mind are unique. Your n=1 perspective is in no way representative of how humans as a category experience the world.

Plenty of humans do use longhand arithmetic methods in their heads. There's an entire universe of mental arithmetic methods. I use a geometric process because my brain likes problems to fit into a spatial graph instead of an imaginary sheet of paper.

Claiming you've not examined your own mental machinery is... concerning. Introspection is an important part of human psychological development. Like any machine, you will learn to use your brain better if you take a peek under the hood.

wongarsu · 2026-01-27T15:55:38 1769529338

> Claiming you've not examined your own mental machinery is... concerning

The example was carefully chosen. I can introspect how I calculate 356*532. But I can't introspect how I calculate 14+17 or 1+3. I can deliberate the question 14+17 more carefully, switching from "system 1" to "system 2" thinking (yes, I'm aware that that's a flawed theory), but that's not how I'd normally solve it. Similarly I can describe to you how I can count six eggs in a row, I can't describe to you how I count three eggs in a row. Sure, I know I'm subitizing, but that's just putting a word on "I know how many are there without conscious effort". And without conscious effort I can't introspect it. I can switch to a process I can introspect, but that's not at all the same

kaffekaka · 2026-01-26T16:54:54 1769446494

Yes, this pitfall is a hard one. It is very easy to interpret the LLM in a way there is no real ground for.

scotty79 · 2026-01-26T23:10:16 1769469016

It must be anthropomorphization that's hard to shake off.

If you understand how this all works it's really no surprise that reasoning post-factum is exactly as hallucinated as the answer itself and might have very little to do with it and it always has nothing to do with how the answer actually came to be.

The value of "thinking" before giving an answer is reserving a scratchpad for the model to write some intermediate information down. There isn't any actual reasoning even there. The model might use information that it writes there in completely obscure way (that has nothing to do what's verbally there) while generating the actual answer.

nnevatie · 2026-01-27T01:30:16 1769477416

That's because when the failure becomes the context, it can clearly express the intent of not falling for it again. However, when the original problem is the context, none of this obviousness applies.

Very typical, and gives LLMs the annoying Captain Hindsight -like behaviour.

nonethewiser · 2026-01-26T17:01:59 1769446919

IDK how far AIs are from intelligence, but they are close enough that there is no room for anthropomorphizing them. When they are anthropomorphized its assumed to be a misunderstanding of how they work.

Whereas someone might say "geeze my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain the computer cannot actually feel hatred. We understand the analogy.

I mean your distinction is totally valid and I dont blame you for observing it because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.

amenhotep · 2026-01-26T17:44:24 1769449464

This is a sort of interesting point, it's true that knowingly-metaphorical anthropomorphisation is hard to distinguish from genuine anthropomorphisation with them and that's food for thought, but the actual situation here just isn't applicable to it. This is a very specific mistaken conception that people make all the time. The OP explicitly thought that the model would know why it did the wrong thing, or at least followed a strategy adjacent to that misunderstanding. He was surprised that adding extra slop to the prompt was no more effective than telling it what to do himself. It's not a figure of speech.

zarzavat · 2026-01-26T18:13:04 1769451184

A good time to quote our dear leader:

> No one gets in trouble for saying that 2 + 2 is 5, or that people in Pittsburgh are ten feet tall. Such obviously false statements might be treated as jokes, or at worst as evidence of insanity, but they are not likely to make anyone mad. The statements that make people mad are the ones they worry might be believed. I suspect the statements that make people maddest are those they worry might be true.

People are upset when AIs are anthropomorphized because they feel threatened by the idea that they might actually be intelligent.

Hence the woefully insufficient descriptions of AIs such as "next token predictors" which are about as fitting as describing Terry Tao as an advanced gastrointestinal processor.

jdub · 2026-01-27T02:45:10 1769481910

I'm not threatened by the idea that LLMs might actually be intelligent. I know they're not.

I'm threatened by other people wrongly believing that LLMs possess elements of intelligence that they simply do not.

Anthropomorphosis of LLMs is easy, seductive, and wrong. And therefore dangerous.

antonvs · 2026-01-26T23:49:04 1769471344

The comment you replied to made a point that, if you accept it (which you probably should), makes that PG quote inapplicable here. The issue in this case is that treating the model as though it has useful insight into its own operation - which is being summarized as anthropomorphizing - leads to incorrect conclusions. It’s just a mistake, that’s all.

phpnode · 2026-01-26T17:49:24 1769449764

There's this underlying assumption of consistency too - people seem to easily grasp that when starting on a task the LLM could go in a completely unexpected direction, but when that direction has been set a lot of people expect the model to stay consistent. The confidence with which it answers questions plays tricks on the interlocutor.

nonethewiser · 2026-01-26T17:58:39 1769450319

Whats not a figure of speech?

I am speaking general terms - not just this conversation here. The only specific figure of speech I see in the original comment is "self reflection" which doesn't seem to be in question here.

electroglyph · 2026-01-27T04:55:12 1769489712

some models are capable of metacognition. i've seen Anthropic's research replicated.

lukashahnart · 2026-01-27T05:12:45 1769490765

Can you elaborate on what you mean by metacognition and where you’ve seen it in Anthropic’s models?

drob518 · 2026-01-27T10:13:25 1769508805

It’s not even doing that. It’s just an algorithm for predicting the next word. It doesn’t have emotions or actually think. So, I had to chuckle when it said it was arrogant. Basically, it’s training data contains a bunch of postmortem write ups and it’s using those as a template for what text to generate and telling us what we want to hear.