This is the take I thought I'd have, but in the last example, the guesser model reaches the correct conclusion using different reasoning than the clue giver model.
The clue giver justifies the link between Paper and Log as "written records", and the link between Paper and Line as "lines of text". But the guesser model connects Paper and Log because "paper is made from logs" (reaching the conclusion through a different meaning of Log), and connects Paper and Line because "'lined paper' is a common type of paper".
Similarly, in the first example, the clue giver connects Monster and Lion because lions are "often depicted as a mythical beast or monster in legends" (a tenuous connection if you ask me), whereas the guesser model thought about King because of King Kong (which I also prefer to Lion).
The best available evidence suggests this is also true of any explanations a human gives for their own behaviour; nevertheless we generally accept those at face value.
The explanations I give of my behaviour are post-hoc (unless I was paying attention), but I also assess their plausibility by going "if this were the case, how would I behave?" and seeing how well that prediction lines up with my actual behaviour. Over time, I get good at providing explanations that I have no reason to believe are false – which also tend to be explanations that allow other people to predict my behaviour (in ways I didn't anticipate).
GPT-based predictive text systems are incapable of introspection of any kind: they cannot execute the algorithm I execute when I'm giving explanations for my behaviour, nor can they execute any algorithm that might actually result in the explanations becoming or approaching truthfulness.
The GPT model is describing a fictional character named ChatGPT, and telling you why ChatGPT thinks a certain thing. ChatGPT-the-character is not the GPT model. The GPT model has no conception of itself, and cannot ever possibly develop a conception of itself (except through philosophical inquiry, which the system is incapable of for different reasons).
Of course! If you’ve played Codenames and introspected on how you play, you can see this in action. You pick a few words that feel similar and then try to justify them. Post-hoc rationalization in action.
Yes, and you may search for other words that fit the rationalization to decide whether or not it's a good one. You can go even further, if your teammates are people you know fairly well, by bringing in your own knowledge of these people and how they might interpret the clues. There's a lot of strategy in Codenames, and knowledge of vocabulary and related words is only part of it.
If an LLM states an answer and then provides a justification for that answer, the justification is entirely irrelevant to the reasoning the bot used. It might be that the semantics of the justification happen to align with the implied logic of the internal vector space, but it is at best a manufactured coincidence. It’s no different from you stating an answer and then telling the bot to justify it.
If an LLM is told to do reasoning and then state the answer, it follows that the answer is basically guaranteed to be derived from the previously generated reasoning.
The answer will likely match what the reasoning steps bring it to, but that doesn’t mean the computations by the LLM to get that answer are necessarily approximated by the outputted reasoning steps. E.g. you might have an LLM that is trained on many examples of Shakespearean text. If you ask it who the author of a given text is, it might give some more detailed rationale for why it is Shakespeare, when the real answer is “I have a large prior for Shakespeare”.
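To make the ordering concrete, here's a rough Python sketch. The generate() stub is hypothetical, standing in for whatever model call you actually have; the point is only which text each step is conditioned on.

    from typing import Callable

    def generate(prompt: str) -> str:
        # Stub: swap in a call to whatever LLM you actually use.
        raise NotImplementedError("plug in a real model call here")

    def answer_then_justify(question: str, gen: Callable[[str], str]) -> tuple[str, str]:
        # The answer is fixed first; the justification is generated afterwards,
        # conditioned on that answer, so it can only be a post-hoc story about it.
        answer = gen(question + "\nAnswer with a single word.")
        justification = gen(question + "\nAnswer: " + answer + "\nWhy is that the answer?")
        return answer, justification

    def reason_then_answer(question: str, gen: Callable[[str], str]) -> tuple[str, str]:
        # The reasoning is generated first; the answer is then conditioned on that
        # text. The answer is "derived from" the written-out reasoning in that sense,
        # but the written-out reasoning still need not mirror the internal computation.
        reasoning = gen(question + "\nThink step by step before answering.")
        answer = gen(question + "\nReasoning: " + reasoning + "\nFinal answer, one word:")
        return answer, reasoning

In the first arrangement the justification can't possibly have influenced the answer; in the second it influences the answer only as text the model conditions on, which is a weaker claim than "this is how the answer was computed".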
Yes, the reason is that the model assigns words positions in an ever-changing vector space and evaluates how related they are by their correspondence in that space. The reply it gives is also a certain index into that space, with the “why” in the question giving it the weight of producing an “answer.”
Which is to say that the “why” it gives for those answers is that it’s statistically likely, within its training data, that when the words “why did you connect line and log with paper” appear, the text which follows could be “logs are made of wood and lines are in paper.” But that is not the specific relation of the 3 words in the model itself, which is just a complex vector space.
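As a toy illustration of that mechanism: the vectors below are made up, not pulled from any real model, and real transformers compute contextual rather than fixed word vectors, but the idea is the same. Relatedness is just read off the geometry of the space.

    import numpy as np

    # Hand-picked 3-d toy vectors, not anything taken from an actual model.
    embeddings = {
        "paper": np.array([0.9, 0.1, 0.3]),
        "log":   np.array([0.7, 0.4, 0.2]),
        "line":  np.array([0.8, 0.0, 0.6]),
        "king":  np.array([0.1, 0.9, 0.1]),
    }

    def cosine(a, b):
        # Cosine similarity: close to 1.0 means the vectors point the same way.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    for word in ("log", "line", "king"):
        print("paper vs", word, round(cosine(embeddings["paper"], embeddings[word]), 2))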
I definitely think it's doing more than that here (at least inside of the vector-space computations). The model probably directly contains the paper-wood-log association.
Generally there is a "temperature" parameter that can be used to add some randomness or variety to the LLM's outputs by changing the likelihood of the next word being selected. This means you could just keep regenerating the response and get a different answer each time; each time it will give a different plausible response, and it's all from the same model. That doesn't mean it believes any of them, it just keeps hallucinating likely text, some of which will fit better than others. It is still very much the same brain (or set of trained parameters) playing with itself.
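A sketch of what that knob does, with made-up next-token scores standing in for a real model's output over its whole vocabulary:

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up scores for four candidate continuations; a real model produces
    # one such score per vocabulary token at every step.
    tokens = ["wood", "records", "lined", "Kong"]
    logits = np.array([2.0, 1.5, 1.0, 0.2])

    def sample(temperature):
        # Temperature-scaled softmax: low temperature sharpens the distribution
        # (nearly always the top token), high temperature flattens it (more variety).
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return tokens[rng.choice(len(tokens), p=probs)]

    for t in (0.2, 1.0, 2.0):
        print(t, [sample(t) for _ in range(5)])

With these numbers, at temperature 0.2 the top token gets over 90% of the probability mass, while at 2.0 all four continuations get a real share, all from the same fixed weights.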
Yeah, not sure what’s impressive about this. Having the model be both the guesser and clue giver will of course have good results, as it’s simply a reflection of o1’s weighting of tokens.
Interestingly, this could be a way to reverse engineer o1’s weightings.
It’s basically the same brain playing with itself. Seems quite natural to link the code names to the same words.
Let different LLMs play.