> They break your input into tokens and then look at the most likely set of output tokens given your input. That's all.
That isn’t right: the preprocessor provides a lot of material on strawberries and counting r’s, which is then prepended to the question… and then the model predicts the next sentence as an answer to that question. The model by itself doesn’t know anything; it is just a statistical processor of context. Just tokenizing the question and using the model to predict the answer would actually give you less than a wrong answer, it would be gibberish. It messes up on the question because the context it retrieves based on the question text isn’t useful in producing the correct answer.
>because the context it retrieves based on the question text isn’t useful in producing the correct answer.
"retrieves" is the wrong word. Each token (in GQA a small tuple of tokens is summarized into a single token), becomes an element in the KV cache. The dot product of every token with every other token is taken (matrix multiplication) and then a new intermediate token is produced using softmax() and multiplication by the V matrix. What the attention mechanism is supposed to do is combine two tokens and form a new token. In other words, it is supposed to perform the computation of a function f(a,b) = c.
The attention layer is supposed to see "count r" and "strawberry" and determine the answer 3.
Well, at least in theory. Given the combination "count r" and a literal "r" token, it is practically guaranteed that the attention mechanism succeeds. What this tells us is that the tokenization of the word "strawberry" is what causes the model to fail, since it doesn't actually see the letters at the character level. So it is correct to say that the attention mechanism does not have the right context to produce the correct answer, but it is wrong to say that "retrieval" is necessary here.
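You can see the problem directly. A minimal illustration, assuming the tiktoken package and its cl100k_base encoding as a stand-in for whatever tokenizer the model in question actually uses (the exact split varies by model):

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]
print(ids, pieces)
# The word arrives as a handful of subword chunks, not as individual
# characters, so "how many r's?" has to be answered without the model
# ever seeing the letters themselves.
```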
The reason it doesn't make sense to label what is happening as "reasoning" is that the model does not consider its own limitations and plan around them. Most of the work so far has been to brute-force more data and more FLOPS, in the hope that it will just work out anyway. This isn't necessarily a bad strategy, as it follows the harsh truth of the bitter lesson, but the bitter lesson never said that we can't improve LLMs through human ingenuity, just that human ingenuity must scale with compute and training data. For example, the human ingenuity of self-play training (as opposed to synthetic data) works just fine, precisely because it scales so well.
Instead of complaining so much about humans trying to "gotcha" the LLMs, what we really ought to do is build an adversarial model that learns to "gotcha" LLMs automatically and include it in the training process.
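A toy sketch of the loop, purely illustrative: in the actual proposal the adversary would itself be a trained model, but here it's a hand-written question generator, and ask_model is a hypothetical stand-in for the LLM being trained:

```python
import random
from typing import Callable

def make_gotcha() -> tuple[str, str]:
    """Adversary: generate a letter-counting question with a known answer."""
    words = ["strawberry", "bookkeeper", "mississippi", "banana", "raspberry"]
    word = random.choice(words)
    letter = random.choice(sorted(set(word)))
    return f"How many '{letter}'s are in the word '{word}'?", str(word.count(letter))

def collect_failures(ask_model: Callable[[str], str], n: int = 100) -> list[tuple[str, str]]:
    """Keep the (question, correct answer) pairs the model gets wrong,
    so they can be fed back into the next round of training."""
    failures = []
    for _ in range(n):
        question, gold = make_gotcha()
        if ask_model(question).strip() != gold:
            failures.append((question, gold))
    return failures

# Toy stand-in for the model under training: always answers "2".
print(len(collect_failures(lambda q: "2", n=20)))
```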