Aren't the models already doing this, in a way? We know they can do things like write rhyming poems and song lyrics that do make perfect sense, so at some point the activations must be encoding some sort of overall plan for the upcoming sentences, even if maybe every word isn't predicted yet.
Yes. Otherwise next-token models wouldn't be nearly as good as they are. But the question is how to train these capabilities most efficiently! We had some interesting findings on how, with increasing model scale, dataset scale, and data quality, capabilities can move from "only learnable with multi-token prediction" to "indifferent" to "multi-token prediction actually hurts". This depends on the capability itself: induction, for example, matures much earlier in this sense than code generation capabilities.
Is it possible that the anti-scaling effect occurs because you are removing some middle layers to free up space for the extra output heads? I only skimmed the paper, but what happens if you treat the technique as strictly additive and don't keep the parameter count fixed?
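Concretely, by "strictly additive" I mean something like the sketch below (PyTorch; the names and the parameter bookkeeping are my guesses, not necessarily the paper's exact setup):

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    # k output heads on top of a shared trunk; head i predicts token t+1+i.
    # Parameter-matched variant: trunk layers are removed to pay for these
    # heads. Strictly additive variant: the trunk is left untouched.
    def __init__(self, d_model, vocab_size, k=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, hidden):                  # hidden: (batch, seq, d_model)
        return [head(hidden) for head in self.heads]

def multi_token_loss(logits_per_head, tokens):  # tokens: (batch, seq) int ids
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        shift = i + 1                           # head i is trained on token t+shift
        pred, target = logits[:, :-shift, :], tokens[:, shift:]
        loss = loss + F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                                      target.reshape(-1))
    return loss / len(logits_per_head)
```

Does the anti-scaling effect survive when the extra heads come for free on top of an untouched trunk?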
> so at some point the activations must be encoding some sort of overall plan for the upcoming sentences
This isn't obviously the case. Compare this "intelligent designer" view with evolution: there was no prior plan for rabbits. To create the appearance of design, it's sufficient that sequential steps are simply probabilistically modulated by prior ones.
Consider a continuation of "the cat...": a distribution over all possible next words suffices to create the illusion of a plan. Suppose "the cat sat..."; then "on...", "the...", etc. follow from the training data.
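To make that concrete, here's a toy next-word sampler (the table and probabilities are invented purely for illustration): nothing in it encodes a sentence-level plan, yet its outputs look planned.

```python
import random

# Toy table of conditional next-word probabilities, keyed on the last two
# words. Invented numbers, illustration only: no sentence-level plan exists.
next_word = {
    ("the", "cat"):    [("sat", 0.6), ("chased", 0.4)],
    ("cat", "sat"):    [("on", 1.0)],
    ("sat", "on"):     [("the", 1.0)],
    ("on", "the"):     [("mat", 0.7), ("sofa", 0.3)],
    ("cat", "chased"): [("the", 1.0)],
    ("chased", "the"): [("mouse", 1.0)],
}

def continue_text(words, steps=4):
    # Each step looks only backwards at the last two words; the apparent
    # "plan" is just the training frequencies replayed one word at a time.
    for _ in range(steps):
        options = next_word.get(tuple(words[-2:]))
        if not options:
            break
        choices, weights = zip(*options)
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(continue_text(["the", "cat"]))   # e.g. "the cat sat on the mat"
```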
I think there's a strong argument against trying to model entire sentences, precisely because the system isn't modelling semantics: one should expect accuracy to drop off a cliff if there is no actual plan. I.e., predicting "sat on the mat" from "cat" shouldn't be a valid prediction, because the infinite number of possible continuations makes any single one, taken as a whole, a terrible prediction (e.g., what about "chased the mouse"?). The space of all possible sentences continuing "the cat" is infinite, with much of that space actually useful; whereas the number of possible next words is small, very finite, and many of them not useful.
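To put rough numbers on that (50,000 is just a made-up round stand-in for a vocabulary size):

```python
vocab = 50_000            # made-up round number for a vocabulary size
next_words = vocab        # candidate single next words
five_word  = vocab ** 5   # candidate 5-word continuations
print(next_words)         # 50000
print(five_word)          # about 3.1e23
```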
The only reason "the cat sat...", "the cat sat on..." is reasonable is that each sequential word can be modulated by the prompt to seem as if planned.
No one is spending $10-50M building a Markov text model of everything ever digitised; if they did, its performance would approach that of a basic LLM.
Though, more simply, you can just take any LLM and rephrase it as a Markov model. All algorithms which model conditional probability are equivalent; you can even unpack an NN as a kNN model or a decision tree.
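Concretely, "rephrase it as a Markov model" just means taking the state to be the entire prefix; a minimal sketch, with the LLM's conditional distribution stubbed out by a toy table:

```python
def llm_conditional(prefix):
    # Stand-in for an LLM's next-token distribution P(w | prefix); in a real
    # system this would be one forward pass. Toy values, illustration only.
    if prefix == ("the", "cat"):
        return {"sat": 0.6, "chased": 0.4}
    return {"<eos>": 1.0}

def markov_transition(state):
    # Treat the whole prefix as the Markov state. The transition distribution
    # over next states *is* the next-token distribution; nothing about tokens
    # beyond the next one is represented anywhere.
    return {state + (tok,): p for tok, p in llm_conditional(state).items()}

print(markov_transition(("the", "cat")))
# {('the', 'cat', 'sat'): 0.6, ('the', 'cat', 'chased'): 0.4}
```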
They all model 'planning' in the same way: P(C|A, B) is a 'plan' for C following A, B. There is no model of P("A B C" | "A B"). Literally, at inference time, no computation whatsoever is performed to anticipate any future prediction. This follows trivially from the mathematical formalism (which no one seems to want to understand), and you can also see it empirically: the compute spent per generated token is the same regardless of prompt/continuation.
The reason "the cat sat..." is completed by "on the mat" is that P(on | the cat sat), P(the | the cat sat on), and P(mat | the cat sat on the) are each maximal.
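Operationally that's just greedy decoding: an argmax over whatever conditional distribution you hand it, one token at a time, with the same amount of work per step no matter what comes later. A sketch (works with any such `conditional`, e.g. the toy `llm_conditional` above):

```python
def greedy_decode(prefix, conditional, steps):
    # At each step, take the single most probable next token and append it.
    # No future token is ever scored ahead of time; "on the mat" is never
    # evaluated as a whole, only one conditional at a time.
    tokens = list(prefix)
    for _ in range(steps):
        dist = conditional(tuple(tokens))   # {token: prob} for the next token only
        tokens.append(max(dist, key=dist.get))
    return tokens

# e.g. greedy_decode(("the", "cat"), llm_conditional, steps=2)
```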
Why it's maximal is not in the model at all, nor in the data. It's in the data-generating process, i.e., us. It is we who arranged text with these frequencies, and we did so because the phrase is a popular one for academic demonstrations (and so on).
As ever, people attribute to "the data", or worse, to "the LLM", properties it doesn't have. Rather, it replays the data to us, and we suppose the LLM must have the property that generated this data originally. Nope.
Why did the tape recorder say, "the cat sat on the mat"? What, on the tape or in the recorder made "mat" the right word? Surely, the tape must have planned the word...
>Why it's maximal is not in the model at all, nor the data
>It replays the data to us and we suppose the LLM must have the property that generates this data originally.
So to clarify, what you're saying is that under the hood, an LLM is essentially just performing a search for similar strings in its training data and regurgitating the most commonly found one?
Because that is demonstrably not what's happening. If this were 2019 and we were talking about GPT-2, it would be more understandable, but SoTA LLMs can in-context learn and translate entire languages which aren't in their training data.