I've been looking into using the last hidden layer of an off-the-shelf LLM to help my company with a classification task. The last hidden layer is obviously super rich in semantic information because it has to somehow tell the next layer how to generate the next token prediction. The final output layer, in some respects, discards valuable context information that the last hidden layer encodes.
I am not surprised at all that Meta was able to generate some positive returns by feeding the last hidden layer back into the model auto-regressively.
The method of training they describe in the paper is really cool. Summarized in Figure 2, they train it with a corpus of step-by-step text instructions and then across multiple stages, they iteratively replace one of the textual steps with a last-hidden-layer embedding and see what the model spits out. The weights are then updated through cross-entropy loss as the additional text tokens are generated once again.
So they're basically rewinding the output, replacing an increasing number of textual steps with hidden state embeddings, and playing it forward as the model gradually learns to do all of its step-by-step thinking using just the hidden state data.
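If it helps, here's roughly how I picture one of those stages in code. This is just my reading of Figure 2, sketched with made-up function and variable names and GPT-2 as the base model (which is what the paper uses); it is not the authors' actual training code:

```python
# My mental model of one training stage (a sketch, not the paper's code):
# the first n_latent reasoning steps have been removed from the text and are
# replaced by continuous thoughts; cross-entropy is computed on what remains.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
embed = model.get_input_embeddings()

def stage_loss(question_ids, remaining_ids, n_latent):
    # question_ids: (1, Tq) prompt tokens; remaining_ids: (1, Tr) the still-textual
    # reasoning steps + answer at this stage; n_latent: steps replaced by thoughts.
    inputs_embeds = embed(question_ids)                              # (1, Tq, d)
    for _ in range(n_latent):
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]                   # last hidden state, last position
        inputs_embeds = torch.cat([inputs_embeds, thought], dim=1)   # feed it back as an "embedding"
    inputs_embeds = torch.cat([inputs_embeds, embed(remaining_ids)], dim=1)
    labels = torch.full(inputs_embeds.shape[:2], -100, dtype=torch.long)  # ignore prompt + thoughts
    labels[:, -remaining_ids.shape[1]:] = remaining_ids              # score only the remaining text
    return model(inputs_embeds=inputs_embeds, labels=labels).loss
```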
In a way, this might be how humans learn to think through language. Our parents teach us using words and our brain gradually replaces the words with thoughts until we can replicate the action or solve the problem ourselves without anyone guiding us with words.
Indeed, I would not be surprised if OpenAI one day admits that the `o1` model uses the last hidden layer (or some other intermediate layer) to feed the "thought process" that you can watch as it "thinks" about the answer. I suspect that they may take the last hidden layer and feed it back into the front of the `o1` model while also feeding a separate, likely much smaller LLM that generates the "thought process" as language tokens.
In this manner, the model makes use of the rich semantic information encoded at the last hidden layer while informing the user via an extraction of that hidden layer specifically tuned to generate human-legible concepts such as, "I'm considering the impact of converting the units from kilograms to pounds," or whatever.
I don't think it does, because judging from this paper, this kind of backfeeding is apparently quite difficult to train.
I've said it before, but I think it's just something like Quiet-STaR, but simplified. They have a bunch of question answer pairs, many of which are difficult. They generate a lot of tokens from the question (let's say, 3x the length of the expected answer), summarise whatever is generated and reinforce whenever it generates the right answer.
o1 is most likely just 4o optimized for CoT with some fine-tuning, or perhaps merely with a dedicated system prompt (which is probably the reason why they don't let you access it in the API) and enforced structured output. In fact you can recreate something very similar using 4o and the right system prompt + structured outputs.
That's certainly possible, but it reminds me a bit of a similar thing I've seen in their UI that rhymes in a way that makes me think otherwise. In the code interpreter tool, you have a little preview of the "steps" it's following as it writes code. This turns out to just be the contents of the last written/streamed comment line. It's a neat UI idea I think - pretty simple and works well. I wouldn't be surprised if that's what's going on with o1 too - the thought process is structured in some way, and they take the headings or section names and just display that.
iirc this is a well-supported task, called a "classification head" instead of a "language modeling head", in case anyone else wants to do this as a fine-tuning project.
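For anyone who wants a concrete starting point, a minimal sketch using the HuggingFace sequence-classification wrapper; the model name, label count, and example text are placeholders for whatever you'd actually fine-tune:

```python
# Minimal sketch: same backbone, but with a classification head instead of the LM head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "gpt2"  # placeholder; any decoder-only checkpoint you can fine-tune
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
model.config.pad_token_id = tok.eos_token_id  # GPT-2 has no pad token by default

inputs = tok("text to classify goes here", return_tensors="pt")
logits = model(**inputs).logits  # (1, num_labels); fine-tune with the usual cross-entropy
```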
This is intriguing. When I learned that a lot of people do not have an inner monologue, I was fascinated by the fact that people can differ in such a seemingly fundamental way of being. Maybe those who have it just have a "tee" that pipes into words.
Not all people have an inner monologue when reading; when you do, it's called subvocalization. Subvocalization is when you are basically reading to yourself inside your head, sounding out each word. It is one of the most common reasons for slow reading speed. Most people do not “need” to subvocalize and can train themselves to process the visual text directly, instead of first converting it to “auditory” information.
I found this out a few years ago and I was shocked that the way I read wasn’t universal. I have since been practicing reducing/eliminating subvocalization and I am getting better, and it allows me to increase my reading speed significantly. It also serves as an excellent example of how different our internal mental processes can be, and how completely unaware we are that there could be any other way to think than our own.
Don't we all experience this from time to time? When I'm focused on solving some mathematical problems I'm not thinking in words, but in concepts. When you are thinking of words you also think of a concept; the only difference is that sometimes there are no words associated with it. In my opinion, words and sentences are just a label for the thinking process, a translation of what is really going on inside, not the driver of it.
That's true - though I think of an inner monologue as being more "self driven". Perhaps it's just that their mental voices don't spontaneously say anything.
BTW, people found that in-context instruction is useful for these (for example, directly using the last hidden layer to condition a diffusion model is much worse than an encoder-decoder model, but adding an instruction prefix like "try to imagine more details with the following text: <prompt>" enriches the last hidden layer vector enough to be superior to the encoder-decoder text features). Very interesting stuff.
"...because it has to somehow tell the next layer how to generate the next token prediction." -- This isn't actually true in the case of transformers. Features in the final TF layer at time t in a sequence do not depend on the features in the final TF layer at any other time step. Recurrence in transformers is done "depthwise" via "causally masked" convolutions. Final layer features at time t can depend on penultimate layer features at time t-1, but not on final layer features at time t-1.
you are misunderstanding what the person is saying. They are saying the final hidden layer outputs a vector which has all the information that decides the logits which decide the probabilities of each token in the entire vocabulary. Ie, it is storing a lot of information.
Correct. And although the final layer outputs a softmax of the token probabilities, the model by that point surely has a rich understanding of more than just the next token it wants to predict.
> surely has a rich understanding of more than just the next token it wants to predict
> the last hidden layer is obviously super rich in semantic information
I don't agree that this is obvious, and think it's likely wrong (see the sibling thread [1]). The model has to at some point compress down its prediction for the entire future string of text to a prediction for a single token. There's no prior reason to assume it does this mostly in the final "LM head" linear layer, and the inputs to it don't have to predict anything other than the very next token so there's no reason it should (which is what I think psb217 was getting at), but I'm not familiar with what research has been done into it. On the other hand, processing seems to typically be concentrated in the central layers.
The last hidden layer outputs a vector which is then used to predict the probabilities of every token in the vocabulary, by a single layer (and, in practice now in llama models, this layer is the transpose of the embedding layer).
That vector has a lot of information in it, it's not a debatable thing.
As noted above in parens, look at the llama 3.x models. The space is already shared in some sense. It's called "tied embedding".
> That vector has a lot of information in it, it's not a debatable thing.
Encoding the next token is the minimum possible amount of information it might contain; that's not much information (the distribution over the next token is just a projection from the embedding space). E.g. it would be useless for any classification task.
Various models in production are doing exactly that - training a layer which takes the vector out of the last hidden layer, for classification, in place of the language head. I even have one in production right now doing regression using the output of the last hidden layer....
In the case of llama 3 it's 4096 × 16 bits = 8192 bytes of information... that's like 8192 characters of ASCII. More than enough for most classification tasks... and if you just spend any time thinking about encoding the logits for a vocab of 128k... you'll come to the conclusion it's likely to require at least several hundred bytes (maybe 1000?) to do it in any way that will actually work in practice.
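For the curious, the kind of setup I mean is roughly this (a sketch; the model name and the scalar head are placeholders for the production version):

```python
# Sketch: pool the last hidden state and put a small regression head on top of it.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # placeholder for whatever backbone you actually run
tok = AutoTokenizer.from_pretrained(name)
backbone = AutoModel.from_pretrained(name)                 # hidden states only, no LM head
head = torch.nn.Linear(backbone.config.hidden_size, 1)     # scalar regression head

inputs = tok("some text to score", return_tensors="pt")
hidden = backbone(**inputs).last_hidden_state              # (1, T, hidden_size)
prediction = head(hidden[:, -1, :])                        # regress from the final position
```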
Yup, some tokens are effectively branching decisions. Yann has a whole rant about a shortcoming of LLMs being they take the same compute regardless of the position in a sentence - which isn't great because sometimes you really have a serious decision to make, other times not so much. It also makes you wonder about optimal embedding size - maybe the right size is 10x bigger.
Think of it like this: the final softmax layer is like being forced to pick a single word as your next prediction, while the hidden layer contains all the reasoning and understanding that led to that decision. It's similar to how a human might have a complex thought but needs to reduce it to a single word when speaking.
Many existing applications make use of hidden layers in a transformer to perform useful tasks such as classification. The concept of an “embedding” is simply the output of a hidden layer, after all.
We conducted similar research earlier and successfully improved performance to a level comparable to models with 3x larger layer sizes. https://arxiv.org/html/2409.14199v3 We utilize more computational time in the latent space to achieve better performance. However, this approach introduces greater resistance compared to Chain of Thought (CoT) reasoning in the token space, especially if the number of CoT rounds in the latent space exceeds 20.
I would use the term "better approximation of the data distribution" instead of "reasoning" to describe this kind of process.
I think so. I believe this type of reasoning method, which achieves better results through longer computation time, is very useful on edge devices like mobile phones. Consider a scenario where we only need the model to output a function/action call on the phone; we don't require it to provide an immediate response.
I think of an LLM model as like a crystallised mathematical snapshot of intelligence... like a cell on a microscope slide, a dead and mounted form of output from the living process of intelligence...
This paper makes me wonder whether, in a very fuzzy sense, we could give #LLMs access to some similarly crystallised analog of emotion or emotional valence, below the level of language
"Intelligence" is a continuous process. Without a continuous feedback loop, LLMs will never be more than a compression algorithm we bullied into being a chatbot.
OpenAI as a mega-organism might be intelligent, but the LLMs definitely are not.
The "compressed capture of semantic relationships" is a new thing we don't have a word for.
Funnily enough, there is a mathematical link between data compression and AGI [1]. I believe a paper circulated some time ago that compared gpt2 to gzip, with interesting results.
Would you say with equal confidence that they don't exemplify their intelligence by their ability to repeatedly select an often-successful next action from a set of possible next actions, based on a set of input observations?
It still doesn’t make sense for dogs. It might make some sense given a higher-level goal (hiding a toy under the bed)[1] but it doesn’t make much sense for selecting the goals (“I should hide this toy because the other dog keeps stealing it”). In building an AI dog it doesn’t work to elevate these higher-level goals into individual tokens because real dogs form goals dynamically according to their environment and the set is infinitely large. (Note that LLM agents also badly struggle with this; generating goals token-by-token means their goals have hallucinations.)
[1] It still doesn’t make much sense to view this as a statistical process; dogs can generalize far better than transformers, as perhaps best seen with seeing-eye dogs. I believe dogs’ powers of causal reasoning exceed what is possible from mere surface statistics: e.g. they innately understand object permanence as puppies, whereas transformers still don’t understand it after viewing thousands of dogs’ lifetimes of experience.
I've not been able to find any way to distinguish "mere surface statistics" from the deeper, richer, and more meaningful kind of thing it is meant to be contrasted with, except that "surface statistics" are un-compressed. For example, surface statistics might be the set of output measurements generated by a compact process, such as the positions of planets over time; knowing the laws of gravity means we can generate gigabytes of these statistics correctly and easily, which will accurately match future observations.
But then going the other way, from statistics to a causal model, is just an inverse problem -- just like, say, going from a set of noisy magnetic field measurements at the boundary of a container to a pattern of electric current flow inside a volume, or going from planet positions to orbit shapes and periods to an inverse square law of gravity. Generating a compressed inverse model from surface statistics is exactly the sort of thing that deep learning has proven to be very good at. And by now we've seen no shortage of evidence that LLMs and other deep networks contain stateful world models, which is exactly what you'd expect, because for all their parameters, they aren't nearly big enough to contain an infinitesimal fraction of the statistics they were trained on.
So I think it's overly dismissive to regard LLMs as mere surface statistics.
> So I think it's overly dismissive to regard LLMs as mere surface statistics.
It's literally what they are though.
Yes those probabilities embed human knowledge but that doesn't mean that the LLM itself is intelligent. It's why every LLM today fails at anything that isn't centred around rote learning.
It's what they input and output, but it's not literally what they are. The only way to squeeze that many statistics into a compact model is to curve-fit an approximation of the generating process itself. While it fits stochastic sequences (of any type, but usually text), it's conceptually no different from any other ML model. It's no more surface statistics than a deep neural network trained for machine vision would be.
Meh. I'm sometimes curious about the different conversations that are possible in different places, I guess? One sometimes hears from different ppl, but maybe wants cross-talk.
Seemed easy, and I thought harmless, tho maybe not
Was it just me who thought that this was _already_ how LLMs worked? I'd always assumed they were -- so to speak -- swimming in their own embeddings space before coming out on the other side with language. But it turns out they're just feeding their own incremental outputs back into themselves, without a memory of the path they took to get there. Yowzer!
There isn't any memory of how it got to where it did because all weights are evaluated all the time. It got there through the entirety of the network. There is no logic, just (mostly) a bunch of multiply-accumulates.
I like the direction of the research of working in latent space but feeding the last layer representation back as a first layer embedding feels sketchy to me. Those layers have different representation space.
> Those layers have different representation space.
Do they? Interpretability techniques like the Logit Lens [1] wouldn't work if this were the case. That author found that at least for GPT-2, the network almost immediately transforms its hidden state into a "logitable" form: you can unproject the hidden state of any layer to see how that layer incrementally refines the next token prediction.
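The whole trick fits in a few lines; a rough logit-lens sketch on GPT-2, re-applying the final layer norm to every layer's hidden state (the usual approximation):

```python
# Rough logit-lens sketch: unproject each layer's hidden state through the final
# layer norm and the LM head to watch the next-token prediction sharpen with depth.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    hs = model(ids, output_hidden_states=True).hidden_states  # embeddings + every layer
    for depth, h in enumerate(hs):
        logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
        print(depth, tok.decode(logits.argmax(-1)))            # each layer's "best guess"
```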
Feeding the last layer back as the input embedding has been done many times, e.g. Transformer-XL. The models are trained like this, it's not like they're taking a pre-trained Llama and just feeding it to itself. It's a simple, computationally cheap mechanism to add feedback.
I read a paper not long ago that showed that deleting, duplicating and reordering layers doesn't actually seem to matter that much, and feeding back is just a kind of re-ordering.
Imo this kind of makes sense - LLMs without a feedback loop can learn to have one themselves by encoding information in the previously generated tokens.
from my understanding that is what they do, see the paper:
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments.
I agree the feedback is necessary, and the mechanism simple and cheap, but I don't think it is optimal.
Yes, they use a pre-trained model, but they do further training (please correct me if I mis-read, and also I realize my above comment could be interpreted as saying they train a new model entirely from scratch).
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to 1 × 10⁻⁴ while the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when the training stages switch.
This was my first thought too. AFAIK each layer encodes different information, and it's not clear that the last layer would be able to communicate well with the first layer without substantial retraining.
Like in a CNN for instance, if you fed later representations back into the first kernels they wouldn't be able to find anything meaningful because it's not the image anymore, it's some latent representation of the image that the early kernels aren't trained on.
The point is that training regime can force the network to immediately reshape the representation layer (after inputs) depending on whether it is a thought or language context.
Not really. See the literature on sharing lm_head (last matrix multiplication) with the input embedding dict.
Basically, the lm_head (an MxN matrix where M is the dictionary size and N is the internal dimension) can be seen as the dictionary too. You can think of it, plus the softmax over it, as computing cosine similarity of the last hidden output w.r.t. the input embedding dictionary.
In that sense, they are sharing the representation space.
(BTW, I believe sharing lm_head with the input embedding doesn't work as well as separating them, so only mobile-focused LLMs do so. So there is that. It would be interesting to experiment with whether injecting a projection layer like you suggested would improve performance or is just a red herring.)
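To make the "lm_head is the dictionary" point concrete: with GPT-2, which ties the two, the logits really are just dot products between the last hidden state and the rows of the input embedding matrix (cosine similarity up to normalization). A small sketch:

```python
# Sketch of the tied-embedding view: logits = last hidden state dotted with every
# input embedding row, because the lm_head and the embedding table share one matrix.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
assert model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr()  # tied

ids = tok("tied embeddings share one space", return_tensors="pt").input_ids
with torch.no_grad():
    h = model(ids, output_hidden_states=True).hidden_states[-1][:, -1, :]  # (1, d)
    logits = h @ model.transformer.wte.weight.T  # identical to what the lm_head computes
```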
Unless you can show an example of human reasoning solving a problem outside the Turing computable set, there is no rational basis for assuming the brain is anything but a computer, as the very notion that we exceed Turing computability would be revolutionary and utterly mindbending in terms of consequences for a number of fields.
there is no rational basis for assuming the brain is a "computer" in the same way an intel x86 chip is a "computer" or that the universe is a "computer". Using language in this way without defining terms like what even is a computer is folly.
There is no rational basis for assuming it is not, as we have not a single example of a computable function outside the Turing computable set.
The term "computer" has it's original outside of "electronic computer". It used to be a role, a job function. There has been no time in human history where the only computers have been electronic computers.
But, sure, let's be more precise: Any Turing complete system is equivalent to any Turing complete computer and can reasonably be called a computer, but let's also limit it to any system that can not compute functions outside the Turing computable set. We don't know of any such systems that have been shown to compute functions outside the Turing computable set, at all, including brains.
The rational basis for assuming the brain is a computer is that we have not a single shred of evidence that exceeding Turing computability is possible, nor any theory for how to even express a function that is computable for humans but not Turing computable.
If you can find one single such example, there'd be a rational basis for saying the brain isn't a computer. As it stands now, assuming it isn't, is nothing more than blind faith.
The reason a lot of people are unhappy about this notion is that it doesn't really matter: Any Turing complete system can emulate any other Turing complete system, and an LLM can trivially be made to execute a Turing machine if you put a loop around it, which means that unless you can find evidence humans exceed Turing computability AGI is "just" a question of scaling and training.
It could still turn out to be intractable without a better architecture, but the notion that it might not be impossible makes a lot of people very upset, and the only way it can be impossible even for just an LLM with a loop bolted on is if human brains can compute functions outside the Turing computable set.
"Llm thinks" is false advertising. (Maybe useful jargon, but still)
> Any Turing complete system can emulate any other Turing complete system, and an LLM can trivially be made to execute a Turing machine if you put a loop around it
Wouldn't it be more efficient to erase the LLM and use the underlying hardware as the Turing complete system?
BTW, the Turing test is just an admission that we have no way of defining human-level intelligence apart from "you'll know it when you see it".
I agree with you. "Chain of thought" is not reasoning, just like LSD trip isn't.
I think we lack a good formal definition of what (fuzzy) reasoning is. Without it, we will always have some kind of unexplained hallucinations.
I also believe AGI could be implemented as a model that can train models for specific tasks completely autonomously. But that would kill the cash cow, so OpenAI etc. are not interested in developing it.
Yes: for e.g. BPE, due to how it progressively pushes compound tokens of already-seen (hence more common) subtokens to the ‘top’ of the vocab, you can train a model to do regression over the vocabulary index of the next token from the current token embedding, using the same single regression model at all layer depths. If you plot the MSE of token-index prediction versus layer depth, you can see that the MSE of the prediction decreases steadily with each additional layer. This appears to be because the token index in e.g. BPE is actually fairly smooth, so it seems like the model is capable of localizing to the actual correct vocab index as depth increases, kind of like a fuzzy-to-discrete refinement as you go deeper in the layers. https://arxiv.org/abs/2408.13442
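If anyone wants the flavor of that probe in code, a sketch of it (toy input, no training loop; the interesting part is that the same linear probe is applied at every depth):

```python
# Sketch of the probe: one shared linear regressor from hidden state to the *vocab
# index* of the next token, evaluated at every layer depth (plot MSE vs. depth).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
probe = torch.nn.Linear(model.config.hidden_size, 1)        # same probe at all depths

ids = tok("a short sentence to probe", return_tensors="pt").input_ids
hs = model(ids, output_hidden_states=True).hidden_states
targets = ids[:, 1:].unsqueeze(-1).float()                  # next-token vocab indices
for depth, h in enumerate(hs[:-1]):
    pred = probe(h[:, :-1, :])                              # predict the next token's index
    mse = torch.nn.functional.mse_loss(pred, targets)       # with a trained probe, this
    print(depth, mse.item())                                # reportedly drops with depth
```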
It's a good, understandable paper. The main issue with chain-of-thought (which I think is a solid approach, and one that needs to take place) is that we ourselves aren't necessarily 'trained' on chain-of-thought. Yes, we do learn mathematical proofs and reasoning at some point (usually), but most people settle on latent thinking without training, and switch between the two modes naturally. My intuition says we're missing something, but who knows
Perhaps these findings might be indicating that we need more NN layers/attention blocks for performing reasoning. This project circumvented the lack of more trained layers by looping the input through currently trained layers more than once.
Also, if the objective is reasoning, we may have to look for better loss functions for training the models than ones that help us predict the next token.
"We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought")."
Could someone explain the last hidden state of the LLM? What its shape is and how it is normally used, and why it hasn't been used yet to augment the next input? (Which seems logical.)
The last hidden state is just the output embedding after N residual layers, e.g. input embedding + res1 + res2 + ...
There's typically an "unembedding layer"/"classification head" that uses this hidden state to produce a softmax distribution over the LLM's vocabulary. In this case, we can think of this as "snapping" the hidden state into a single token and feeding that token into the next position of the autoregressive LLM.
In this sense, the last hidden state _does_ augment the next input. The authors simply propose directly feeding this hidden state into the next step rather than reducing it into a single token—thus, reasoning in continuous latent space rather than discrete token space.
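In code, the contrast between the two feedback paths looks something like this (a toy greedy-decoding sketch; in the paper the model is additionally trained to accept the continuous version):

```python
# Toy contrast: (a) snap the hidden state to one token and feed that token back,
# vs. (b) Coconut-style, feed the last hidden state itself back as the next input.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
embed = model.get_input_embeddings()

x = embed(tok("2 + 2 =", return_tensors="pt").input_ids)
out = model(inputs_embeds=x, output_hidden_states=True)

next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)      # (a) discretize ("snap")
x_token = torch.cat([x, embed(next_id)], dim=1)              #     then re-embed the token

thought = out.hidden_states[-1][:, -1:, :]                   # (b) keep the full vector
x_latent = torch.cat([x, thought], dim=1)                    #     and feed it straight back
```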
Moreover “snapping” the hidden state to a token is akin to quantization. It’s lossy. By staying in latent space the model can “reason” at “full resolution” without discretization noise.
Sometimes discretization introduces interesting behavior though. Compare for example the logistic map and its chaotic regime with the simplicity of the logistic ODE. Another example would be quantum mechanics compared to classical mechanics and determinism. The Poincaré Conjecture was only interesting for n=3 due to too much connectivity in higher dimensions. Wouldn't it be interesting if consciousness only arose in such a discretized form, a case of incidental complexity and chaos introduced as the result of topological non-triviality from quantization?
Don't forget, non-linearity is fundamental to the whole process, otherwise you'd just have one large linear transformation. Maybe there's a similar role for discretization? :shrug:
Useful information about conceptual relationships and procedure can be captured in the LM head, so there is also potential lossiness when short-circuiting it.
Embeddings, aka the last hidden state, are the mathematical representation of an input to the model before a separate model (usually the decoder) translates that hidden state into a next token (the generative part in generative AI). Normally, this step repeats over and over. This novel approach introduces re-using the last hidden state as if it were a token that had been generated, thus "evolving" the hidden state over each iteration.
The way the recurrence in this method works -- ie, using the last LLM hidden state at the previous time step as the input token for the next time step -- isn't directly compatible with how recurrence/autoregression is typically handled during LLM training. One of the major strengths of transformers is that they can be trained for recurrence/autoregression (which have sequential dependency) using convolutions (which are embarrassingly parallel). The proposed method requires introducing some sequential dependencies during training that could otherwise be avoided using "causal masking" and convolutions to enforce the correct dependencies between time steps in a sequence. Introducing these sequential dependencies makes training a lot slower.
tldr; the method requires training in a way that loses one of the major benefits of transformers, but maybe in some scenarios that loss is worth it.
I wonder if you would want to use an earlier layer as opposed to the penultimate layer, I would imagine that the LLM uses that layer to "prepare" for the final dimensionality reduction to clean the signal such that it scores well on the loss function.
Seems exactly like what you want. We don't think in plain English, we _rationalize_ our thoughts into English (or whatever language comes out) but they must be more fundamental than language because language is acquired.
Essentially, English is one of many possible encodings of an underlying intuitive, possibly non-symbolic representation.
That's debatable. Language shapes thought much more than you might think, because you learn concepts from language that you could not imagine by yourself until you learned/read about them, so they are in effect very linked to language.
I can also think in images and internal visualizations. Geometric reasoning is also a thing. Musicians can also hear things in their mind - some can write it down, others can play it directly, and in my case I'm not good enough to get it out of my head!
In all cases though these thoughts are kind of tied to representations from the real world. Sort of like other languages via different senses. So yeah, how abstract can our thoughts actually be?
But the thing you learn is not the word 'purple'. You just use the word as the mental scaffolding to build a concept of purple. The word forms a linkage to a deeper embedding, which is further proven by the fact that it's actually slightly different in each mind that has understanding of the concept.
This embedded concept is what is doing the work, the word was just the seed of the understanding and a method by which to convey that understanding to others.
Language is definitely a significant part of thinking, but when I remember how cold it was outside yesterday to figure out if it was colder than today, I'm not bringing words to mind. I'm bringing up some other non-discrete information that I could never precisely encode into words and then factoring that in with the other non-discrete information I'm currently taking in through my senses. It's only after that processing that I encode it as a lossy "It was colder yesterday" statement.
For example, I can think in formal logic. I've learned to do that, and surely my brain takes a step-by-step approach to it, but I've also internalized some of it and I don't think that my proficiency with English has anything to do with it.
I could have learned the same concepts in any other language, but the end result would be the same.
And surely there are many thoughts that can't be expressed purely with words. For example all that is related to qualia. You can think of a color but you can't describe what you see in your mind's eye with words, not in a way that would let a blind person share the same experience. Or try describing "love" without making a similitude. Is love a thought? Or a feeling? Is there a meaningful difference between the two?
> The hypothesis is in dispute, with many different variations throughout its history.[2] The strong hypothesis of linguistic relativity, now referred to as linguistic determinism, is that language determines thought and that linguistic categories limit and restrict cognitive categories. This was a claim by some earlier linguists pre-World War II;[3] since then it has fallen out of acceptance by contemporary linguists.
Eh, probably both. Why does it have to be a fight between two schools of thought? Thought can be cross-modal. Some of it can be done in a specific language and some could be visual.
(universal grammar people hate this somehow, it's weird)
In section 2 they briefly mention studies such as [1] that point out that the token outputs of a chain of thought aren't always entirely faithful to the responses of the models.
I'm not sure whether it wouldn't be more reliable to let the model run on latents and try to train a separate latent-reading explainer module that has at least some approximation of what we want as an explicit optimization objective.
Assuming it actually is or has the potential to be better than CoT, from what I gathered from the paper the current results are mostly just more efficient token-wise.
I was thinking about safety reasons, but also usability. Seems like a pretty big difference to me if you don't understand the chain of thought. How faithful CoTs are is another question.
> Experiments show that Coconut can effectively augment the LLM on several reasoning tasks.
It really seems like we are building a true intelligence, adding components to different parts of a "brain" until we have something rivalling the human mind. It's exceptionally dangerous and it's remarkable how researchers turn a blind eye to any possible consequences.
Just bear in mind that while they are yelling "shut it down" there will be a bunch of commenters with no idea what's happening saying that they are just overreacting.
Agree. Someone should make sure the next ASI develops an extension to hide the comments in every AI thread that's 80% full of the brightest minds saying "I tried to build a React app and it totally failed doing it the way I wanted."