Training language models with pause tokens

og_kalu · on Oct 4, 2023

This seems somewhat related to https://arxiv.org/abs/2309.16588. Some Researchers found that Vision Transformers would repurpose low-informative background patches of images for internal computations. They gave them an explicit token for this to better results!

HammadB · on Oct 4, 2023

Yeah, was thinking the same thing. Something deeper to be explored here esp due to the consistency across domains. I wonder how far we can push "externalizing" higher level information the model wants to store while in the forward pass.

mkaic · on Oct 4, 2023

You beat me to it, I was just about to post this here haha. It's fascinating to me that these models can take advantage of "scratch paper" this way!

bradfox2 · on Oct 5, 2023

This likely has a tangent to quantization research from Dettmers et al that corresponds to outlier weight values in models >6B params. Might allude to why the outliers emerge.

ZeroCool2u · on Oct 4, 2023

This idea seems to track with the notion we saw early on in the use of ChatGPT where folks suggested that answers with more 'filler' language seemed to produce higher quality responses. The 'pause' token seems like a better way to do this, but it's still pretty bizarre how well it works.

suddenlybananas · on Oct 4, 2023

It's not that shocking, each token outputted is a fixed 'unit' of computation and so there will necessarily be things which can only be handled if there is more computation. A pause token is one way of allowing the model to spend more time to "consider" the answer.

explaininjs · on Oct 4, 2023

There seems to already exist a speed up while emitting common strings and a noticeable pause when the next token is more ambiguous. I had attributed it to the model being able to short circuit execution when all probabilities were accumulating in the same bucket.

Tiberium · on Oct 4, 2023

If you're talking inference speed, it's probably just because simple words are a single token, but complex words or code are much more tokens so it'll output them "slower" (while in reality each token will take the same time)

explaininjs · on Oct 4, 2023

No, there are significant differences in the per-token time that seem directly correlated with “perplexity” (not necessarily in the exact definition, but so to speak).

explaininjs · on Oct 4, 2023

Who the heck downvoted this. I’ve literally collected this data myself from my production AI applications. But it doesn’t gel with people’s preconceived notions so they downvote ?

zaptrem · on Oct 4, 2023

This might be a side effect of speculative decoding.

astrange · on Oct 4, 2023

Models themselves are a fixed amount of computation, but they could be doing more advanced decoding that would run through it multiple times to try to get more truthful answers.

SergeAx · on Oct 5, 2023

Model does not "consider" answer, it inferes it from training data. If there are "pause" tokens in trainig data, they will be present in inference data with similar density. In other words, "garbage in - garbage out".

airstrike · on Oct 4, 2023

I've gotten good results from asking ChatGPT to "check its own work", although on occasion that seems to throw it into a loop and they can't ever get out of it if it doesn't get the right answer, particularly with quantitative answers that it can sanity check itself

SergeAx · on Oct 5, 2023

Model doesn't "know" "the right" answer. It is statistically guessing, like a bad student with good memory.

airstrike · on Oct 5, 2023

Right, but through sufficient statistical guessing, a capacity to solve problems emerges from its ability to use language, which functions well enough as a proxy for some values of "reasoning", for some set of problems, with varying degrees of accuracy.

circuit10 · on Oct 4, 2023

Or ask it to work step by step

kridsdale3 · on Oct 4, 2023

This prompt was a key step forward in the field of human pedagogy as well.

silveraxe93 · on Oct 4, 2023

Damn that's cool. Can we implement a version of System 1 / System 2 [1] thinking with this?

From the future work section:

> better determining the number of <pause> tokens (perhaps using model confidence)

If the number of pause are learned, using model confidence + regularisation (so that the model doesn't always use the maximum number of pauses), then we effectively have the System 1/2 switch. If it's a task the model has seen tons of times before, it just goes with the first inference pass. If it's low confidence, then it keeps expending more inference passes until it reaches an acceptable confidence threshold.

- [1] https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

striking · on Oct 4, 2023

Would that be in an effort to more accurately model human thinking? If so, I hate to be the one to tell you that

> Readers of “Thinking: Fast and Slow” should read the book as a subjective account by an eminent psychologists, rather than an objective summary of scientific evidence

https://replicationindex.com/2020/12/30/a-meta-scientific-pe...

silveraxe93 · on Oct 4, 2023

I actually read that one before! Kahneman is overconfident, as we all were before the replication crisis.

While most of the psychology field is crumbling and filled with bullshit (even though most people didn't catch on yet). There's still _some_ truth lying in there.

It might look bad in isolation, but Kahneman's work is one of the few that actually holds up to scrutiny.

This post by Scott Alexander is pretty good in summarising the few good bits left. https://www.astralcodexten.com/p/heres-why-automaticity-is-r...

---

But regarding the initial point. The goal is not to fully emulate human thinking. It might sound wishy-washy, but there's _obviously_ _some_ truth to the system 1/2 model. It's not perfect! But I think it's useful.

We as humans do most decisions without thinking (hard), but we have a way to _switch_ into a more reliable but expensive mode. We see some evidence that artificially inducing LLMs to 'think' more improves their output.

So it stands to reason that adding this capability to a model would make it better. And what better way to do it than swallowing the bitter pill and have that decision be made by the model itself, on a case-by-case basis by learning it from data.

airstrike · on Oct 4, 2023

Just don't give them a bicameral mind and we should be safe.

olalonde · on Oct 4, 2023

I wonder if we will eventually end up with "internal monologue" tokens that are not meant to be part of the output but just perform computation, a bit the way humans think.

thomasahle · on Oct 4, 2023

Lots of people are trying to make this work right now.

stavros · on Oct 4, 2023

Not all humans think like that, though.

x-complexity · on Oct 5, 2023

> Not all humans think like that, though.

Yes, not all people think like that. But that also means that there are people who do think like that, based on the application of the pigeonhole principle on your own statement.

All avenues for internal computation should be done to improve model accuracy & results, including this method.

optimalsolver · on Oct 4, 2023

I find that hard to believe.

olalonde · on Oct 4, 2023

So do I. I think it has more to do with people choosing different words and metaphors to describe their subjective experience.

astrange · on Oct 4, 2023

I'm not sure where to find it, but there's an interesting interview where a researcher asks a fire chief how he makes decisions under pressure, and the chief claims he never makes any decisions.

drakenot · on Oct 5, 2023

"Gary Klein studied how firefighters make decisions in the workplace. To his surprise, the firefighters said that they didn’t make decisions at all, as they didn’t actively choose between any options. They simply acted and reacted based on their previous experiences without trying to come up with different courses of action. From this study, Klein developed a model, called the “recognition primed decision-making model,” which he claims is used in 90% of all situations, not just in a crisis.
Klein’s model consists of two steps:
1. Recognize the course of action that makes the most sense

2. Imagine how this would look in reality"

astrange · on Oct 5, 2023

That's it. Using "in the workplace" to describe a house fire is almost weirder than saying you never make decisions.

stavros · on Oct 4, 2023

I don't have an internal monologue.

Loeffelmann · on Oct 4, 2023

What do you imagine a inner monologue to be like?

stavros · on Oct 4, 2023

A voice in my head that narrates thoughts.

red75prime · on Oct 5, 2023

Let's say you've heard someone say "A voice in my head narrates thoughts". What happens next inside your head?

I, for example, may produce something like the following sequence of internally represented words and non-words "A voice narrates... [imagines what it can feel like] Does it feel external to [non-verbal reference to conversation partner]? It doesn't work like that [implied reference to myself] [pause for introspection] I produce words. It's not like someone else narrates thoughts in my head. [stop producing words to check that] I just think, and one thing leads to another and that's why it's hard to stop the sequence of the words being produced by me and it's not because I can't stop some intrusive voice inside my head" and so on and so forth.

Phonetic realization of the words isn't important if I don't specifically concentrate on it.

wilg · on Oct 4, 2023

Are your thoughts or ideas ever in language before you say or write them?

I think this is the same misunderstanding as “the mind’s eye” https://x.com/stuartjritchie/status/1708871999924179024?s=46

stavros · on Oct 4, 2023

I can narrate them if I want to, but I have to choose to. I don't have the narrator constantly in my head, my default state is "no narrator".

wilg · on Oct 4, 2023

My guess is that’s how everyone works. But we’ll never really know. It’s all analogies like wine tasting notes.

stavros · on Oct 4, 2023

I know plenty of people who constantly have narrated thoughts and find it very hard to stop them.

wilg · on Oct 4, 2023

I believe they are just explaining the same thing with a different analogy. Can you stop thinking thoughts? “Narration” is a metaphor.

stavros · on Oct 4, 2023

They're pretty clear that their thoughts are in language, and mine are not. I'm not sure how those can be the same thing. I can stop thinking thoughts, yes.

olalonde · on Oct 4, 2023

I agree with wilg. For example, what you are describing seems similar to my own subjective experience and yet, I've always thought of myself as having an inner monologue. To me, an inner monologue is just about reasoning in my head with the help of language. You are capable of formulating sentences in your head? Solving riddles? You can articulate how you reached a certain conclusion? etc. Surely not every word that comes out of your mouth comes as a surprise? I don't think having an internal monologue necessarily means having a constant conversation with yourself.

stavros · on Oct 4, 2023

An inner monologue is something specific, and refers to thoughts in the form of language:

https://en.m.wikipedia.org/wiki/Intrapersonal_communication

wilg · on Oct 5, 2023

As I originally mentioned, I understand what you are talking about, but I just don't agree it's a real distinction. I believe people who claim to have an inner monologue and people who claim not to are having more or less the same types of internal experiences, they just describe them differently. As the Wikipedia article you linked to discusses, this sort of thing is difficult or impossible to study because it is a subjective experience within your mind. Since most of the research angles involve self-reported studies, it's highly likely that these differences are illusions created by trying to explain mental experiences with language that cannot capture it.

From the tweet I linked earlier:

> This is just an issue of these things being extremely hard to explain using language, and people using words in subtly different ways.

kaibee · on Oct 4, 2023

It would be really funny if the whole "Go step by step and think it through" trick only worked because it was adding more tokens for the AI to reuse.

og_kalu · on Oct 4, 2023

I'm sure that's part of it but also in the same way out of distribution tokens can derail its output, certain tokens will guide it.

In the paper, the model doesn't really learn to use padding tokens effectively till after training so it's the only thing.

3abiton · on Oct 4, 2023

Is this the same as pause tokens?

justinzollars · on Oct 4, 2023

is this your prompt? "Go step by step and think it through".

joshka · on Oct 7, 2023

This seems like a slight variation of "Let’s think step by step" [1] or "Let’s work this out in a step by step way to be sure we have the right answer" [2]. There's a good review of a bunch of these in https://arxiv.org/abs/2305.02897

1: https://arxiv.org/abs/2205.11916

2: https://openreview.net/forum?id=92gvk82DE-

danbruc · on Oct 4, 2023

It does. Those models are glorified look-up tables, output = answers[input]. It would of course be insane to try to store this mapping explicitly but it is still there implicitly in the weight of the network. But no matter what, it is just feed-forward, the system will always be limited when it comes to learning things that require iteration. This can be overcome to some extend by doing something like chain of thought that allows dumping some state into the output stream that will then be fed back through the autoregressive loop wrapping the network. Pause tokens are more or less the same thing just that they do not pollute the output with the internal thoughts of the network.

og_kalu · on Oct 4, 2023

Filler tokens don't work anywhere near as well on GPT-4 as CoT so that's obviously not the only reason. https://www.lesswrong.com/posts/oSZ2xTxEMZh9f3Yaz/llms-are-m...

danbruc · on Oct 4, 2023

I did not read the entire linked article but

For instance, I tested whether asking a model to produce "a b c d e" before solving a math problem improves performance.

is of course not what you would expect to work. It is not about dumping meaningless stuff into the input but intermediate results. If you have to add two long numbers but are only capable to add single digits, you want to dump the carry into the output and get this fed back as input so that you can take it into account when adding up the next two digits. Dumping unrelated stuff into the output will of course not help. Chain of thought does exactly this, it solves part of the overall problem and feeds that back for consideration when tackling the next piece of the problem.

og_kalu · on Oct 4, 2023

Yes and I'm saying even that doesn't work as robustly as CoT across different tasks. CoT is so obviously not just about padding for extra compute.

Certain tokens will guide it to more accurate responses just as much as others derail it out of distribution. This should be obvious.

Literally this paper, you can see the model only really learns how to use the padding tokens effectively after training.

danbruc · on Oct 4, 2023

Yes and I'm saying even that doesn't work as robustly as CoT across different tasks.

What does not work?

CoT is so obviously not just about padding for extra compute.

In a certain way it is but the crucial point is that an autoregressive neural network is pretty much a recurrent neural network and chain of thought encourages the use of the feedback path.

og_kalu · on Oct 4, 2023

>What does not work?

I'm saying that even if you do what you suggest, it simply doesn't work that well. If you read the paper, you'd see extra computation from padding is not used effectively till after training.

Yes, extra compute helps with CoT but it's not even the reason it works. I believe CoT works moreso because it nudges the computation in the right(er) direction.

danbruc · on Oct 4, 2023

List the digit sum mod 7 of the first 10 primes larger than 53.

This starts with 0 0 6 1. What do you have to do to generate the next number? Count the numbers in the output to figure out that you have done 4 numbers, add 1 because the 5th prime after 53 is next, figure out that this prime is 73, calculate the digit sum 10 and then find the remainder 3. There is just no way that a neural network has learned to do this.

But ask for this as a table with prime, digit sum and digit sum mod 7 and it becomes trivial. First prime after 53? Output 59. Digit sum? Output 14. Mod 7? Output 0. Next prime after 59? Output 61. Digit sum? Output 7. Mod 7? Output 0. Next prime after 61? Output 67. Digit sum? Output...

This is what chain of thought does, it allows the neural network to only do simple tasks it knows how to do, output the result and continue from there with the next step. You would keep those intermediate results in your short-term memory but a large language model has none and so the best thing it can do is to dump them into the output and read them back when generating the next token.

The proposed pause token has essentially the same goal, the neural network can dump some state into the output to work on it but this output is not considered part of the response. It is essentially an attempt to hide the chain of thought process and only respond with the final answer. The hard part will of course be to train the neural network to make efficient use of this, to actually output useful intermediate results before the pause is over and the response gets extracted.

Nevermark · on Oct 4, 2023

I propose four training stages:

1. Train without pauses for fast response

2. Train with variable numbers of inserted pauses, so the model can take more time (computation) to improve its performance.

Where pauses are represented both by pause tokens, and a new pause countdown element.

I.e:

Sequence:

  …, token, pause, pause, pause, token, …

Pause element:

  …, 0, 3, 2, 1, 0, …

3. Train the model model to generate its own next pause value (insert its own pauses), to optimize a meta-performance tradeoff function that ranges from the regular performance measure (independent of delay) vs. delay.

The tradeoff is based on an “urgency” value between 0 and 1, that adjusts the tradeoff meta-performance weighting between just delay-independent accuracy vs. just speed.

The urgency is used at each step to calculate the tradeoff meta-performance, and as a new urgency input element so the model knows what tradeoff level it is supposed to optimize.

4. Train the model to generate its own urgency values, based on examples of prompts which either communicate urgency based on request or context.

  “Quick, give me an estimated value for …”

  “The reactor is about to explode, what are the shutdown codes?”

Note, that the model generated urgency value and the next urgency input don’t have to be the same.

The generated urgency could be pushed higher or lower algorithmically, to get the next urgency input, if the model is being used in a context that has its own urgency relevant information.

—-

So many good ideas. It’s clear models are going to improve quickly.

Next step beyond a model that can leverage pauses for better performance, and a model that controls its own pauses, is a model that can be interrupted.

Interruptions could be urgency related, cancellations, modifications, or irrelevant:

  “Hurry up”

  “Forget that, do this other thing.”

  “Oh, and the answer should be odd.”

  “While you think, I am enjoying a nice cup of tea.”

iNic · on Oct 4, 2023

Reminds me of PonderNet [1], which dynamically changed how often the network did a calculation.

[1]: https://arxiv.org/abs/2107.05407

toxik · on Oct 4, 2023

This isn’t exactly news. The question is if it’s better use of FLOPs than just having a larger network or embedding.

og_kalu · on Oct 4, 2023

Each token outputted is a fixed 'unit' of computation. There will always be predictions that could use more computation and therefore need more tokens, regardless of how large the model gets. It just makes sense to add this option.

toxik · on Oct 4, 2023

I think I don’t understand what you are trying to say. Yes, FLOPs per token is the number we care about. Adding ~10x FLOPs and saying “look it’s better” seems odd.

What’s worse, memory use is quadratic in sequence length so adding 10x tokens is just not very smart.

og_kalu · on Oct 4, 2023

>I think I don’t understand what you are trying to say. Yes, FLOPs per token is the number we care about. Adding ~10x FLOPs and saying “look it’s better” seems odd.

You don't have to use 10. That's just what they found optimal for a particular benchmark. For others it was lower.

>What’s worse, memory use is quadratic in sequence length so adding 10x tokens is just not very smart.

Everyone uses flash attention these days. Memory is no longer quadratic in scaling.

toxik · on Oct 5, 2023

You still haven’t answered my one question: how is this different than just increasing embedding size? I can even think of a sketch proof to show this equivalence for single heads.

The paper should have answered this question and well. It is an obvious question.

WRT FlashAttention, surely you understand that the matrix-matrix multiplication must still happen even if it’s a fused kernel that doesn’t need to hold several intermediary variables in memory.

og_kalu · on Oct 5, 2023

Increasing embedding size doesn't make the model allocate arbitrary extra compute only when it needs to. The extra compute from pause tokens is situationary, arising only when necessary and in amounts necessary. Increasing embedding size would be the wasteful option here, not pause tokens.

>WRT FlashAttention, surely you understand that the matrix-matrix multiplication must still happen

Of course it happens. The point is that it doesn't scales linearly now with respect to memory.

toxik · on Oct 5, 2023

This is simply not true, the pause token insertion is not situation dependent. If the model could say "please run me again", that would be groundbreaking and it has been looked into. The problem is how do you train such a model? It would have to be reinforcement learning or something. Your set of allowed target sequences is infinite.

I don't understand. You have an S by S matrix. You increase S, the matrix grows quadratically. There is no way around this.

PartiallyTyped · on Oct 4, 2023

Seems related to “infinite context” by adding attention sinks. This seems to be the same mechanism at play.

https://arxiv.org/abs/2309.17453.pdf

intalentive · on Oct 5, 2023

An attention layer is a kind of meta neural network whose weights depend on input. If you think of each token as a neuron and the attention matrix as defining the weights, then extra tokens simply mean a wider network.

SergeAx · on Oct 5, 2023

Pause have no meaningful value for language models. If pause tokens are produced from speech analysis, LLMs will just become better at imitating speaking, not smarter.

odyssey7 · on Oct 4, 2023

Woah, very cool idea.