To some degree, attention is already a mechanism for making computation from previous tokens useful later. (You can think of the KV cache as a representation of the text so far and all of the model's thoughts on it.) And since language models are trained on sequences end-to-end, I think this kind of reuse is likely to emerge on its own. Multi-token prediction encourages the behavior explicitly, but only for the small n-token window you define.
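To make the "small n-token window" concrete, here's a minimal sketch of a multi-token prediction objective in PyTorch. The class name, the n_future parameter, and the shared-trunk setup are illustrative assumptions, not any particular paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a multi-token prediction objective, assuming a shared trunk
# that already produces hidden states. Names (MultiTokenHead, n_future) are
# illustrative, not taken from any specific paper.
class MultiTokenHead(nn.Module):
    def __init__(self, d_model, vocab_size, n_future=4):
        super().__init__()
        # one output head per future offset 1..n_future
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def loss(self, hidden, tokens):
        # hidden: (batch, seq, d_model) trunk outputs; tokens: (batch, seq) ids
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])      # position t predicts token t+k
            targets = tokens[:, k:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / len(self.heads)
```

The supervision only reaches n_future steps ahead, so any caching of computation for tokens beyond that window has to come from the end-to-end training pressure rather than from this loss.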
That said, there is a lot of work on increasing the compute utilization of transformer language models (early exit, mixture of depths) and on novel architectures (SSMs, etc.).
Oh huh. Why not make it stateful, i.e. reuse the previous computation and only compute the "diff" when you add a new token? I'm assuming it's not that easy because each token can affect attention globally.
I think I've read something about this, but I wonder if you could abstract attention to the sentence/page level and then only recalculate the parts that are relevant.
I.e. the KV cache is 'just' a time-saving measure, because the LLM would otherwise go back and recalculate those values anyway. (Which is why per-token compute would grow roughly quadratically with context length without it.)
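To make the "diff" concrete, here's a single-head, unbatched sketch (the Wq/Wk/Wv projection matrices are placeholders, not any real model's weights): each decode step projects only the new token and appends its key/value to the cache, while attention still reads the whole cache. Without the cache you'd redo those projections and attention scores for the entire prefix at every step, which is where the roughly quadratic-in-context per-token cost comes from.

```python
import torch

# One decode step with a KV cache: only the NEW token gets projected; its key
# and value are appended, and attention reads the full cache.
def decode_step(x_new, cache_k, cache_v, Wq, Wk, Wv):
    q = x_new @ Wq                        # (1, d) query for the new token only
    k = x_new @ Wk                        # (1, d) its key ...
    v = x_new @ Wv                        # (1, d) ... and value
    cache_k = torch.cat([cache_k, k])     # cache now covers every token so far
    cache_v = torch.cat([cache_v, v])
    attn = torch.softmax(q @ cache_k.T / cache_k.size(-1) ** 0.5, dim=-1)
    out = attn @ cache_v                  # (1, d) attention output for new token
    return out, cache_k, cache_v

# Toy usage: feed in made-up embeddings one token at a time.
d = 16
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
cache_k = torch.empty(0, d)
cache_v = torch.empty(0, d)
for x_new in torch.randn(5, 1, d):
    out, cache_k, cache_v = decode_step(x_new, cache_k, cache_v, Wq, Wk, Wv)
```

The work per step stays small; what grows with context is the cache the attention has to read, which is the memory/compute trade the cache makes.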
You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that, but it would:
a) be far more compute intensive to train and run (especially to train),
b) be susceptible to all of the issues that RNNs have (see the sketch below), and
c) most importantly, it would almost certainly just converge with transformers at scale. Labs run small-scale internal tests of architectures like this all the time, and most of them basically come to this conclusion and abandon the idea.
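For contrast, the "stateful" shape under discussion looks roughly like this (a plain GRU cell as a stand-in, purely illustrative): a fixed-size state updated one token at a time, which is exactly where the classic RNN limitations in (b) come from — the loop is sequential over the sequence during training, and the whole history has to be squeezed into h.

```python
import torch
import torch.nn as nn

# Stand-in for a fully stateful model: a fixed-size recurrent state updated one
# token at a time (plain GRU cell; illustrative only).
class TinyRecurrent(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_model)

    def forward(self, xs):
        # xs: (seq_len, d_model) token embeddings
        h = xs.new_zeros(1, xs.size(-1))   # the entire "memory" of the past
        outs = []
        for x in xs:                       # sequential: no parallelism over time
            h = self.cell(x.unsqueeze(0), h)
            outs.append(h.squeeze(0))
        return torch.stack(outs)
```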