To some degree, attention is already a mechanism for making computation from previous tokens useful later. (You can think of the KV cache as a representation of the text so far and all of the model's thoughts on it.) And since language models are trained on sequences end-to-end, I think this kind of reuse is likely to emerge on its own. Multi-token prediction encourages the behavior explicitly, but only for the small n-token window you define.
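To make the "small n-token window" concrete, here's a minimal sketch of a multi-token prediction objective in PyTorch. The class name, the n_future parameter, and the shared-trunk setup are illustrative assumptions, not any particular paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a multi-token prediction objective, assuming a shared trunk
# that already produces hidden states. Names (MultiTokenHead, n_future) are
# illustrative, not taken from any specific paper.
class MultiTokenHead(nn.Module):
    def __init__(self, d_model, vocab_size, n_future=4):
        super().__init__()
        # one output head per future offset 1..n_future
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def loss(self, hidden, tokens):
        # hidden: (batch, seq, d_model) trunk outputs; tokens: (batch, seq) ids
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])      # position t predicts token t+k
            targets = tokens[:, k:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / len(self.heads)
```

The supervision only reaches n_future steps ahead, so any caching of computation for tokens beyond that window has to come from the end-to-end training pressure rather than from this loss.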
That said, there is a lot of work on increasing the compute utilization of transformer language models (early exit, mixture of depths) and on novel architectures (SSMs, etc.).
Oh huh. Why not make it stateful, i.e. reuse the previous computation and only compute the "diff" when you add a new token? I'm assuming it's not that easy because each token can affect attention globally.
I think I've read something about this, but I wonder if you could abstract attention to the sentence/page level and then only recalculate the parts that are relevant.
I.e. the KV cache is 'just' a time-saving measure, because the LLM would otherwise go back and recalculate those values anyway. (Which is why per-token compute would grow roughly quadratically with context length without it.)
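To make the "diff" concrete, here's a single-head, unbatched sketch (the Wq/Wk/Wv projection matrices are placeholders, not any real model's weights): each decode step projects only the new token and appends its key/value to the cache, while attention still reads the whole cache. Without the cache you'd redo those projections and attention scores for the entire prefix at every step, which is where the roughly quadratic-in-context per-token cost comes from.

```python
import torch

# One decode step with a KV cache: only the NEW token gets projected; its key
# and value are appended, and attention reads the full cache.
def decode_step(x_new, cache_k, cache_v, Wq, Wk, Wv):
    q = x_new @ Wq                        # (1, d) query for the new token only
    k = x_new @ Wk                        # (1, d) its key ...
    v = x_new @ Wv                        # (1, d) ... and value
    cache_k = torch.cat([cache_k, k])     # cache now covers every token so far
    cache_v = torch.cat([cache_v, v])
    attn = torch.softmax(q @ cache_k.T / cache_k.size(-1) ** 0.5, dim=-1)
    out = attn @ cache_v                  # (1, d) attention output for new token
    return out, cache_k, cache_v

# Toy usage: feed in made-up embeddings one token at a time.
d = 16
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
cache_k = torch.empty(0, d)
cache_v = torch.empty(0, d)
for x_new in torch.randn(5, 1, d):
    out, cache_k, cache_v = decode_step(x_new, cache_k, cache_v, Wq, Wk, Wv)
```

The work per step stays small; what grows with context is the cache the attention has to read, which is the memory/compute trade the cache makes.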
You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that, but it would:
a) be far more compute intensive to train and run (especially to train),
b) be susceptible to all of the issues that RNNs have (see the sketch below), and
c) most importantly, it would almost certainly just converge with transformers at scale. Labs run small-scale internal tests of architectures like this all the time, and most of them basically come to this conclusion and abandon the idea.
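For contrast, the "stateful" shape under discussion looks roughly like this (a plain GRU cell as a stand-in, purely illustrative): a fixed-size state updated one token at a time, which is exactly where the classic RNN limitations in (b) come from — the loop is sequential over the sequence during training, and the whole history has to be squeezed into h.

```python
import torch
import torch.nn as nn

# Stand-in for a fully stateful model: a fixed-size recurrent state updated one
# token at a time (plain GRU cell; illustrative only).
class TinyRecurrent(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_model)

    def forward(self, xs):
        # xs: (seq_len, d_model) token embeddings
        h = xs.new_zeros(1, xs.size(-1))   # the entire "memory" of the past
        outs = []
        for x in xs:                       # sequential: no parallelism over time
            h = self.cell(x.unsqueeze(0), h)
            outs.append(h.squeeze(0))
        return torch.stack(outs)
```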