
Because attention is all you need.

I.e., the KV cache is 'just' a time-saving measure: without it, an LLM would go back and recompute those key/value projections for the whole prefix on every step anyway. (Which is why per-token compute grows with context length, roughly quadratically, otherwise.)
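For concreteness, here's a minimal single-head sketch in plain NumPy (the weight names Wq/Wk/Wv and the softmax helper are my own illustration, not any particular library's API). The uncached path redoes the K/V projections for the entire prefix every step, while the cached path projects only the newest token and reuses the stored rows; both produce the same output.

  import numpy as np

  d = 8                                   # hypothetical model/head dimension
  rng = np.random.default_rng(0)
  Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  def step_no_cache(prefix):
      """Recompute K and V for the *entire* prefix every step:
      per-token work grows with context length."""
      K = prefix @ Wk                     # (t, d), rebuilt from scratch
      V = prefix @ Wv
      q = prefix[-1] @ Wq                 # query for the newest token only
      return softmax(q @ K.T / np.sqrt(d)) @ V

  def step_with_cache(new_tok, cache):
      """Project only the newest token and append it to the cache:
      the saved K/V rows are reused, so per-token work stays roughly flat."""
      cache["K"].append(new_tok @ Wk)
      cache["V"].append(new_tok @ Wv)
      K, V = np.stack(cache["K"]), np.stack(cache["V"])
      q = new_tok @ Wq
      return softmax(q @ K.T / np.sqrt(d)) @ V

  tokens = rng.standard_normal((5, d))    # stand-in for token embeddings
  cache = {"K": [], "V": []}
  for t in range(1, len(tokens) + 1):
      out_slow = step_no_cache(tokens[:t])
      out_fast = step_with_cache(tokens[t - 1], cache)
      assert np.allclose(out_slow, out_fast)   # same result, less work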

You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that, but it would:

a) be far more compute-intensive to train and run (especially to train);

b) be susceptible to all of the issues that RNNs have (see the sketch after this list);

c) most importantly, almost certainly just converge at scale with transformers. Labs run small-scale internal tests of architectures all the time, and most of them basically come to this conclusion and abandon the idea.
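For contrast, a minimal sketch of what "more stateful" usually means in practice: an RNN-style update that folds each token into one fixed-size hidden vector (the weight names Wh/Wx and the state size are illustrative assumptions). Per-token work is constant, but the entire history has to be squeezed into that one vector, which is where the classic RNN problems come from.

  import numpy as np

  d, state_size = 8, 16
  rng = np.random.default_rng(1)
  Wh = rng.standard_normal((state_size, state_size)) * 0.1
  Wx = rng.standard_normal((d, state_size)) * 0.1

  def rnn_step(state, token):
      """Constant work per token, but nothing can be revisited later:
      the whole history lives inside `state`, unlike a growing KV cache."""
      return np.tanh(state @ Wh + token @ Wx)

  state = np.zeros(state_size)
  for token in rng.standard_normal((5, d)):   # stand-in token embeddings
      state = rnn_step(state, token)
  # `state` is all the model remembers; a KV cache keeps every token around.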


