Further improvements in efficiency need not come from alternative architectures alone; they'll likely also come from novel training objectives, optimizers, data augmentations, etc.
If you think about it, Transformers were basically a way to generalize convolution: instead of a fixed kernel shape in the image-processing sense, you now have a learned kernel of arbitrary shape. A big advancement in terms of what they allowed, but fundamentally not really a new concept.
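To make that analogy concrete, here is a minimal sketch in plain NumPy (toy shapes, nothing tied to any real model): a 1D convolution reuses the same fixed kernel at every position, while self-attention computes an input-dependent weighting over the whole sequence, effectively a learned "kernel" of arbitrary shape.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 4                        # toy sequence length and feature dim
x = rng.normal(size=(T, d))        # token embeddings

# Convolution view: one fixed 3-tap kernel, applied identically at every position.
conv_kernel = np.array([0.25, 0.5, 0.25])

# Attention view: the mixing weights are computed from the data itself and can
# span the whole sequence, i.e. a per-position "kernel" of arbitrary shape.
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
q, k = x @ Wq, x @ Wk
attn_weights = softmax(q @ k.T / np.sqrt(d))   # shape (T, T)

print(conv_kernel)       # identical at every position
print(attn_weights[0])   # input-dependent, different for every position
```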
While these things represent a fundamental way we as humans store information, they have very little to do with actual reasoning.
My bet is that Hebbian learning is going to see a resurgence. Basically, the architecture needs to be able to partition data domains while drawing connections between them, and to run internal prediction mechanisms.
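For context, Hebbian learning is just a local, correlation-driven update rule ("neurons that fire together wire together"). Here is a minimal illustrative sketch using Oja's stabilized variant so the weights stay bounded; the sizes and learning rate are arbitrary choices, not a proposal for such an architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 3
W = rng.normal(scale=0.1, size=(n_out, n_in))    # synaptic weights
eta = 0.01                                       # learning rate

def hebbian_step(W, x):
    """One Hebbian update: strengthen weights between co-active units.

    Uses Oja's variant (the -y^2 * W term) so the weights stay bounded
    instead of growing without limit, as plain Hebb's rule would.
    """
    y = W @ x                                    # post-synaptic activity
    return W + eta * (np.outer(y, x) - (y ** 2)[:, None] * W)

for _ in range(1000):
    x = rng.normal(size=n_in)                    # pre-synaptic activity
    W = hebbian_step(W, x)
```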
If we wanted to generalize this further, attention is 'just' an instance of graph convolution. Encoder-only models like BERT are complete graphs; decoder-only models like GPT are still complete graphs, but the information flow is not bidirectional as in BERT: instead, a node provides information only to subsequent nodes, which gives rise to the causal nature.
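Viewed that way, the attention mask is essentially the adjacency matrix of the graph: all-ones for a BERT-style encoder (complete, bidirectional) and lower-triangular for a GPT-style decoder (causal). A toy sketch of just the masks, not any particular library's API:

```python
import numpy as np

T = 5  # number of tokens / graph nodes

# BERT-style encoder: every node attends to every node (complete graph).
bert_adjacency = np.ones((T, T), dtype=int)

# GPT-style decoder: node i only receives information from nodes j <= i,
# i.e. the same complete graph but with one-directional (causal) edges.
gpt_adjacency = np.tril(np.ones((T, T), dtype=int))

print(bert_adjacency)
print(gpt_adjacency)
```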
However, I don't think Hebbian learning will see a resurgence, except perhaps if it motivates some kind of pruning mechanism.
I think Sutton was right in 'The Bitter Lesson'; the problem seems to be that we are hitting the limits of what we can do with our compute.