Further improvements in efficiency need not come from alternative architectures alone; they'll likely also come from novel training objectives, optimizers, data augmentations, etc.
If you think about it, Transformers were basically a way to generalize convolution: instead of a fixed kernel shape in the image-processing sense, you now have a learned kernel of arbitrary shape. A big advancement in terms of what they allowed, but fundamentally not really a new concept.
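To make that analogy concrete, here is a minimal sketch in plain NumPy (toy shapes, nothing tied to any real model): a 1D convolution reuses the same fixed kernel at every position, while self-attention computes an input-dependent weighting over the whole sequence, effectively a learned "kernel" of arbitrary shape.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 4                        # toy sequence length and feature dim
x = rng.normal(size=(T, d))        # token embeddings

# Convolution view: one fixed 3-tap kernel, applied identically at every position.
conv_kernel = np.array([0.25, 0.5, 0.25])

# Attention view: the mixing weights are computed from the data itself and can
# span the whole sequence, i.e. a per-position "kernel" of arbitrary shape.
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
q, k = x @ Wq, x @ Wk
attn_weights = softmax(q @ k.T / np.sqrt(d))   # shape (T, T)

print(conv_kernel)       # identical at every position
print(attn_weights[0])   # input-dependent, different for every position
```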
While these things represent a fundamental way we as humans store information, they have very little to do with actual reasoning.
My bet is that Hebbian learning is going to see a resurgence. Basically, the architecture needs to be able to partition data domains while drawing connections between them, and to run internal prediction mechanisms.
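For context, Hebbian learning is just a local, correlation-driven update rule ("neurons that fire together wire together"). Here is a minimal illustrative sketch using Oja's stabilized variant so the weights stay bounded; the sizes and learning rate are arbitrary choices, not a proposal for such an architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 3
W = rng.normal(scale=0.1, size=(n_out, n_in))    # synaptic weights
eta = 0.01                                       # learning rate

def hebbian_step(W, x):
    """One Hebbian update: strengthen weights between co-active units.

    Uses Oja's variant (the -y^2 * W term) so the weights stay bounded
    instead of growing without limit, as plain Hebb's rule would.
    """
    y = W @ x                                    # post-synaptic activity
    return W + eta * (np.outer(y, x) - (y ** 2)[:, None] * W)

for _ in range(1000):
    x = rng.normal(size=n_in)                    # pre-synaptic activity
    W = hebbian_step(W, x)
```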
If we wanted to generalize this further, attention is 'just' an instance of graph convolution. Encoder-only models like BERT are complete graphs; decoder-only models like GPT are still complete graphs, but the information flow is not bidirectional as in BERT: instead, a node provides information only to subsequent nodes, which gives rise to the causal nature.
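Viewed that way, the attention mask is essentially the adjacency matrix of the graph: all-ones for a BERT-style encoder (complete, bidirectional) and lower-triangular for a GPT-style decoder (causal). A toy sketch of just the masks, not any particular library's API:

```python
import numpy as np

T = 5  # number of tokens / graph nodes

# BERT-style encoder: every node attends to every node (complete graph).
bert_adjacency = np.ones((T, T), dtype=int)

# GPT-style decoder: node i only receives information from nodes j <= i,
# i.e. the same complete graph but with one-directional (causal) edges.
gpt_adjacency = np.tril(np.ones((T, T), dtype=int))

print(bert_adjacency)
print(gpt_adjacency)
```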
However, I don't think Hebbian learning will see a resurgence, except perhaps if it motivates some kind of pruning mechanism.
I think Sutton was right in 'The Bitter Lesson'; the problem seems to be that we are hitting the limits of what we can do with our compute.