
I have read many articles about LLMs and understand how they work in general, but one thing always bothers me: why didn't other models work as well as the SOTA ones? What's the history and reasoning behind the current model architecture?


Simply put: the stability of attention-based models over non-attention-based ones.

Google introduced multi-head self-attention (MHA), a major idea that they showed to work. OpenAI saw it and built an empire on feed-forward attention models, which are (compared to most alternatives) super stable at generation. DeepSeek showed it's possible to push these models further and effectively use compression in the model design to pass around sufficient information for training (hence the "Latent" in Multi-head Latent Attention). They also did a lot of other cool stuff, but the main "core of the model" difference is this part...

Other than that, the biggest hurdle has been hardware. There's probably no way you could get kit from 2010, which lacked even AES acceleration, to evaluate most full-fat MHA + feed-forward models, let alone train them. There's been a happy convergence of matrix acceleration on GPUs for gaming, graphics, and high-fidelity simulation work. That, combined with matrix-based ML maths and advances in high-throughput memory, means we can do what we're doing with LLMs now.

So inevitable outcome or happy convergence? That's for historians to decide imo. I think it's a bit of both.


I guess nobody really knows why. Everybody just goes with what works, and tries only small variations. It's a bit like alchemy.


Simply put, no other model has the same number of effective skip connections or passes as much information through the model from input to output.
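
For concreteness, here's a minimal numpy sketch (a made-up stand-in layer, not any particular model's code) of what a residual/skip connection looks like: the sub-layer's input is added back onto its output, so information can flow straight through a deep stack of blocks.

    import numpy as np

    def sublayer(x, W1, W2):
        # stand-in for an attention or feed-forward sub-layer
        return np.maximum(0, x @ W1) @ W2

    def block_with_skip(x, W1, W2):
        # residual/skip connection: add the input back onto the
        # sub-layer output so it is never lost, even in deep stacks
        return x + sublayer(x, W1, W2)

    rng = np.random.default_rng(0)
    d, h = 16, 64
    x = rng.normal(size=d)
    out = block_with_skip(x, rng.normal(size=(d, h)), rng.normal(size=(h, d)))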

Earlier models had huge bottlenecks in terms of information limits and precision (autoencoders vs. U-Nets, for example), and LSTMs are still somewhat unstable.

Why the attention design posited by Google works so well is partly the skip-forward connections and partly "now we have enough information and processing power to try this".

It's well motivated, but whether we'd expect it to work this well from first principles is still a bit less well understood. And if you're good at that, you'll likely get a job offer very quickly.


There is a lot of unpublished work on how to train models. A lot of it is cleaning up the data or making synthetic data. This is the secret sauce. It was demonstrated by TinyStories and the Phi models, and now by the recent work on small datasets for math reasoning.


There's a huge effort going into understanding the statistical information in a large corpus of text, especially after people have shown you can reduce the language input needed to carefully selected sources that still guarantee enough information for training.

The smaller the input for the same quality, the quicker we can iterate, so everyone is pushing to get the minimum viable training time of a decent LLM down, both to make chain-of-thought cheaper as a concept and to allow for iteration and innovation.

As long as we live in the future espoused by early OpenAI of huge models on huge GPUs, we were going to stagnate. More GPU always means better in this game, but smaller, faster models mean you can do even more with even less. Now the major players see the innovation heading into the multi-LLM-instance arena, which is still dominated by whoever has the best training and hardware. But I expect to see disruption there too in time.


what do you refer to by "other models" not belonging to the "SOTA ones"?


You mean the history of pre-transformer language models, and the reason for the transformer architecture?

Once upon a time ....

Language modelling in general grew out of attempts to build grammars for natural languages, which then gave rise to statistical approaches to modelling languages based on "n-gram" models (use last n words to predict next word). This was all before modern neural networks.
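
To make the n-gram idea concrete, here's a toy bigram predictor in Python (made-up example text, purely illustrative):

    import collections

    def train_ngram(tokens, n=2):
        # count how often each (n-1)-word context is followed by each next word
        counts = collections.defaultdict(collections.Counter)
        for i in range(len(tokens) - n + 1):
            context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
            counts[context][nxt] += 1
        return counts

    def predict_next(counts, context):
        # predict the most frequent continuation of this context
        return counts[tuple(context)].most_common(1)[0][0]

    counts = train_ngram("the cat sat on the mat because the cat was tired".split())
    print(predict_next(counts, ["the"]))  # -> 'cat'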

Language modelling (pattern recognition) is a natural fit for neural networks, and in particular recurrent neural networks (RNNs) seemed like a good fit because they have a feedback loop allowing an arbitrarily long preceding context (not just the last n words) to be used when predicting the next word. However, in practice RNNs didn't work very well since they tended to forget older context in favor of more recent words. To address this "forgetting" problem, LSTMs were designed: a variety of RNN that explicitly retains state and learns what to keep and what to forget. Using LSTMs for language models was common before transformers.
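
As a minimal illustration (numpy, random weights, not a trained model), this is the recurrence at the heart of an RNN - each step feeds the previous hidden state back in, which is what carries context and also what makes training inherently sequential:

    import numpy as np

    def rnn_step(h_prev, x_t, W_xh, W_hh, b_h):
        # one recurrent step: mix the current input with the previous
        # hidden state; repeated squashing is one reason old context fades
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    d_in, d_h, T = 8, 16, 5
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(d_in, d_h))
    W_hh = rng.normal(size=(d_h, d_h))
    b_h = np.zeros(d_h)

    h = np.zeros(d_h)
    for x_t in rng.normal(size=(T, d_in)):  # one token embedding at a time
        h = rnn_step(h, x_t, W_xh, W_hh, b_h)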

While LSTMs were better able to control what part of their history to retain and forget, the next shortcoming to be addressed was that in natural language the next word doesn't depend uniformly on what came before, and can be better predicted by paying more attention to certain words that are more important in the sentence structure (subjects, verbs, etc) than others. This was addressed by adding an attention mechanism ("Bahdanau attention") that learnt to weight preceding words by varying amounts when predicting the next word.
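
A rough sketch of that weighting idea (numpy, dot-product scoring for brevity; Bahdanau's original uses a small additive scoring network):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        return np.exp(z) / np.exp(z).sum()

    def attention_context(query, keys, values):
        # score each previous word against the current state, turn the
        # scores into weights, and take a weighted sum: words that matter
        # more for the next prediction get larger weights
        scores = keys @ query          # one score per previous word
        weights = softmax(scores)
        return weights @ values, weights

    rng = np.random.default_rng(1)
    q = rng.normal(size=16)            # current decoder state
    K = V = rng.normal(size=(7, 16))   # 7 encoded source words
    context, w = attention_context(q, K, V)
    print(w.round(2), w.sum())         # weights are non-negative and sum to 1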

While attention was an improvement, a major remaining problem with LSTMs was that they are inefficient to train due to their recurrent/sequential nature, which is a poor match for today's highly parallel hardware (GPUs, etc). This inefficiency was the motivation for the modern transformer architecture, described in the "Attention is all you need" paper.

The insight that gave rise to the transformer was that the structure of language is really as much parallel as it is sequential, which you can visualize with a linguist's sentence parse trees, where each branch of the tree is largely independent of other branches at the same level. This structure suggests that language can be understood by a hierarchy (levels of branches) of parallel processing whereby small localized regions of the sentence are analyzed and aggregated into ever larger regions. Both within and across regions (branches), the successful attention mechanism can be used ("Attention is all you need").
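
Here's a minimal single-head version of that self-attention computation (numpy, random weights, causal masking and multiple heads omitted) to show how every position is processed in parallel rather than one step at a time:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        # every token attends to every other token in one matrix product;
        # no recurrence, so the whole sequence is handled at once
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        return softmax(scores) @ V

    rng = np.random.default_rng(2)
    T, d = 6, 32                       # 6 tokens, model width 32
    X = rng.normal(size=(T, d))        # token embeddings
    Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
    print(self_attention(X, Wq, Wk, Wv).shape)  # (6, 32)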

However, the idea of hierarchical parallel processing + attention didn't immediately give rise to the transformer architecture ... The researcher whose idea this was (Jakob Uszkoreit) had initially implemented it using some architecture that I've never seen described, and had not been able to get predictive/modelling performance to beat the LSTM+attention approach it was hoping to replace. At this point another researcher, Noam Shazeer (now back at Google and working on their Gemini model), got involved and worked his magic to turn the idea into a realization - the transformer architecture - whose language modelling performance was indeed an improvement. Actually, there seems to have been a bit of a "throw the kitchen sink at it" approach, as well as Shazeer's insight as to what would work, so there was then an ablation process to identify and strip away all unnecessary parts of this new architecture to essentially give the transformer as we now know it.

So this is the history and reason/motivation behind the transformer architecture (the basis of all of today's LLMs), but the prediction performance and emergent intelligence of large models built using this architecture seem to have been quite a surprise. It's interesting to go back and read the early GPT-1, GPT-2 and GPT-3 papers (ChatGPT was initially based on GPT-3.5) and see the increasing realization of how capable the architecture was.

I think there are a couple of major reasons why older architectures didn't work as well as the transformer.

1) The training efficiency of the transformer, its primary motivation, has allowed it to be scaled up to enormous size, and a lot of the emergent behavior only becomes apparent at scale.

2) I think the details of the transformer architecture - the interaction of key-based attention with hierarchical processing, etc - somewhat accidentally created an architecture capable of much more powerful learning than its creators had anticipated. One of the most powerful mechanisms in the way trained transformers operate is "induction heads", whereby the attention mechanisms of two adjacent layers of the transformer learn to co-operate to implement a very powerful analogical copying operation that is the basis of much of what they do. These induction heads are an emergent mechanism - the result of training the transformer rather than something directly built into the architecture.
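
A deliberately oversimplified toy of that copying pattern (real induction heads learn it as attention behavior across two layers, not literal token matching):

    def induction_copy(context, current):
        # the pattern "[A][B] ... [A]" -> predict "[B]": find an earlier
        # occurrence of the current token and return whatever followed it
        for i in range(len(context) - 1):
            if context[i] == current:
                return context[i + 1]
        return None

    print(induction_copy(["Mr", "Dursley", "said", "that"], "Mr"))  # -> 'Dursley'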



