Simply put: the stability of attention-based models over non-attention-based ones.

Google published multi-head self-attention (MHA), a major idea that they showed to work. OpenAI saw it and built an empire on decoder-only transformer models (stacks of attention and feed-forward blocks), which are, compared to most alternatives, super stable at generation. DeepSeek showed it's possible to push these models further by building compression into the attention itself, so a small latent representation still passes around sufficient information (hence the "Latent" in Multi-head Latent Attention). They also did a lot of other cool stuff, but the main "core of the model" difference is this part...
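
To make the latent idea concrete, here's a rough PyTorch sketch of attention where the keys and values are rebuilt from a narrow shared latent instead of full-width projections. The names and dimensions are my own toy choices, not DeepSeek's actual MLA code (no causal mask, no RoPE handling, etc.):

    import torch
    import torch.nn as nn

    class LatentKVAttention(nn.Module):
        # Toy multi-head attention: compress each token into a small latent,
        # then expand that latent back into keys and values. The latent is
        # the only per-token thing you'd need to cache.
        def __init__(self, d_model=512, n_heads=8, d_latent=64):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.q_proj = nn.Linear(d_model, d_model)
            self.kv_down = nn.Linear(d_model, d_latent)   # compression step
            self.k_up = nn.Linear(d_latent, d_model)      # latent -> keys
            self.v_up = nn.Linear(d_latent, d_model)      # latent -> values
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x):
            b, t, d = x.shape
            latent = self.kv_down(x)                      # (b, t, d_latent)
            q, k, v = self.q_proj(x), self.k_up(latent), self.v_up(latent)
            split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            q, k, v = split(q), split(k), split(v)
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
            return self.out_proj((attn @ v).transpose(1, 2).reshape(b, t, d))

    x = torch.randn(2, 16, 512)                 # (batch, tokens, d_model)
    print(LatentKVAttention()(x).shape)         # torch.Size([2, 16, 512])

The point being that the stuff you carry around per token shrinks from full model width down to the latent width, while training still recovers enough information through the up-projections.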

Other than that, the biggest hurdle has been hardware. There's probably no way you could get kit from 2010, without even AES acceleration, to evaluate most full-fat multi-head-attention models, let alone train them. There's been a happy convergence of matrix acceleration on GPUs driven by gaming, graphics and high-fidelity simulation work. Combine that with ML maths that is almost entirely matrix multiplication, plus advances in high-throughput memory, and we can do what we're doing with LLMs now.
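
To see why the GPU matmul units matter so much: the hot loop of attention is literally a couple of batched matrix multiplies plus a softmax (shapes below are just illustrative):

    import torch

    # One attention step for 8 heads over 1024 tokens: two batched matmuls,
    # which is exactly the workload GPU tensor cores are built for.
    q = torch.randn(8, 1024, 64)                 # (heads, tokens, head_dim)
    k = torch.randn(8, 1024, 64)
    v = torch.randn(8, 1024, 64)

    scores = q @ k.transpose(-2, -1) / 64 ** 0.5 # (8, 1024, 1024)
    out = torch.softmax(scores, dim=-1) @ v      # (8, 1024, 64)
    print(out.shape)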

So, inevitable outcome or happy convergence? That's for historians to decide imo. I think it's a bit of both.


