Intuitively it feels like there ought to be significant similarities between expert layers, because there are fundamentals about processing the stream of tokens that must be shared just from the geometry of the problem. If that's true, then identifying a common abstract base "expert" and then specialising the individual experts as low-rank adaptations on top of that base could save a lot of VRAM and cut down on expert-swapping. But it might mean you need to train with that structure from the start, rather than it being something you can distil an existing model down to.
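A minimal sketch of what that could look like, assuming a PyTorch-style implementation (all names here are hypothetical, not from any real MoE codebase): the full-rank up/down projections are shared across all experts, and each expert owns only a pair of small LoRA-style factors.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankExpertFFN(nn.Module):
        # One shared full-rank FFN; each "expert" is just a low-rank delta on it.

        def __init__(self, d_model: int, d_ff: int, n_experts: int, rank: int = 16):
            super().__init__()
            # Shared base weights: loaded into VRAM once, used by every expert.
            self.w_up = nn.Linear(d_model, d_ff, bias=False)
            self.w_down = nn.Linear(d_ff, d_model, bias=False)
            # Per-expert low-rank factors. The B factors start at zero, so each
            # expert begins as the pure base FFN and specialises during training.
            self.up_a = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
            self.up_b = nn.Parameter(torch.zeros(n_experts, rank, d_ff))
            self.down_a = nn.Parameter(torch.randn(n_experts, d_ff, rank) * 0.02)
            self.down_b = nn.Parameter(torch.zeros(n_experts, rank, d_model))

        def forward(self, x: torch.Tensor, expert: int) -> torch.Tensor:
            # Shared base projection plus this expert's low-rank correction.
            h = self.w_up(x) + (x @ self.up_a[expert]) @ self.up_b[expert]
            h = F.silu(h)
            return self.w_down(h) + (h @ self.down_a[expert]) @ self.down_b[expert]

The VRAM argument, with illustrative sizes: at d_model = 4096 and d_ff = 14336, a full expert's two matrices are about 117M parameters (2 × 4096 × 14336), while a rank-16 delta is about 0.6M (2 × 16 × (4096 + 14336)), roughly 200× smaller, so swapping an expert in or out moves megabytes rather than hundreds of megabytes.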

Yes, DeepSeek introduced this optimisation: a shared "expert" that's always loaded and runs on every token alongside the routed experts. Llama 4 uses it too.
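For contrast with the low-rank idea upthread, here's a minimal sketch of that shared-expert pattern (the gist of DeepSeek's published DeepSeekMoE design; class and parameter names here are hypothetical, and a real implementation batches the routing rather than looping per token):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_ffn(d_model: int, d_ff: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(d_model, d_ff, bias=False),
            nn.SiLU(),
            nn.Linear(d_ff, d_model, bias=False),
        )

    class SharedExpertMoE(nn.Module):
        def __init__(self, d_model: int, d_ff: int, n_routed: int, top_k: int = 2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_routed, bias=False)
            self.shared = make_ffn(d_model, d_ff)  # always resident, always active
            self.routed = nn.ModuleList(make_ffn(d_model, d_ff) for _ in range(n_routed))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (n_tokens, d_model). The shared expert handles what's common to
            # all tokens; the router adds top_k specialised experts per token.
            gates = F.softmax(self.router(x), dim=-1)
            top_w, top_i = gates.topk(self.top_k, dim=-1)
            routed_out = torch.zeros_like(x)
            for t in range(x.size(0)):  # naive per-token loop, for clarity only
                for w, i in zip(top_w[t], top_i[t].tolist()):
                    routed_out[t] += w * self.routed[i](x[t])
            return self.shared(x) + routed_out

Note the difference from the low-rank version: the routed experts here stay full-rank, so the win is that common knowledge isn't duplicated across every expert, rather than the per-expert weights themselves getting smaller.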

I had a sneaking suspicion that I wouldn't be the first to think of it.
