Intuitively it feels like there ought to be significant similarities between expert layers, because there are fundamentals about processing the stream of tokens that must be shared just from the geometry of the problem. If that's true, then identifying a common abstract base "expert" and then specialising the individual experts as low-rank adaptations on top of that base could save a lot of VRAM and cut down on expert-swapping. But it might mean you need to train with that structure from the start, rather than it being something you can distil an existing model down to.
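A minimal sketch of what that could look like, assuming a PyTorch-style implementation (all names here are hypothetical, not from any real MoE codebase): the full-rank up/down projections are shared across all experts, and each expert owns only a pair of small LoRA-style factors.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankExpertFFN(nn.Module):
        # One shared full-rank FFN; each "expert" is just a low-rank delta on it.

        def __init__(self, d_model: int, d_ff: int, n_experts: int, rank: int = 16):
            super().__init__()
            # Shared base weights: loaded into VRAM once, used by every expert.
            self.w_up = nn.Linear(d_model, d_ff, bias=False)
            self.w_down = nn.Linear(d_ff, d_model, bias=False)
            # Per-expert low-rank factors. The B factors start at zero, so each
            # expert begins as the pure base FFN and specialises during training.
            self.up_a = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
            self.up_b = nn.Parameter(torch.zeros(n_experts, rank, d_ff))
            self.down_a = nn.Parameter(torch.randn(n_experts, d_ff, rank) * 0.02)
            self.down_b = nn.Parameter(torch.zeros(n_experts, rank, d_model))

        def forward(self, x: torch.Tensor, expert: int) -> torch.Tensor:
            # Shared base projection plus this expert's low-rank correction.
            h = self.w_up(x) + (x @ self.up_a[expert]) @ self.up_b[expert]
            h = F.silu(h)
            return self.w_down(h) + (h @ self.down_a[expert]) @ self.down_b[expert]

The VRAM argument, with illustrative sizes: at d_model = 4096 and d_ff = 14336, a full expert's two matrices are about 117M parameters (2 × 4096 × 14336), while a rank-16 delta is about 0.6M (2 × 16 × (4096 + 14336)), roughly 200× smaller, so swapping an expert in or out moves megabytes rather than hundreds of megabytes.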

Yes, DeepSeek introduced this optimisation: a shared "expert" that's always loaded and runs on every token alongside the routed experts. Llama 4 uses it too.
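For contrast with the low-rank idea upthread, here's a minimal sketch of that shared-expert pattern (the gist of DeepSeek's published DeepSeekMoE design; class and parameter names here are hypothetical, and a real implementation batches the routing rather than looping per token):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_ffn(d_model: int, d_ff: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(d_model, d_ff, bias=False),
            nn.SiLU(),
            nn.Linear(d_ff, d_model, bias=False),
        )

    class SharedExpertMoE(nn.Module):
        def __init__(self, d_model: int, d_ff: int, n_routed: int, top_k: int = 2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_routed, bias=False)
            self.shared = make_ffn(d_model, d_ff)  # always resident, always active
            self.routed = nn.ModuleList(make_ffn(d_model, d_ff) for _ in range(n_routed))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (n_tokens, d_model). The shared expert handles what's common to
            # all tokens; the router adds top_k specialised experts per token.
            gates = F.softmax(self.router(x), dim=-1)
            top_w, top_i = gates.topk(self.top_k, dim=-1)
            routed_out = torch.zeros_like(x)
            for t in range(x.size(0)):  # naive per-token loop, for clarity only
                for w, i in zip(top_w[t], top_i[t].tolist()):
                    routed_out[t] += w * self.routed[i](x[t])
            return self.shared(x) + routed_out

Note the difference from the low-rank version: the routed experts here stay full-rank, so the win is that common knowledge isn't duplicated across every expert, rather than the per-expert weights themselves getting smaller.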

I had a sneaking suspicion that I wouldn't be the first to think of it.
