
> individual tokens are routed to different experts

that was, AFAIK (not an expert! lol), the traditional approach

but judging by the chart in the Llama 4 blog post, they're now interleaving MoE layers with dense layers; so I guess this means that even a single token could be routed to a different expert at every single MoE layer!
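
To make that concrete, here's a toy sketch of what I mean, not Llama 4's actual code. Everything in it is made up for illustration: the names (`make_router`, `route_top1`, `trace_routing`), the top-1 routing, the random router weights, and the random nudge standing in for each layer's real computation. The only point is that each MoE layer has its own router, so the same token gets a fresh expert choice at every MoE layer, while dense layers do no routing at all.

```python
import random

def make_router(num_experts, dim, rng):
    # Hypothetical router: one random weight vector per expert.
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]

def route_top1(token, router):
    # Score the token against each expert and return the index of the best one.
    scores = [sum(w * x for w, x in zip(weights, token)) for weights in router]
    return max(range(len(scores)), key=scores.__getitem__)

def trace_routing(token, layer_kinds, num_experts=4, seed=0):
    # Returns one entry per layer: None for a dense layer,
    # or the chosen expert index for an MoE layer.
    rng = random.Random(seed)
    dim = len(token)
    trace = []
    for kind in layer_kinds:
        if kind == "dense":
            # Dense layer: every token goes through the same block, no routing.
            trace.append(None)
        else:
            # MoE layer: this layer owns its router, so the choice is made afresh.
            router = make_router(num_experts, dim, rng)
            trace.append(route_top1(token, router))
        # Assumed stand-in for the layer's transformation of the hidden state.
        token = [x + rng.uniform(-0.5, 0.5) for x in token]
    return trace

# Alternating dense and MoE layers, as the chart seems to show.
layers = ["dense", "moe", "dense", "moe", "dense", "moe"]
print(trace_routing([0.3, -1.2, 0.8, 0.1], layers))
```

The printed list has `None` at the dense positions and an expert index at each MoE position; nothing forces those indices to match from one MoE layer to the next.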





