> individual tokens are routed to different experts
AFAIK (not an expert! lol) that was the traditional approach,
but judging by the chart in the Llama 4 blog post, they're now interleaving MoE layers with dense attention layers; so I guess this means that even a single token could be routed to a different expert at every single MoE layer!
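
To make that concrete, here's a toy sketch of what I mean (plain Python, no ML libs). Big caveat: this is NOT how Llama 4 is actually implemented. The top-1 routing, the strict dense/MoE alternation, the layer sizes, and every name here (`MoELayer`, `DenseLayer`, `build_stack`, `route_token`) are my own assumptions for illustration. The only point is that each MoE layer has its own router, so the same token gets an independent expert choice at every MoE layer.

```python
import random


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols):
    # Random weights, scaled down so activations stay in a sane range.
    scale = 1.0 / (cols ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]


class DenseLayer:
    """Stand-in for a dense layer: every token uses the same weights."""

    def __init__(self, rng, dim):
        self.W = rand_matrix(rng, dim, dim)

    def forward(self, x):
        y = matvec(self.W, x)
        return [xi + yi for xi, yi in zip(x, y)], None  # no expert chosen


class MoELayer:
    """Toy MoE layer with its own router and top-1 expert selection (assumed)."""

    def __init__(self, rng, dim, num_experts):
        self.router = rand_matrix(rng, num_experts, dim)
        self.experts = [rand_matrix(rng, dim, dim) for _ in range(num_experts)]

    def forward(self, x):
        logits = matvec(self.router, x)  # one score per expert
        expert_id = max(range(len(logits)), key=logits.__getitem__)
        y = matvec(self.experts[expert_id], x)  # only the chosen expert runs
        return [xi + yi for xi, yi in zip(x, y)], expert_id


def build_stack(seed=0, dim=8, num_experts=4, num_blocks=4):
    # Alternate dense and MoE layers (hypothetical layout).
    rng = random.Random(seed)
    layers = []
    for _ in range(num_blocks):
        layers.append(DenseLayer(rng, dim))
        layers.append(MoELayer(rng, dim, num_experts))
    return layers


def route_token(x, layers):
    # Push one token through the stack, recording the expert picked per MoE layer.
    path = []
    for layer in layers:
        x, expert_id = layer.forward(x)
        if expert_id is not None:
            path.append(expert_id)
    return path


if __name__ == "__main__":
    rng = random.Random(1)
    token = [rng.gauss(0.0, 1.0) for _ in range(8)]
    print(route_token(token, build_stack()))  # one expert index per MoE layer
```

The list it prints has one entry per MoE layer, and nothing forces those entries to match, since each router only sees the token's hidden state at its own depth.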