What does it mean to have fewer active parameters (36B) than the full model size (132B), and what impact does that have on memory and latency? It seems like this is because it's an MoE model?
The mixture of experts is kinda like a team and a manager. So the manager and one or two of the team go to work depending on the input, not the entire team.
So in this analogy, each team member and the manager has a certain number of params. The whole team adds up to 132B. The manager plus the team members picked for a specific input add up to 36B, and those are the only ones that actually do work for that token (though in practice you still keep the whole 132B team loaded, as the later answers explain).
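If it helps to see it as code, here's a toy PyTorch-style sketch of that manager + team setup. All the names, sizes, and the top-2 choice here are made up for illustration, and a real MoE layer is fancier (load balancing, batched expert dispatch, etc.), but the routing has the same shape:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)        # the "manager"
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)                       # the "team"
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, d_model)
        scores = self.router(x)                             # (n_tokens, n_experts)
        top_scores, chosen = scores.topk(self.top_k, dim=-1)
        weights = top_scores.softmax(dim=-1)                # how much to weight each chosen expert
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                         # plain loop for clarity, not speed
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.experts[int(e)](x[t])    # only the chosen experts run
        return out

x = torch.randn(5, 64)        # 5 tokens
y = TinyMoE()(x)              # each token only touches 2 of the 16 experts
```

All 16 experts exist as parameters, but per token only 2 of them ever multiply anything.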
Means that it's a mixture-of-experts model with 132B parameters in total, but only a subset of roughly 36B parameters is selected and used in each forward pass, depending on the context. The parameters not used for generating a particular token belong to "experts" that were deemed not very good at predicting the next token in the current context, but they could be selected for, e.g., the next token.
That way, at inference time you get the speed of a 36B model, because you're only "using" 36B params at a time, but the next token might (and frequently does) need a different set of experts than the one before it. If that new set of experts is already loaded (i.e. you preloaded the full 132B params into GPU VRAM), there's no overhead, and you just keep running at 36B speed regardless of which experts get picked.
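Rough numbers, just to make the memory-vs-speed point concrete (assuming 16-bit weights and the usual ~2 FLOPs per active parameter per token; actual figures depend on quantization, KV cache, batching, etc.):

```python
total_params    = 132e9   # all of this sits in VRAM if you preload everything
active_params   = 36e9    # only this much does compute for any single token
bytes_per_param = 2       # fp16 / bf16

vram_gb         = total_params * bytes_per_param / 1e9   # ~264 GB just for weights
flops_per_token = 2 * active_params                      # ~72 GFLOPs, i.e. 36B-class compute

print(f"weights in VRAM:   ~{vram_gb:.0f} GB")
print(f"compute per token: ~{flops_per_token / 1e9:.0f} GFLOPs")
```

So the memory footprint is that of a 132B model, while the per-token compute (and hence latency) is that of a 36B model.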
You could theoretically load only 36B at a time, but you would be severely bottlenecked by having to reload those 36B params, potentially for every new token! Even on top-of-the-line consumer GPUs that would slow you down to ~seconds per token instead of tokens per second :)
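Back-of-the-envelope version of that, assuming fp16 weights and an optimistic ~32 GB/s of sustained host-to-GPU transfer over PCIe 4.0 x16:

```python
active_params   = 36e9
bytes_per_param = 2                   # fp16
pcie_gb_per_s   = 32                  # optimistic, sustained

reload_gb         = active_params * bytes_per_param / 1e9   # ~72 GB of weights to move
seconds_per_token = reload_gb / pcie_gb_per_s                # ~2.3 s per token, best case

print(f"~{seconds_per_token:.1f} s/token spent just shuttling weights")
```

And that's before doing any actual math on those weights, which is why everyone keeps all 132B resident in VRAM if they possibly can.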