
An MoE model with 52 billion activated parameters means it's more comparable to a (dense) 70B model than to a dense 405B model.


>> An MoE model with 52 billion activated parameters means it's more comparable to a (dense) 70B model than to a dense 405B model.

Only when talking about how fast it can produce output. From a capability point of view, it makes sense to compare the total parameter count. I suppose there's also a "total storage" comparison too, since didn't they say this uses 8-bit model weights, whereas Llama is 16-bit?
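To make the storage comparison concrete, a quick back-of-envelope in Python. The 52B activated figure is from the thread; the total parameter count and the byte widths are illustrative assumptions, not official specs:

    def weight_gb(params_billion, bits_per_weight):
        # raw weight storage in GB
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    moe_total_b  = 400   # assumed total params (B) for the MoE, illustrative
    moe_active_b = 52    # activated params per token (from the thread)

    print(f"MoE total  @ 8-bit : ~{weight_gb(moe_total_b, 8):.0f} GB to store")
    print(f"MoE active @ 8-bit : ~{weight_gb(moe_active_b, 8):.0f} GB read per token")
    print(f"Dense 70B  @ 16-bit: ~{weight_gb(70, 16):.0f} GB")
    print(f"Dense 405B @ 16-bit: ~{weight_gb(405, 16):.0f} GB")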


Does this mean it runs faster or better on multiple GPUs?


For decode steps it depends on how many sequences you run at a time. At batch size 1 it runs in line with the active params; around batch size 8 it runs in line with the total params, since different tokens hit different experts and most of the model gets read each step; then as you push toward batch 128 or so it behaves like the active params again.
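A back-of-envelope sketch of the memory traffic behind those three regimes. The 52B active figure is from the thread; the total parameter count, the routing behavior, and 8-bit weights are illustrative assumptions:

    TOTAL_PARAMS  = 400e9   # assumed total params for the MoE
    ACTIVE_PARAMS = 52e9    # activated per token (from the thread)
    BYTES_PER_W   = 1       # assumed 8-bit weights

    def weights_streamed_per_step(batch_size):
        # Each token only touches its activated experts, but different tokens
        # tend to hit different experts, so by roughly batch 8 most of the
        # model ends up being read from memory every decode step.
        return min(TOTAL_PARAMS, batch_size * ACTIVE_PARAMS) * BYTES_PER_W

    for batch in (1, 8, 128):
        per_step = weights_streamed_per_step(batch)
        print(f"batch {batch:>3}: ~{per_step/1e9:.0f} GB/step, "
              f"~{per_step/batch/1e9:.0f} GB of weights per token")

    # batch 1:   ~52 GB/step  -> behaves like a dense 52B model
    # batch 8:   ~400 GB/step -> behaves like a dense 400B model
    # batch 128: still ~400 GB/step, but only ~3 GB per token, so memory
    #            bandwidth stops being the bottleneck and per-token compute
    #            (which scales with the 52B active params) dominates again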

For context encoding (prefill) it's always close to as fast as a model with a similar number of active params.

For running it on your own, the issue is going to be fitting all the params on your GPU. If you were going to be streaming weights off disk anyway, this will be faster; but if the total parameter count forces you to spill weights to disk when a dense model would have fit, it will be much slower.
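A quick fit check for running it locally, i.e. whether the full weight set can live in GPU memory. The total parameter count, 8-bit weights, the overhead factor, and the 80 GB card are all illustrative assumptions:

    TOTAL_PARAMS = 400e9    # assumed total params
    BYTES_PER_W  = 1        # assumed 8-bit weights
    GPU_MEM_GB   = 80       # e.g. one 80 GB card
    OVERHEAD     = 1.2      # rough allowance for KV cache and activations

    needed_gb = TOTAL_PARAMS * BYTES_PER_W * OVERHEAD / 1e9
    gpus = -(-needed_gb // GPU_MEM_GB)   # ceiling division
    print(f"~{needed_gb:.0f} GB needed -> at least {int(gpus)} x {GPU_MEM_GB} GB GPUs, "
          f"otherwise weights spill to CPU RAM / disk and decode slows way down")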




