
- 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens.
- outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model.


It's a bit funny to call the 405B reference "significantly larger" than their 389B, while highlighting the fact that their 389B outperforms the 70B.


MoE model with 52 billion activated parameters means it's more comparable to a (dense) 70B model than to a dense 405B model


>> MoE model with 52 billion activated parameters means it's more comparable to a (dense) 70B model than to a dense 405B model

Only when talking about how fast it can produce output. From a capability point of view it makes sense to compare the total parameter count. I suppose there's also a "total storage" comparison, since didn't they say these are 8-bit model weights, whereas Llama is 16-bit?
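
Back-of-the-envelope storage math, assuming the 8-bit vs 16-bit weight figures above are right (rough numbers, not official sizes):

    # Approximate weight storage in GB, assuming the quantization noted above.
    def weight_gb(params_billion, bits_per_weight):
        # params (billions) * bytes per weight = gigabytes of weights
        return params_billion * bits_per_weight / 8

    print(weight_gb(389, 8))    # MoE at 8-bit:           ~389 GB of total weights
    print(weight_gb(52, 8))     #   ...of which only ~52 GB are read per token
    print(weight_gb(405, 16))   # Llama3.1-405B at 16-bit: ~810 GB
    print(weight_gb(70, 16))    # Llama3.1-70B at 16-bit:  ~140 GB

So on disk it's less than half the size of the 405B model, but still far more than the 70B it's benchmarked against.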


Does this mean it runs faster or better on multiple GPUs?


For decode steps it depends on how many inputs you run at a time. If your batch size is 1 it runs in line with the active params; as you get to around batch size 8 it runs in line with all the params; then as you increase to 128-ish it runs like the active params again.

For the context encode it’s always close to as fast as a model with a similar number of active params.

For running it yourself, the issue is going to be fitting all the params on your GPU. If you're loading off disk anyway this will be faster, but if it forces you to put stuff on disk it will be much slower.
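
A minimal sketch of the batch-size effect described above, assuming the decode step is dominated by streaming weights from memory and that different tokens in a batch mostly land on different experts (this ignores attention, KV-cache traffic, and routing overlap, so it's only a toy model):

    # Toy model of weight traffic per decode step for a 389B MoE with 52B active params.
    TOTAL_B = 389.0    # total parameters, billions
    ACTIVE_B = 52.0    # activated parameters per token, billions

    def weights_read_per_step(batch_size):
        # A step never needs more than one pass over all the weights.
        return min(batch_size * ACTIVE_B, TOTAL_B)

    for bs in (1, 8, 128):
        per_step = weights_read_per_step(bs)
        per_token = per_step / bs
        print(f"batch {bs:3d}: {per_step:4.0f}B read per step, {per_token:5.1f}B per token")

    # batch   1:  52B per step  -> step latency like a ~52B dense model
    # batch   8: 389B per step  -> step latency like reading the full 389B model
    # batch 128: 389B per step, ~3B per token -> the full-model read is amortized,
    #            so per-token cost again tracks the ~52B of active compute
    #            rather than the 389B of total weights

The exact batch sizes where the regimes cross over depend on your hardware's compute-to-bandwidth ratio, but the shape of the curve is what matters.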


It's a whole 4% smaller!



