
- 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens.
- outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model.


It's a bit funny to call the 405B reference "significantly larger" than their 389B, while highlighting the fact that their 389B outperforms the 70B.


MoE model with 52 billion activated parameters means it's more comparable to a (dense) 70B model than to a dense 405B model


>> MoE model with 52 billion activated parameters means it's more comparable to a (dense) 70B model than to a dense 405B model

Only when talking about how fast it can produce output. From a capability point of view it makes sense to compare the total parameter count. I suppose there's also a "total storage" comparison, since didn't they say these are 8-bit model weights, whereas Llama is 16-bit?
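
Back-of-the-envelope storage math, assuming the 8-bit vs 16-bit weight figures above are right (rough numbers, not official sizes):

    # Approximate weight storage in GB, assuming the quantization noted above.
    def weight_gb(params_billion, bits_per_weight):
        # params (billions) * bytes per weight = gigabytes of weights
        return params_billion * bits_per_weight / 8

    print(weight_gb(389, 8))    # MoE at 8-bit:           ~389 GB of total weights
    print(weight_gb(52, 8))     #   ...of which only ~52 GB are read per token
    print(weight_gb(405, 16))   # Llama3.1-405B at 16-bit: ~810 GB
    print(weight_gb(70, 16))    # Llama3.1-70B at 16-bit:  ~140 GB

So on disk it's less than half the size of the 405B model, but still far more than the 70B it's benchmarked against.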


Does this mean it runs faster or better on multiple GPUs?


For decode steps it depends on how many inputs you run at a time. If your batch size is 1 it runs in line with the active params; as you get to around batch size 8 it runs in line with all the params; then as you increase to 128-ish it runs like the active params again.

For the context encode it’s always close to as fast as a model with a similar number of active params.

For running it yourself, the issue is going to be fitting all the params on your GPU. If you're loading off disk anyway this will be faster, but if it forces you to put stuff on disk it will be much slower.
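
A minimal sketch of the batch-size effect described above, assuming the decode step is dominated by streaming weights from memory and that different tokens in a batch mostly land on different experts (this ignores attention, KV-cache traffic, and routing overlap, so it's only a toy model):

    # Toy model of weight traffic per decode step for a 389B MoE with 52B active params.
    TOTAL_B = 389.0    # total parameters, billions
    ACTIVE_B = 52.0    # activated parameters per token, billions

    def weights_read_per_step(batch_size):
        # A step never needs more than one pass over all the weights.
        return min(batch_size * ACTIVE_B, TOTAL_B)

    for bs in (1, 8, 128):
        per_step = weights_read_per_step(bs)
        per_token = per_step / bs
        print(f"batch {bs:3d}: {per_step:4.0f}B read per step, {per_token:5.1f}B per token")

    # batch   1:  52B per step  -> step latency like a ~52B dense model
    # batch   8: 389B per step  -> step latency like reading the full 389B model
    # batch 128: 389B per step, ~3B per token -> the full-model read is amortized,
    #            so per-token cost again tracks the ~52B of active compute
    #            rather than the 389B of total weights

The exact batch sizes where the regimes cross over depend on your hardware's compute-to-bandwidth ratio, but the shape of the curve is what matters.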


It's a whole 4% smaller!



