Still 7B, but now with 32k context. Looking forward to seeing how it compares with the previous one, and what the community does with it.


Not 7B, 8x7B.

It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).
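Back-of-the-envelope, 4-bit weights take roughly half a byte each. A quick sketch (the ~47B total for the 8x7B mixture is an assumption based on shared attention plus eight expert FFNs, not an official figure):

    # Rough memory estimate for 4-bit quantized weights (~0.5 bytes per parameter).
    # Parameter counts are assumptions, not official figures.
    def gib_at_4bit(params_billions: float) -> float:
        return params_billions * 1e9 * 0.5 / 2**30

    print(f"Mistral 7B (dense):  {gib_at_4bit(7):.1f} GiB")   # ~3.3 GiB
    print(f"Mixtral 8x7B (MoE):  {gib_at_4bit(47):.1f} GiB")  # ~21.9 GiB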


Given the config parameters posted, it's 2 experts per token, so the computation cost per token should be the cost of the component that selects experts + 2× the cost of a 7B model.
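A minimal sketch of what top-2 routing looks like (illustrative only, not Mistral's implementation; `router` and `experts` are placeholder modules): the gate scores every expert, but only the two highest-scoring FFNs actually run for each token.

    import torch
    import torch.nn.functional as F

    def top2_moe_layer(x, router, experts):
        """x: (tokens, dim); router: nn.Linear(dim, n_experts); experts: list of FFN modules.
        Only the two highest-scoring experts are evaluated per token."""
        logits = router(x)                               # (tokens, n_experts)
        scores, idx = torch.topk(logits, k=2, dim=-1)    # keep 2 experts per token
        weights = F.softmax(scores, dim=-1)              # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(2):                            # the two chosen slots
            for e, expert in enumerate(experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out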


Ah, good catch. Upon even closer examination, the attention layers (~2B params) are shared across experts. So in theory you would need 2B for attention + 5B for each of the two active experts in RAM.

That's a total of 12B, meaning this should be runnable on the same hardware as 13B models, with some loading time between generations.
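Spelling out that arithmetic (the per-component sizes are the thread's rough estimates, not official numbers):

    # Rough active vs. resident parameter counts, in billions.
    attention_shared = 2        # shared across all experts (thread's estimate)
    expert_ffn       = 5        # per expert (thread's estimate)
    n_experts        = 8
    experts_per_tok  = 2

    active_params   = attention_shared + experts_per_tok * expert_ffn  # 12B used per token
    resident_params = attention_shared + n_experts * expert_ffn        # ~42B kept in memory
    print(active_params, resident_params)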


Yes, but I also care about "can I load this onto my home GPU?" where, if I need all experts for this to run, the answer is "no".


The answer is yes if you have a 24GB GPU. Just wait for 4bit quantization.

Or watch Tim Dettmers, who is releasing code to run Mixtral 8x7b in just 4GB of RAM.
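For the 24GB single-GPU route, a rough sketch of 4-bit loading through transformers + bitsandbytes (the model id is assumed to be the Hugging Face release name; exact API details may differ):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mixtral-8x7B-v0.1"  # assumed Hugging Face repo name

    # NF4 4-bit quantization via bitsandbytes; weights should fit in ~24GB of VRAM.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )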


We can't infer the actual context size from the config.

Mistral 7B is basically an 8K model, but was marketed as a 32K one.
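For what it's worth, the advertised numbers are easy to read off the released config; whether they reflect usable context is exactly the open question. A quick check (field names follow the Hugging Face config format; the Mistral 7B values in the comments are from memory, treat them as assumptions):

    from transformers import AutoConfig

    # Reads the advertised context fields from the published config.
    cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
    print(cfg.max_position_embeddings)  # 32768 advertised context
    print(cfg.sliding_window)           # 4096 per-layer attention window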


Unfortunately it's too big for the broader community to test. It will be very interesting to see how well it performs compared to the large models.


Not really, looks like a ~40B class model which is very runnable.


It's actually ~13B class at runtime. The ~2B of attention parameters is shared across all experts, and it runs 2 experts at a time.

So 2B for attention + 5Bx2 for inference = 12B in RAM at runtime.


Yeah. I just mean in terms of VRAM usage.


Yes, that's what I mean as well.

It's between 7B and 13B in terms of VRAM usage and 70B in terms of performance.

Tim Dettmers (the QLoRA creator) released code to run Mixtral 8x7B in 4GB of VRAM, and it still benchmarks better than Llama-2 70B.



