Still 7B, but now with 32k context. Looking forward to seeing how it compares with the previous one, and what the community does with it.


Not 7B, 8x7B.

It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).
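Back-of-the-envelope, 4-bit weights take roughly half a byte each. A quick sketch (the ~47B total for the 8x7B mixture is an assumption based on shared attention plus eight expert FFNs, not an official figure):

    # Rough memory estimate for 4-bit quantized weights (~0.5 bytes per parameter).
    # Parameter counts are assumptions, not official figures.
    def gib_at_4bit(params_billions: float) -> float:
        return params_billions * 1e9 * 0.5 / 2**30

    print(f"Mistral 7B (dense):  {gib_at_4bit(7):.1f} GiB")   # ~3.3 GiB
    print(f"Mixtral 8x7B (MoE):  {gib_at_4bit(47):.1f} GiB")  # ~21.9 GiB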


Given the config parameters posted, it's 2 experts per token, so the computation cost per token should be the cost of the component that selects experts + 2× the cost of a 7B model.
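A minimal sketch of what top-2 routing looks like (illustrative only, not Mistral's implementation; `router` and `experts` are placeholder modules): the gate scores every expert, but only the two highest-scoring FFNs actually run for each token.

    import torch
    import torch.nn.functional as F

    def top2_moe_layer(x, router, experts):
        """x: (tokens, dim); router: nn.Linear(dim, n_experts); experts: list of FFN modules.
        Only the two highest-scoring experts are evaluated per token."""
        logits = router(x)                               # (tokens, n_experts)
        scores, idx = torch.topk(logits, k=2, dim=-1)    # keep 2 experts per token
        weights = F.softmax(scores, dim=-1)              # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(2):                            # the two chosen slots
            for e, expert in enumerate(experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out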


Ah, good catch. Upon even closer examination, the attention layers (~2B params) are shared across experts. So in theory you would need 2B for attention + 5B for each of the two active experts in RAM.

That's a total of 12B, meaning this should be runnable on the same hardware as 13B models, with some loading time between generations.
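Spelling out that arithmetic (the per-component sizes are the thread's rough estimates, not official numbers):

    # Rough active vs. resident parameter counts, in billions.
    attention_shared = 2        # shared across all experts (thread's estimate)
    expert_ffn       = 5        # per expert (thread's estimate)
    n_experts        = 8
    experts_per_tok  = 2

    active_params   = attention_shared + experts_per_tok * expert_ffn  # 12B used per token
    resident_params = attention_shared + n_experts * expert_ffn        # ~42B kept in memory
    print(active_params, resident_params)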


Yes, but I also care about "can I load this onto my home GPU?" where, if I need all experts for this to run, the answer is "no".


The answer is yes if you have a 24GB GPU. Just wait for 4bit quantization.

Or watch Tim Dettmers, who is releasing code to run Mixtral 8x7b in just 4GB of RAM.
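For the 24GB single-GPU route, a rough sketch of 4-bit loading through transformers + bitsandbytes (the model id is assumed to be the Hugging Face release name; exact API details may differ):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mixtral-8x7B-v0.1"  # assumed Hugging Face repo name

    # NF4 4-bit quantization via bitsandbytes; weights should fit in ~24GB of VRAM.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )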


We can't infer the actual context size from the config.

Mistral 7B is basically an 8K model, but was marketed as a 32K one.
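For what it's worth, the advertised numbers are easy to read off the released config; whether they reflect usable context is exactly the open question. A quick check (field names follow the Hugging Face config format; the Mistral 7B values in the comments are from memory, treat them as assumptions):

    from transformers import AutoConfig

    # Reads the advertised context fields from the published config.
    cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
    print(cfg.max_position_embeddings)  # 32768 advertised context
    print(cfg.sliding_window)           # 4096 per-layer attention window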


Unfortunately it's too big for the broader community to test. It will be very interesting to see how well it performs compared to the large models.


Not really, looks like a ~40B class model which is very runnable.


It's actually ~13B class at runtime. The ~2B of attention parameters is shared across all experts, and it runs 2 experts at a time.

So 2B for attention + 5Bx2 for inference = 12B in RAM at runtime.


Yeah. I just mean in terms of VRAM usage.


Yes, that's what I mean as well.

It's between 7B and 13B in terms of VRAM usage and 70B in terms of performance.

Tim Dettmers (the QLoRA creator) released code to run Mixtral 8x7B in 4GB of VRAM, and it still benchmarks better than Llama-2 70B.



