The announcement says a lot (and has plenty of numbers) but I feel like the most important one is missing: how many GB of GPU memory does this need, quantized and unquantized?
(Searching tells me Llama2-7b unquantized needs close to 15GB; presumably this is similar?)
One parameter at 16 bits is 2 bytes, so a model with 7 billion parameters needs 14GB of RAM for the unquantized model, plus some overhead for the KV cache and other "working memory", but that should be fairly low for a 7B model. I expect it will work on a 16GB GPU just fine.
Quantized ones are also easy. 8 bits == 1 byte so that's 7GB for the model. 4-bit gets you below 4GB.
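For a back-of-the-envelope check, here's a tiny sketch of that arithmetic (weights only, ignoring the KV cache and runtime overhead; the 7B count is approximate):

    # Weight memory for a ~7B-parameter model at different precisions (weights only).
    params = 7e9
    for bits in (16, 8, 4):
        gib = params * bits / 8 / 1024**3
        print(f"{bits}-bit: {gib:.1f} GiB")
    # 16-bit: 13.0 GiB, 8-bit: 6.5 GiB, 4-bit: 3.3 GiB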
We have clarified the documentation, sorry about the confusion! 16GB should be enough but it requires some vLLM cache tweaking that we still need to work on, so we put 24GB to be safe. Other deployment methods and quantized versions can definitely fit on 16GB!
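For what it's worth, here's a minimal sketch of the kind of vLLM tweak that seems to be meant here, shrinking the pre-allocated KV-cache pool so a 7B model fits on a 16GB card. The model id and parameter values are my own assumptions, not an official recipe:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="mistralai/Mistral-7B-v0.1",  # assumed Hugging Face model id
        gpu_memory_utilization=0.85,        # fraction of GPU memory vLLM may claim
        max_model_len=4096,                 # cap context so the cache pool stays small
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)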
Interesting, and that requirement is repeated on the cloud deployment pages (https://docs.mistral.ai/category/cloud-deployment), even the unfinished ones where it's the only requirement listed so far. I wonder if that sliding context window really blows up the RAM usage or something.
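A rough estimate suggests the sliding-window KV cache shouldn't be the culprit on its own. This assumes the config values I believe Mistral 7B uses (32 layers, 8 KV heads, head dim 128, 4096-token window) and an fp16 cache:

    # Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes.
    layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
    window = 4096  # sliding-window length, so the cache stops growing past this
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    print(f"{per_token / 1024:.0f} KiB/token, "
          f"{per_token * window / 1024**3:.2f} GiB per sequence")
    # ~128 KiB/token, ~0.5 GiB per sequence -- small next to the 13-14GB of weights,
    # so the bigger factor is likely vLLM pre-allocating a large cache pool for batching.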
Llama 7B will squeeze onto a 6GB GPU quantized. Maybe even less with EXL2 quantization.
Foundation model trainers don't seem to worry about quantization much; they just throw the base model out there and let the community take care of easing the runtime requirements.