
The announcement says a lot (and has plenty of numbers) but I feel like the most important one is missing: how many GB of GPU memory does this need, quantized and unquantized?

(Searching tells me Llama2-7b unquantized needs close to 15GB; presumably this is similar?)




One parameter is 16 bits == 2 bytes. So a model with 7 billion parameters needs 14GB of RAM for the unquantized model, plus some overhead for the KV cache and other "working memory" stuff, but that should be fairly low for a 7B model. I expect it will work on a 16GB GPU just fine.

Quantized ones are also easy. 8 bits == 1 byte, so that's 7GB for the model. 4-bit gets you below 4GB.
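As a back-of-the-envelope check (a sketch in Python; the byte math is exact, the weights-only caveat matters):

    # Rough VRAM needed for the weights alone at various precisions.
    # KV cache and activation overhead come on top of this.
    def weight_gb(params_billion: float, bits_per_param: int) -> float:
        return params_billion * 1e9 * bits_per_param / 8 / 1e9

    for bits in (16, 8, 4):
        print(f"7B @ {bits}-bit: {weight_gb(7, bits):.1f} GB")
    # 7B @ 16-bit: 14.0 GB
    # 7B @ 8-bit: 7.0 GB
    # 7B @ 4-bit: 3.5 GB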


From the Mistral docs, it seems they need 24GB, which is kind of odd?

https://docs.mistral.ai/llm/mistral-v0.1


We have clarified the documentation; sorry about the confusion! 16GB should be enough, but it requires some vLLM cache tweaking that we still need to work on, so we put 24GB to be safe. Other deployment methods and quantized versions can definitely fit on 16GB!
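For context, the knobs involved are presumably vLLM's memory-pool settings; a minimal sketch (the specific values are illustrative, not the official recipe):

    from vllm import LLM, SamplingParams

    # vLLM pre-allocates a KV-cache pool up front; capping its share of
    # VRAM and the max sequence length is how you squeeze onto 16GB.
    llm = LLM(
        model="mistralai/Mistral-7B-v0.1",
        gpu_memory_utilization=0.85,  # fraction of VRAM vLLM may claim
        max_model_len=8192,           # shrinks the pre-allocated KV cache
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)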


Shouldn't it be much less than 16GB with vLLM's 4-bit AWQ? Probably consumer GPU-ish depending on the batch size?
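Loading a 4-bit AWQ build in vLLM looks roughly like this (a sketch; the quantized repo name is a community build and only an example):

    from vllm import LLM

    # ~3.5GB of 4-bit weights plus the KV-cache pool should fit
    # comfortably on a consumer card, batch size permitting.
    llm = LLM(
        model="TheBloke/Mistral-7B-v0.1-AWQ",  # example community quant
        quantization="awq",
        max_model_len=4096,
    )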


Interesting, and that requirement is repeated on the cloud deployment pages, even the unfinished ones where that is the only requirement listed so far. https://docs.mistral.ai/category/cloud-deployment I wonder if that sliding context window really blows up the RAM usage or something.


Unless I've misunderstood something, the sliding context window should decrease memory usage at inference compared to normal flash attention.
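To put numbers on that, using Mistral-7B's published config (32 layers, 8 KV heads via grouped-query attention, head dim 128, 4096-token window) and assuming an fp16 cache:

    LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

    def kv_cache_gib(tokens: int) -> float:
        # factor of 2 for the K and V tensors per layer
        return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * tokens / 2**30

    print(f"32k tokens, full attention: {kv_cache_gib(32768):.1f} GiB")  # 4.0
    print(f"4096-token sliding window:  {kv_cache_gib(4096):.1f} GiB")   # 0.5

So the cache is bounded by the window size rather than growing with the full sequence.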


It's not so straightforward, as there's some overhead aside from the weights, especially with 7B at ~4-bit.

But this is probably capable of squeezing onto a 6GB (or less?) GPU with the right backend.
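With the llama-cpp-python bindings, for instance, partial GPU offload makes the fit tunable (a sketch; the GGUF filename and layer count are illustrative):

    from llama_cpp import Llama

    # Offload as many layers as fit in ~6GB of VRAM; the rest run on CPU.
    llm = Llama(
        model_path="mistral-7b-v0.1.Q4_K_M.gguf",  # hypothetical local file
        n_gpu_layers=28,  # tune down if you run out of VRAM
        n_ctx=4096,
    )
    print(llm("The capital of France is", max_tokens=16)["choices"][0]["text"])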


Llama 7B will squeeze onto a 6GB GPU when quantized. Maybe even less with EXL2 quantization.

Foundation model trainers don't seem to worry about quantization much; they just throw the base model out there and let the community take care of easing the runtime requirements.



