The announcement says a lot (and has plenty of numbers) but I feel like the most important one is missing: how many GB of GPU memory does this need, quantized and unquantized?
(Searching tells me Llama2-7b unquantized needs close to 15GB; presumably this is similar?)
One parameter at 16 bits is 2 bytes, so a model with 7 billion parameters needs 14GB of RAM for the unquantized model, plus some overhead for the KV cache and other "working memory", but that should be fairly low for a 7B model. I expect it will work on a 16GB GPU just fine.
Quantized ones are also easy. 8 bits == 1 byte so that's 7GB for the model. 4-bit gets you below 4GB.
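For a back-of-the-envelope check, here's a tiny sketch of that arithmetic (weights only, ignoring the KV cache and runtime overhead; the 7B count is approximate):

    # Weight memory for a ~7B-parameter model at different precisions (weights only).
    params = 7e9
    for bits in (16, 8, 4):
        gib = params * bits / 8 / 1024**3
        print(f"{bits}-bit: {gib:.1f} GiB")
    # 16-bit: 13.0 GiB, 8-bit: 6.5 GiB, 4-bit: 3.3 GiB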
We have clarified the documentation, sorry about the confusion! 16GB should be enough but it requires some vLLM cache tweaking that we still need to work on, so we put 24GB to be safe. Other deployment methods and quantized versions can definitely fit on 16GB!
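For what it's worth, here's a minimal sketch of the kind of vLLM tweak that seems to be meant here, shrinking the pre-allocated KV-cache pool so a 7B model fits on a 16GB card. The model id and parameter values are my own assumptions, not an official recipe:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="mistralai/Mistral-7B-v0.1",  # assumed Hugging Face model id
        gpu_memory_utilization=0.85,        # fraction of GPU memory vLLM may claim
        max_model_len=4096,                 # cap context so the cache pool stays small
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)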
Interesting, and that requirement is repeated on the cloud deployment pages (https://docs.mistral.ai/category/cloud-deployment), even the unfinished ones where it's the only requirement listed so far. I wonder if that sliding context window really blows up the RAM usage or something.
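A rough estimate suggests the sliding-window KV cache shouldn't be the culprit on its own. This assumes the config values I believe Mistral 7B uses (32 layers, 8 KV heads, head dim 128, 4096-token window) and an fp16 cache:

    # Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes.
    layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
    window = 4096  # sliding-window length, so the cache stops growing past this
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    print(f"{per_token / 1024:.0f} KiB/token, "
          f"{per_token * window / 1024**3:.2f} GiB per sequence")
    # ~128 KiB/token, ~0.5 GiB per sequence -- small next to the 13-14GB of weights,
    # so the bigger factor is likely vLLM pre-allocating a large cache pool for batching.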
Llama 7B will squeeze onto a 6GB GPU quantized. Maybe even less with EXL2 quantization.
Foundation model trainers don't seem to worry about quantization much; they just throw the base model out there and let the community take care of easing the runtime requirements.