
From the Mistral docs, it seems they need 24GB, which is kind of odd?

https://docs.mistral.ai/llm/mistral-v0.1




We have clarified the documentation, sorry about the confusion! 16GB should be enough but it requires some vLLM cache tweaking that we still need to work on, so we put 24GB to be safe. Other deployment methods and quantized versions can definitely fit on 16GB!
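For reference, the fp16 weights of a ~7B-parameter model alone are roughly 14.5 GB (about 7.2B params × 2 bytes), so on a 16GB card the KV cache has to fit in what little is left. A minimal sketch of the kind of tweaking that implies, using vLLM's standard knobs (the repo id and the specific values here are illustrative, not settings from the Mistral docs):

```python
from vllm import LLM, SamplingParams

# fp16 weights of a 7B model take ~14.5 GB, so on a 16 GB GPU the KV-cache
# budget is tight. gpu_memory_utilization and max_model_len bound how much
# memory vLLM pre-allocates for its paged KV cache.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # assumed Hugging Face repo id
    dtype="float16",
    gpu_memory_utilization=0.95,        # illustrative value, not an official recommendation
    max_model_len=8192,                 # cap the context to shrink the cache reservation
)

params = SamplingParams(temperature=0.7, max_tokens=256)
print(llm.generate(["Why is the sky blue?"], params)[0].outputs[0].text)
```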


Shouldn't it be much less than 16GB with vLLM's 4-bit AWQ? Probably consumer GPU-ish depending on the batch size?
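For what it's worth, vLLM does expose a `quantization="awq"` option, and 4-bit AWQ weights for a 7B model come to roughly 4 GB, which leaves plenty of headroom on a 12-16GB consumer GPU. A rough sketch (the checkpoint name is a placeholder for whichever AWQ-quantized Mistral weights you use):

```python
from vllm import LLM, SamplingParams

# 4-bit AWQ weights for a 7B model are roughly 4 GB, leaving most of a
# consumer GPU's memory free for the KV cache.
llm = LLM(
    model="someuser/Mistral-7B-v0.1-AWQ",  # placeholder: any AWQ-quantized checkpoint
    quantization="awq",
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```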


Interesting, and that requirement is repeated on the cloud deployment pages, even the unfinished ones, where it is the only requirement listed so far (https://docs.mistral.ai/category/cloud-deployment). I wonder if that sliding context window really blows up the RAM usage or something.


Unless I've misunderstood something, the sliding context window should decrease memory usage at inference compared to regular full-context attention.
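Back-of-the-envelope math supports that. Using the published Mistral 7B config (32 layers, 8 KV heads from grouped-query attention, head dim 128, 4096-token sliding window), the KV cache per sequence is bounded by the window rather than the full context:

```python
# KV-cache size for Mistral 7B (v0.1 config: 32 layers, 8 KV heads,
# head_dim 128, sliding_window 4096), fp16 = 2 bytes per element.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
sliding_window = 4096

# Each cached token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(bytes_per_token / 1024, "KiB per token")          # 128.0 KiB

# With the sliding window, the cache per sequence is capped at the window...
print(sliding_window * bytes_per_token / 2**30, "GiB")  # 0.5 GiB

# ...whereas full attention over, say, a 32k context would need 8x that.
print(32768 * bytes_per_token / 2**30, "GiB")           # 4.0 GiB
```

So the window caps the per-sequence cache at about 0.5 GiB; the catch is that vLLM pre-allocates its cache pool up front, which is presumably the cache tweaking the Mistral reply above refers to.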



