
From the Mistral docs, it seems they need 24GB, which is kind of odd?

https://docs.mistral.ai/llm/mistral-v0.1




We have clarified the documentation, sorry about the confusion! 16GB should be enough but it requires some vLLM cache tweaking that we still need to work on, so we put 24GB to be safe. Other deployment methods and quantized versions can definitely fit on 16GB!
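For reference, the fp16 weights of a ~7B-parameter model alone are roughly 14.5 GB (about 7.2B params × 2 bytes), so on a 16GB card the KV cache has to fit in what little is left. A minimal sketch of the kind of tweaking that implies, using vLLM's standard knobs (the repo id and the specific values here are illustrative, not settings from the Mistral docs):

```python
from vllm import LLM, SamplingParams

# fp16 weights of a 7B model take ~14.5 GB, so on a 16 GB GPU the KV-cache
# budget is tight. gpu_memory_utilization and max_model_len bound how much
# memory vLLM pre-allocates for its paged KV cache.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # assumed Hugging Face repo id
    dtype="float16",
    gpu_memory_utilization=0.95,        # illustrative value, not an official recommendation
    max_model_len=8192,                 # cap the context to shrink the cache reservation
)

params = SamplingParams(temperature=0.7, max_tokens=256)
print(llm.generate(["Why is the sky blue?"], params)[0].outputs[0].text)
```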


Shouldn't it be much less than 16GB with vLLM's 4-bit AWQ? Probably consumer GPU-ish depending on the batch size?
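For what it's worth, vLLM does expose a `quantization="awq"` option, and 4-bit AWQ weights for a 7B model come to roughly 4 GB, which leaves plenty of headroom on a 12-16GB consumer GPU. A rough sketch (the checkpoint name is a placeholder for whichever AWQ-quantized Mistral weights you use):

```python
from vllm import LLM, SamplingParams

# 4-bit AWQ weights for a 7B model are roughly 4 GB, leaving most of a
# consumer GPU's memory free for the KV cache.
llm = LLM(
    model="someuser/Mistral-7B-v0.1-AWQ",  # placeholder: any AWQ-quantized checkpoint
    quantization="awq",
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```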


Interesting, and that requirement is repeated on the cloud deployment pages, even the unfinished ones, where it is the only requirement listed so far (https://docs.mistral.ai/category/cloud-deployment). I wonder if that sliding context window really blows up the RAM usage or something.


Unless I've misunderstood something, the sliding context window should decrease memory usage at inference compared to regular full-context attention.
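Back-of-the-envelope math supports that. Using the published Mistral 7B config (32 layers, 8 KV heads from grouped-query attention, head dim 128, 4096-token sliding window), the KV cache per sequence is bounded by the window rather than the full context:

```python
# KV-cache size for Mistral 7B (v0.1 config: 32 layers, 8 KV heads,
# head_dim 128, sliding_window 4096), fp16 = 2 bytes per element.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
sliding_window = 4096

# Each cached token stores one K and one V vector per layer per KV head.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(bytes_per_token / 1024, "KiB per token")          # 128.0 KiB

# With the sliding window, the cache per sequence is capped at the window...
print(sliding_window * bytes_per_token / 2**30, "GiB")  # 0.5 GiB

# ...whereas full attention over, say, a 32k context would need 8x that.
print(32768 * bytes_per_token / 2**30, "GiB")           # 4.0 GiB
```

So the window caps the per-sequence cache at about 0.5 GiB; the catch is that vLLM pre-allocates its cache pool up front, which is presumably the cache tweaking the Mistral reply above refers to.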



