For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.

It's annoying that it seems to have its own model cache, but I can live with that.



vLLM doesn't support quantized models at this time, so you need 2x 4090s to run Mixtral.

llama.cpp does support quantized models, so that makes sense; ollama must have picked a quantized model to make it fit?
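
Rough back-of-the-envelope numbers (the ~46.7B parameter count and the ~4.85 bits/weight for q4_K_M are approximations) for why unquantized Mixtral needs more than one 24 GB card while a 4-bit quant almost fits:

    # Rough VRAM estimate for Mixtral 8x7B weights only
    # (ignores KV cache and activation overhead).
    TOTAL_PARAMS = 46.7e9   # Mixtral 8x7B has roughly 46.7B parameters in total
    GIB = 1024 ** 3

    def weight_gib(bits_per_weight: float) -> float:
        """VRAM needed just for the weights at a given precision."""
        return TOTAL_PARAMS * bits_per_weight / 8 / GIB

    print(f"fp16:   {weight_gib(16):.0f} GiB")    # ~87 GiB -> needs several 24 GiB cards
    print(f"q4_K_M: {weight_gib(4.85):.0f} GiB")  # ~26 GiB -> just over a single 24 GiB 4090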


Eh? The docs say vLLM supports both GPTQ and AWQ quantization. Not that it matters now that I'm out of the gate; it just surprised me that it didn't work.
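
For what it's worth, the kind of thing I expected to just work looks roughly like this; the AWQ repo name is only an example, and whether it actually fits in 24 GB alongside the KV cache is exactly the part I couldn't get right:

    from vllm import LLM, SamplingParams

    # Sketch: load an AWQ-quantised Mixtral in vLLM on a single GPU.
    # The model repo and memory settings are assumptions, not a working recipe.
    llm = LLM(
        model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # example AWQ build
        quantization="awq",            # vLLM also accepts "gptq"
        dtype="half",
        gpu_memory_utilization=0.90,   # leave headroom for the KV cache
        max_model_len=4096,            # a shorter context reduces KV-cache memory
    )

    outputs = llm.generate(
        ["Explain mixture-of-experts routing in one paragraph."],
        SamplingParams(temperature=0.7, max_tokens=256),
    )
    print(outputs[0].outputs[0].text)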

I'm currently running nous-hermes2-mixtral:8x7b-dpo-q4_K_M with ollama, and it's offloaded 28 of 33 layers to the GPU with nothing else running on the card. I genuinely don't know whether it's better to go for a harsher quantisation or a smaller base model at this point: it's about 20 tokens per second, but the latency is annoying.
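
In case it's useful for tuning, here's a minimal sketch of poking at the offload from the ollama Python client; I'm assuming num_gpu (layers offloaded) and num_ctx are the right knobs for trading VRAM against speed, and that the model tag is already pulled:

    import ollama  # pip install ollama

    # Sketch: request a specific number of offloaded layers and measure throughput.
    # The exact values that fit on a 24 GB card are something you have to tune.
    resp = ollama.generate(
        model="nous-hermes2-mixtral:8x7b-dpo-q4_K_M",
        prompt="Summarise the trade-off between quantisation level and model size.",
        options={
            "num_gpu": 28,    # layers to offload to the GPU
            "num_ctx": 4096,  # a smaller context window also frees VRAM
        },
    )
    print(resp["response"])
    # eval_count / eval_duration cover the generation phase (duration is in nanoseconds)
    print(resp["eval_count"] / (resp["eval_duration"] / 1e9), "tokens/sec")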



