
How much VRAM do you need to run a quantised 7B model?



Rough calculation: typical quantization is 4 bit, so the 7B weights fit in roughly 3.5GB, then my rule of thumb would be 2GB for the activations and attention cache (not usually quantized). So 6 or 8 GB VRAM would probably do it. llama.cpp will let you offload your choice of layers to the GPU, so you could probably get quite a way with 4GB.
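To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch of that estimate (the function name and the fixed ~2GB overhead are my own assumptions, not anything from llama.cpp):

  # Rough VRAM estimate for a quantized model (back-of-the-envelope only).
  def estimate_vram_gb(params_billion: float,
                       bits_per_weight: float = 4.0,
                       overhead_gb: float = 2.0) -> float:
      # Weight memory: params * bits / 8 bytes, expressed in GB.
      weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
      # Add a flat allowance for activations and the attention (KV) cache.
      return weights_gb + overhead_gb

  print(estimate_vram_gb(7))  # ~5.5 GB, so a 6-8 GB card should be comfortable

Real 4-bit formats (e.g. llama.cpp's Q4 variants) use slightly more than 4 bits per weight on average, and the KV cache grows with context length, so treat this as a lower bound rather than a precise figure.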



