
How much VRAM do you need to run a quantised 7B model?



Rough calculation: typical quantization is 4 bit, so the 7B weights fit in roughly 3.5GB, then my rule of thumb would be 2GB for the activations and attention cache (not usually quantized). So 6 or 8 GB VRAM would probably do it. llama.cpp will let you offload your choice of layers to the GPU, so you could probably get quite a way with 4GB.
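To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch of that estimate (the function name and the fixed ~2GB overhead are my own assumptions, not anything from llama.cpp):

  # Rough VRAM estimate for a quantized model (back-of-the-envelope only).
  def estimate_vram_gb(params_billion: float,
                       bits_per_weight: float = 4.0,
                       overhead_gb: float = 2.0) -> float:
      # Weight memory: params * bits / 8 bytes, expressed in GB.
      weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
      # Add a flat allowance for activations and the attention (KV) cache.
      return weights_gb + overhead_gb

  print(estimate_vram_gb(7))  # ~5.5 GB, so a 6-8 GB card should be comfortable

Real 4-bit formats (e.g. llama.cpp's Q4 variants) use slightly more than 4 bits per weight on average, and the KV cache grows with context length, so treat this as a lower bound rather than a precise figure.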



