Hacker News

> 4 bit is absolutely fine.

For larger models.

For smaller models, roughly 12B parameters and below, there is a very noticeable degradation.

At least that's my experience generating answers to the same questions across several local models like Llama 3.2, Granite 3.1, Gemma 2, etc., and comparing Q4 against Q8 for each.

The smaller Q4 variants can still be quite useful, but they consistently struggle more, especially with prompt adherence and recall.

For example, if you tell a model to generate some code without explaining it, a smaller Q4 variant is significantly more likely to explain the code anyway, compared to Q8 or better.
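A crude way to score that kind of prompt adherence across variants is to strip fenced code blocks from each reply and check whether any prose remains. A minimal sketch, assuming markdown-fenced replies; the helper name and the sample replies are invented for illustration:

```python
import re

def explains_anyway(response: str) -> bool:
    """True if a 'code only' reply still contains prose outside
    fenced code blocks, i.e. the model explained the code anyway."""
    # Remove everything inside ``` ... ``` fences.
    prose = re.sub(r"```.*?```", "", response, flags=re.DOTALL)
    # Any leftover line with several words counts as explanation.
    return any(len(line.split()) >= 4 for line in prose.splitlines())

# Hypothetical replies to "write hello world, code only":
q8_reply = "```python\nprint('hi')\n```"
q4_reply = ("Here is the code you asked for:\n"
            "```python\nprint('hi')\n```\n"
            "This simply prints hi.")

assert not explains_anyway(q8_reply)  # stuck to the instruction
assert explains_anyway(q4_reply)      # explained regardless
```

Run over a batch of prompts per model, the fraction of `True` results gives a rough adherence-failure rate to compare Q4 against Q8.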

