
I believe the performance loss is because this is RTN quantization. If you use the "4chan version," which is 4-bit GPTQ, the loss from quantization should be very small.
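For context: RTN ("round to nearest") quantizes each weight independently by rounding it to the nearest point on the integer grid, with no calibration data, whereas GPTQ uses second-order information from calibration samples to compensate for rounding error. A minimal NumPy sketch of per-tensor RTN (function names are mine, not from llama.cpp):

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Round-to-nearest (RTN): scale weights onto a signed integer
    grid and round each value independently of the others."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4-bit
    scale = np.max(np.abs(w)) / qmax     # one scale per tensor (simplest variant)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, s = rtn_quantize(w)
w_hat = dequantize(q, s)   # reconstruction error is at most scale / 2
```

Real implementations usually quantize in small groups (e.g. per 32 weights) rather than per tensor, which is where much of the quality difference between naive RTN and methods like GPTQ comes from.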



What's the 4chan version?


See https://github.com/ggerganov/llama.cpp/issues/62 (the related repo was just originally posted on 4chan; the code itself is on GitHub).




