
Yes, it increases compute usage, but your 5090 has a hell of a lot of compute and the decompression algorithms are pretty simple. Memory bandwidth is the bottleneck here, and unless you have a strange GPU with lots of fast memory but very weak compute, a quantized model should always run faster.
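For intuition, here's a rough back-of-envelope sketch in Python. The bandwidth figure and bytes-per-weight numbers are assumptions for illustration, not measurements:

    # Back-of-envelope: single-stream token generation is memory-bandwidth
    # bound, since roughly every weight is read once per generated token.
    # So the throughput ceiling is approximately bandwidth / model size.
    # All numbers below are assumptions for illustration, not benchmarks.

    BANDWIDTH_GB_S = 1792   # RTX 5090 spec-sheet memory bandwidth (GB/s)
    PARAMS_B = 8            # hypothetical 8B-parameter model

    # approximate bytes per weight for FP16 and two common GGUF quants
    for name, bytes_per_w in [("FP16", 2.00), ("Q8_0", 1.06), ("Q4_K_M", 0.61)]:
        weights_gb = PARAMS_B * bytes_per_w
        ceiling = BANDWIDTH_GB_S / weights_gb
        print(f"{name:7s} ~{weights_gb:4.1f} GB -> <= {ceiling:4.0f} tok/s ceiling")

Halving the bytes per weight roughly doubles the generation-speed ceiling, which is why a Q4 model comes out ahead of FP16 even though it spends extra compute on dequantization.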

If you're using llama.cpp, run the benchmark in the link I posted earlier and see what you get; I think there's something like it for vLLM as well.
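For reference, the llama.cpp tool is llama-bench. A minimal invocation sketch, assuming a local GGUF file (your path and sizes will differ):

    # reports prompt-processing (pp) and token-generation (tg) speeds
    ./llama-bench -m ./model-q4_k_m.gguf -p 512 -n 128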


