
I want to emphasize how fascinating I find it that going from 16-bit to 4-bit quantization results in negligible performance loss. That's huge. Is the original FP16 representation simply not compressed?

That such coarse quantization is tolerated seems to suggest the "bottleneck" is in some other aspect of the system, and maybe until that is addressed, higher-fidelity quantization does not improve performance.

Or maybe it's the relative values/ratios between weights that matter, and as long as the intended ratios between weights can be expressed, the exact precision of the individual weights may not be important?
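A toy sketch of that intuition (mine, not from the paper below): with symmetric per-tensor int4, every weight is a 4-bit integer times one shared FP scale, so the relative magnitudes of the larger weights survive even though there are only 16 representable levels.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=1024).astype(np.float32)  # toy "FP16" weight vector

    scale = np.abs(w).max() / 7.0                # int4 symmetric range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)      # 16 levels
    w_hat = q * scale                            # dequantized weights

    # Per-weight error is at most scale/2, so weights well above that
    # keep their relative values; only near-zero weights get mangled.
    big = np.abs(w) > 4 * scale
    rel_err = np.abs(w_hat[big] - w[big]) / np.abs(w[big])
    print("worst relative error on large weights:", rel_err.max())
    print("max absolute error:", np.abs(w - w_hat).max(), "<= scale/2 =", scale / 2)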

Found an interesting paper on this below. There's doubtless heavy research underway in this area.

- https://www.researchgate.net/publication/367557918_Understan...




A recent discussion I found on int4; it definitely looks like this is the new hotness. Very exciting!

https://news.ycombinator.com/item?id=34404859


In my understanding, at a very high level and omitting many crucial details, the key is that when you have mainly largish matrix multiplications (as in transformers), well-behaved quantization errors (zero-mean, roughly uncorrelated) tend to cancel out. People do/did experiment with 1- or 2-bit compression of gradients/updates in the context of distributed training, but there it has generally been deemed useful to keep track of the compression errors locally.
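To make the cancellation argument concrete, here's a tiny sketch of my own (illustrative numbers, not from any particular paper): the per-element int4 error is bounded by half the quantization scale, but when k such errors are summed inside a dot product and they are roughly zero-mean and uncorrelated, the total error grows like sqrt(k) rather than k.

    import numpy as np

    rng = np.random.default_rng(0)

    def int4(w):
        # symmetric per-tensor int4: 4-bit integers in [-8, 7] times one FP scale
        s = np.abs(w).max() / 7.0
        return np.clip(np.round(w / s), -8, 7) * s

    for k in (256, 1024, 4096):
        W = rng.normal(0, 0.02, size=(k, 128)).astype(np.float32)
        x = rng.normal(0, 1.0, size=(1, k)).astype(np.float32)
        Wq = int4(W)

        err = x @ Wq - x @ W                      # error that actually accumulates
        worst = np.abs(x) @ np.abs(Wq - W)        # if every error added up coherently
        print(k, np.abs(err).mean(), worst.mean())

The actual error roughly doubles each time k quadruples, while the coherent worst case roughly quadruples; that gap is the cancellation at work. For the distributed-training point, the usual trick (error feedback, as in 1-bit-SGD-style schemes) is to keep the local compression residual and fold it into the next step so the lost information isn't dropped for good. A rough sketch, with made-up names:

    def compress_with_feedback(grad, residual):
        g = grad + residual                       # fold in what was lost last step
        sent = np.sign(g) * np.abs(g).mean()      # crude 1-bit code + one magnitude
        return sent, g - sent                     # transmit `sent`, keep new residual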


Very insightful! Now I'm curious what the bottleneck is.



