SVDQuant+NVFP4: 4× Smaller, 3× Faster FLUX with 16-bit Quality on Blackwell GPUs (hanlab.mit.edu)
48 points by lmxyy 14 hours ago | 8 comments
SVDQuant supports NVFP4 on NVIDIA Blackwell GPUs, with a 3× speedup over BF16 and better image quality than INT4. Try our interactive demo below or at https://svdquant.mit.edu/! All our code is available at https://github.com/mit-han-lab/nunchaku!

I assume they've messed up the prompt caption for the squirrel-looking creature?

Interesting to see how poor the prompt adherence is in these examples. The cyanobacteria one is just "an image of the ocean". The skincare one completely ignores 50% of the ingredients in the prompt and makes coffee beans the size and shape of almonds.


I thought I'd already seen this in the previous discussion three months ago (https://news.ycombinator.com/item?id=42093112), but that one used INT4 quantization, so NVFP4 is a further improvement on that. Sweet!

If I found the correct docs (https://docs.nvidia.com/deeplearning/cudnn/frontend/latest/o...), NVFP4 means groups of 16 4-bit floating-point values (1 sign bit, 2 exponent bits, 1 mantissa bit) each share one 8-bit floating-point scaling factor (1 sign bit, 4 exponent bits, 3 mantissa bits), so strictly speaking it's 4.5 bits per value.
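
To make the arithmetic concrete, here's a minimal numpy sketch of that layout as I read it (my illustration, not NVIDIA's or nunchaku's code; I also skip rounding the scale itself to E4M3):

    import numpy as np

    # Non-negative values representable in FP4 E2M1 (sign is a separate bit)
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def quantize_block(block):
        """Quantize 16 floats to FP4 codes plus one shared scale."""
        scale = np.abs(block).max() / FP4_GRID.max()  # map the absmax to 6.0
        # Round each scaled magnitude to the nearest E2M1 grid point
        idx = np.abs(np.abs(block)[:, None] / scale - FP4_GRID).argmin(axis=1)
        return scale, np.sign(block) * FP4_GRID[idx]

    rng = np.random.default_rng(0)
    block = rng.normal(size=16)
    scale, codes = quantize_block(block)
    print("max abs error:", np.abs(block - scale * codes).max())
    print("bits per value:", (16 * 4 + 8) / 16)  # 72 bits / 16 values = 4.5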

This grouped scaling immediately makes me wonder whether the quantization error could be reduced further by permuting the matrix so that values of similar magnitude are quantized together.
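
As a toy check of that intuition (my own numbers, nothing from the paper): quantize a heavy-tailed vector in groups of 16, once in the original order and once sorted by magnitude so similar values share a scale, with a plain absmax int4 quantizer standing in for FP4:

    import numpy as np

    def group_quant_relerr(x, group=16):
        x = x.reshape(-1, group)
        scale = np.abs(x).max(axis=1, keepdims=True) / 7  # absmax -> [-7, 7]
        q = np.clip(np.round(x / scale), -7, 7)
        return np.linalg.norm(x - q * scale) / np.linalg.norm(x)

    rng = np.random.default_rng(0)
    w = rng.normal(size=4096) * np.exp(rng.normal(size=4096))  # heavy tails
    perm = np.argsort(np.abs(w))  # group similar magnitudes together

    print("relative error, original order:", group_quant_relerr(w))
    print("relative error, sorted order:  ", group_quant_relerr(w[perm]))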


I think so. There are already techniques called rotations that have a similar effect, but they incur additional overhead in diffusion models.
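
For context, here's roughly what I understand rotation methods (QuaRot and friends) to do, as a toy numpy sketch rather than anything from nunchaku: an orthogonal rotation spreads outlier channels across the whole matrix before quantization, and it folds into the adjacent layer for free when that layer is linear; the overhead shows up when it can't be folded and must be applied at runtime:

    import numpy as np

    def group_quant_relerr(w, group=16):
        w = w.reshape(-1, group)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7  # absmax int4 stand-in
        q = np.clip(np.round(w / scale), -7, 7)
        return np.linalg.norm(w - q * scale) / np.linalg.norm(w)

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(64, 64))
    W1[:, rng.choice(64, 4, replace=False)] *= 25  # a few outlier channels
    W2 = rng.normal(size=(64, 64))

    Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
    print("relative error, raw:    ", group_quant_relerr(W1))
    print("relative error, rotated:", group_quant_relerr(W1 @ Q))

    # The rewrite is exact when the next op is linear: Q cancels with Q.T
    x = rng.normal(size=(1, 64))
    print(np.allclose(x @ W1 @ W2, x @ (W1 @ Q) @ (Q.T @ W2)))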

Permuting entire columns at once should have zero overhead as long as you permute the rows of the next matrix to match. But since each entry of a column participates in a different scaling group, I'd guess swapping two columns reduces quantization error for some groups while increasing it for others, making a significant overall improvement unlikely this way.
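
The zero-overhead part is easy to sanity-check in numpy (toy shapes of my choosing):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 64))     # activations
    W1 = rng.normal(size=(64, 128))  # first linear layer
    W2 = rng.normal(size=(128, 32))  # following linear layer
    perm = rng.permutation(128)

    out_ref = x @ W1 @ W2
    out_perm = (x @ W1[:, perm]) @ W2[perm, :]
    print(np.allclose(out_ref, out_perm))  # True: the permutation cancels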

FLUX-schnell runs in only 800 ms on an RTX 5090.

This is amazing

Now release the LoRA conversion code you promised months ago…
