SVDQuant+NVFP4: 4× Smaller, 3× Faster FLUX with 16-bit Quality on Blackwell GPUs (hanlab.mit.edu)
48 points by lmxyy 14 hours ago | 8 comments
SVDQuant supports NVFP4 on NVIDIA Blackwell GPUs, with a 3× speedup over BF16 and better image quality than INT4. Try our interactive demo below or at https://svdquant.mit.edu/! All our code is available at https://github.com/mit-han-lab/nunchaku!

I assume they've messed up the prompt caption for the squirrel-looking creature?

Interesting to see how poor the prompt adherence is in these examples. The cyanobacteria one is just "an image of the ocean". The skincare one completely ignores 50% of the ingredients in the prompt and makes coffee beans the size and shape of almonds.


I thought I'd already seen this in the previous discussion three months ago (https://news.ycombinator.com/item?id=42093112), but that one used INT4 quantization, so NVFP4 is a further improvement on that. Sweet!

If I found the correct docs (https://docs.nvidia.com/deeplearning/cudnn/frontend/latest/o...), NVFP4 means groups of 16 4-bit floating-point values (1 sign bit, 2 exponent bits, 1 mantissa bit) each share one 8-bit floating-point scaling factor (1 sign bit, 4 exponent bits, 3 mantissa bits), so strictly speaking it's 4.5 bits per value.
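
To make the arithmetic concrete, here's a minimal numpy sketch of that layout as I read it (my illustration, not NVIDIA's or nunchaku's code; I also skip rounding the scale itself to E4M3):

    import numpy as np

    # Non-negative values representable in FP4 E2M1 (sign is a separate bit)
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def quantize_block(block):
        """Quantize 16 floats to FP4 codes plus one shared scale."""
        scale = np.abs(block).max() / FP4_GRID.max()  # map the absmax to 6.0
        # Round each scaled magnitude to the nearest E2M1 grid point
        idx = np.abs(np.abs(block)[:, None] / scale - FP4_GRID).argmin(axis=1)
        return scale, np.sign(block) * FP4_GRID[idx]

    rng = np.random.default_rng(0)
    block = rng.normal(size=16)
    scale, codes = quantize_block(block)
    print("max abs error:", np.abs(block - scale * codes).max())
    print("bits per value:", (16 * 4 + 8) / 16)  # 72 bits / 16 values = 4.5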

This grouped scaling immediately makes me wonder whether the quantization error could be reduced further by permuting the matrix so that values of similar magnitude are quantized together.
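
As a toy check of that intuition (my own numbers, nothing from the paper): quantize a heavy-tailed vector in groups of 16, once in the original order and once sorted by magnitude so similar values share a scale, with a plain absmax int4 quantizer standing in for FP4:

    import numpy as np

    def group_quant_relerr(x, group=16):
        x = x.reshape(-1, group)
        scale = np.abs(x).max(axis=1, keepdims=True) / 7  # absmax -> [-7, 7]
        q = np.clip(np.round(x / scale), -7, 7)
        return np.linalg.norm(x - q * scale) / np.linalg.norm(x)

    rng = np.random.default_rng(0)
    w = rng.normal(size=4096) * np.exp(rng.normal(size=4096))  # heavy tails
    perm = np.argsort(np.abs(w))  # group similar magnitudes together

    print("relative error, original order:", group_quant_relerr(w))
    print("relative error, sorted order:  ", group_quant_relerr(w[perm]))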


I think so. There are already techniques called rotations that have a similar effect, but they incur additional overhead in diffusion models.
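
For context, here's roughly what I understand rotation methods (QuaRot and friends) to do, as a toy numpy sketch rather than anything from nunchaku: an orthogonal rotation spreads outlier channels across the whole matrix before quantization, and it folds into the adjacent layer for free when that layer is linear; the overhead shows up when it can't be folded and must be applied at runtime:

    import numpy as np

    def group_quant_relerr(w, group=16):
        w = w.reshape(-1, group)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7  # absmax int4 stand-in
        q = np.clip(np.round(w / scale), -7, 7)
        return np.linalg.norm(w - q * scale) / np.linalg.norm(w)

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(64, 64))
    W1[:, rng.choice(64, 4, replace=False)] *= 25  # a few outlier channels
    W2 = rng.normal(size=(64, 64))

    Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
    print("relative error, raw:    ", group_quant_relerr(W1))
    print("relative error, rotated:", group_quant_relerr(W1 @ Q))

    # The rewrite is exact when the next op is linear: Q cancels with Q.T
    x = rng.normal(size=(1, 64))
    print(np.allclose(x @ W1 @ W2, x @ (W1 @ Q) @ (Q.T @ W2)))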

Permuting entire columns at once should have zero overhead as long as you permute the rows of the next matrix to match. But since each entry of a column participates in a different scaling group, I'd guess swapping two columns reduces quantization error for some groups while increasing it for others, making a significant overall improvement unlikely this way.
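
The zero-overhead part is easy to sanity-check in numpy (toy shapes of my choosing):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 64))     # activations
    W1 = rng.normal(size=(64, 128))  # first linear layer
    W2 = rng.normal(size=(128, 32))  # following linear layer
    perm = rng.permutation(128)

    out_ref = x @ W1 @ W2
    out_perm = (x @ W1[:, perm]) @ W2[perm, :]
    print(np.allclose(out_ref, out_perm))  # True: the permutation cancels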

FLUX-schnell runs in only 800 ms on an RTX 5090.

This is amazing

Now release the LoRA conversion code you promised months ago…
