If you are interested in optimizing parallel decompression and you happen to have a suitable NVIDIA GPU, GDeflate [1] is interesting. The target market for this is PC games using DirectStorage to quickly load game assets. The graph in [1] shows DirectStorage maxing out the throughput of a PCIe Gen 3 drive at about 3 GiB/s when compression is not used. When GPU GDeflate is used, the effective rate hits 12 GiB/s.
If you have suitable hardware running Windows, you can try this out for yourself using Microsoft's DirectStorage GPU decompression benchmark [2].
A reference implementation of a single-threaded compressor and a multi-threaded (CPU) decompressor can be found at [3]. It is Apache-2.0 licensed.
I assume this is for decompressing multiple independent deflate streams in parallel?
What's the throughput if you only have a single stream? I realise this is the unhappy case for GPU acceleration, hence my question! (I've been thinking about some approaches to parallelizing decompression of a single stream; it's not easy.)
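For contrast with the single-stream case, the multi-stream case is straightforward: when each stream is an independent deflate stream, workers need no coordination at all. A minimal sketch (the chunking scheme and function names here are illustrative, not GDeflate's actual format; zlib releases the GIL during decompression, so plain threads scale):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(data: bytes, chunk_size: int = 64 * 1024) -> list:
    """Compress each fixed-size chunk as an independent raw deflate stream."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    out = []
    for chunk in chunks:
        c = zlib.compressobj(wbits=-15)  # negative wbits = raw deflate, no header
        out.append(c.compress(chunk) + c.flush())
    return out


def parallel_decompress(streams: list, workers: int = 8) -> bytes:
    """Decode independent streams concurrently; order is preserved by map()."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda s: zlib.decompress(s, wbits=-15), streams)
    return b"".join(parts)


if __name__ == "__main__":
    original = b"highly compressible " * 100_000
    streams = compress_chunks(original)
    assert parallel_decompress(streams) == original
```

The trade-off is the usual one: chunking resets the LZ window at every boundary, so compression ratio drops slightly in exchange for embarrassingly parallel decompression.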
The GPU decompression benchmark I linked earlier allows you to specify a single file that it will compress with GDeflate (and with zlib for comparison). The numbers in the docs that come with the benchmark, and those presented elsewhere, are consistent with my own runs using a highly compressible source file.
Part of the trick to achieving this speedup is reading the data fast enough. I don't know of any NVMe drive that can reach full speed at a queue depth of 1. While running the benchmark in a Windows VM with a GPU passed through, I observed on the Linux host that the average read size was about 512 KiB and the queue depth was sometimes over 30.
I consider using an index to be "cheating" - or rather, my intended use-case is decompression of a stream that you've never seen before, which was generated by a "dumb" compressor.
That said, the approach I intend to take is similar. The idea is that one thread is dedicated to looking ahead, parsing as fast as it can (or even jumping far ahead and using heuristics to re-sync the parse state; there will be false positives, but you can verify them later), building an index without actually decompressing, while secondary threads are spawned to decompress from the identified block start points. The hard part is dealing with LZ back-references to data that hasn't yet been decompressed. Worst-case performance will be abysmal, but I think that on most real-world data you'll beat a serial decompressor if you can throw enough threads at it.
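The division of labor described above can be sketched as a producer/consumer pipeline: one look-ahead thread publishes block start points to a queue while worker threads decode from them. This is only the skeleton; the genuinely hard parts (bit-level re-sync heuristics, verifying false positives, and cross-block LZ back-references) are not modeled. Each "block" here is an independent raw deflate stream and the block table is precomputed, standing in for the fast parse; all names are illustrative:

```python
import queue
import threading
import zlib


def make_blob(chunks):
    """Concatenate independently deflated chunks, recording (offset, size)."""
    blob, table = b"", []
    for chunk in chunks:
        c = zlib.compressobj(wbits=-15)
        raw = c.compress(chunk) + c.flush()
        table.append((len(blob), len(raw)))
        blob += raw
    return blob, table


def index_thread(blob, block_table, work_q):
    # Stand-in for the look-ahead parser: the real scheme would walk the
    # bit stream (or jump ahead and verify candidate block starts) instead
    # of reading a known table.
    for idx, (off, size) in enumerate(block_table):
        work_q.put((idx, blob[off:off + size]))
    work_q.put(None)  # sentinel: indexing finished


def decode_worker(work_q, results):
    while True:
        item = work_q.get()
        if item is None:
            work_q.put(None)  # re-post sentinel so other workers also exit
            break
        idx, raw = item
        results[idx] = zlib.decompress(raw, wbits=-15)


def parallel_decode(blob, block_table, workers=4):
    work_q = queue.Queue()
    results = [None] * len(block_table)
    threads = [threading.Thread(target=decode_worker, args=(work_q, results))
               for _ in range(workers)]
    threads.append(threading.Thread(target=index_thread,
                                    args=(blob, block_table, work_q)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(results)
```

Because each block is self-contained here, workers can write their results independently; handling a back-reference into a not-yet-decoded block would require stalling or re-decoding, which is exactly the worst case mentioned above.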
There is also this: https://github.com/mxmlnkn/pragzip. I did some benchmarks on some really beefy 128-core machines and was able to reach almost 20 GB/s of decompression bandwidth.
Interesting. It looks like https://github.com/zrajna/zindex became public about a year after my searches for parallel decompression came up empty and I started hacking on pigz.
The model weights (the thing being updated by the training process) stay loaded in GPU memory during training (the slow part). This could be useful for serializing the model weights to disk when checkpointing or when training completes, but that's a drop in the bucket compared to the rest of the time spent training.
1. https://developer.nvidia.com/blog/accelerating-load-times-fo...
2. https://github.com/microsoft/DirectStorage/tree/main/Samples...
3. https://github.com/microsoft/DirectStorage/blob/main/GDeflat...
Disclaimer: I work for NVIDIA, have nothing to do with this, and am not speaking for NVIDIA.