If you are interested in optimizing parallel decompression and you happen to have a suitable NVIDIA GPU, GDeflate [1] is interesting. The target market for this is PC games using DirectStorage to quickly load game assets. The graph in [1] shows DirectStorage maxing out the throughput of a PCIe Gen 3 drive at about 3 GiB/s when compression is not used. When GPU GDeflate is used, the effective rate hits 12 GiB/s.
If you have suitable hardware running Windows, you can try this out for yourself using Microsoft's DirectStorage GPU decompression benchmark [2].
A reference implementation of a single-threaded compressor and a multi-threaded (CPU) decompressor can be found at [3]. It is Apache-2.0 licensed.
I assume this is for decompressing multiple independent deflate streams in parallel?
What's the throughput if you only have a single stream? I realise this is the unhappy case for GPU acceleration, hence my question! (I've been thinking about some approaches to parallelizing decompression of a single stream; it's not easy.)
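For contrast with the single-stream case, the multi-stream case is straightforward: when each stream is an independent deflate stream, workers need no coordination at all. A minimal sketch (the chunking scheme and function names here are illustrative, not GDeflate's actual format; zlib releases the GIL during decompression, so plain threads scale):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(data: bytes, chunk_size: int = 64 * 1024) -> list:
    """Compress each fixed-size chunk as an independent raw deflate stream."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    out = []
    for chunk in chunks:
        c = zlib.compressobj(wbits=-15)  # negative wbits = raw deflate, no header
        out.append(c.compress(chunk) + c.flush())
    return out


def parallel_decompress(streams: list, workers: int = 8) -> bytes:
    """Decode independent streams concurrently; order is preserved by map()."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda s: zlib.decompress(s, wbits=-15), streams)
    return b"".join(parts)


if __name__ == "__main__":
    original = b"highly compressible " * 100_000
    streams = compress_chunks(original)
    assert parallel_decompress(streams) == original
```

The trade-off is the usual one: chunking resets the LZ window at every boundary, so compression ratio drops slightly in exchange for embarrassingly parallel decompression.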
The GPU decompression benchmark I linked earlier allows you to specify a single file that it will compress with GDeflate (and with zlib for comparison). The numbers in the docs that come with the benchmark, and those presented elsewhere, are consistent with my own runs using a highly compressible source file.
Part of the trick to achieving this speedup is reading the data fast enough. I don't know of any NVMe drive that can reach full speed at a queue depth of 1. While running the benchmark in a Windows VM with a GPU passed through, I observed on the Linux host that the average read size was about 512 KiB and the queue depth was sometimes over 30.
I consider using an index to be "cheating" - or rather, my intended use-case is decompression of a stream that you've never seen before, which was generated by a "dumb" compressor.
That said, the approach I intend to take is similar. The idea is that one thread is dedicated to looking ahead, parsing as fast as it can (or even jumping far ahead and using heuristics to re-sync the parse state; there will be false positives, but you can verify them later), building an index without actually decompressing, while secondary threads are spawned to decompress from the identified block start points. The hard part is dealing with LZ back-references to data that hasn't yet been decompressed. Worst-case performance will be abysmal, but I think that on most real-world data you'll beat a serial decompressor if you can throw enough threads at it.
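The division of labor described above can be sketched as a producer/consumer pipeline: one look-ahead thread publishes block start points to a queue while worker threads decode from them. This is only the skeleton; the genuinely hard parts (bit-level re-sync heuristics, verifying false positives, and cross-block LZ back-references) are not modeled. Each "block" here is an independent raw deflate stream and the block table is precomputed, standing in for the fast parse; all names are illustrative:

```python
import queue
import threading
import zlib


def make_blob(chunks):
    """Concatenate independently deflated chunks, recording (offset, size)."""
    blob, table = b"", []
    for chunk in chunks:
        c = zlib.compressobj(wbits=-15)
        raw = c.compress(chunk) + c.flush()
        table.append((len(blob), len(raw)))
        blob += raw
    return blob, table


def index_thread(blob, block_table, work_q):
    # Stand-in for the look-ahead parser: the real scheme would walk the
    # bit stream (or jump ahead and verify candidate block starts) instead
    # of reading a known table.
    for idx, (off, size) in enumerate(block_table):
        work_q.put((idx, blob[off:off + size]))
    work_q.put(None)  # sentinel: indexing finished


def decode_worker(work_q, results):
    while True:
        item = work_q.get()
        if item is None:
            work_q.put(None)  # re-post sentinel so other workers also exit
            break
        idx, raw = item
        results[idx] = zlib.decompress(raw, wbits=-15)


def parallel_decode(blob, block_table, workers=4):
    work_q = queue.Queue()
    results = [None] * len(block_table)
    threads = [threading.Thread(target=decode_worker, args=(work_q, results))
               for _ in range(workers)]
    threads.append(threading.Thread(target=index_thread,
                                    args=(blob, block_table, work_q)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(results)
```

Because each block is self-contained here, workers can write their results independently; handling a back-reference into a not-yet-decoded block would require stalling or re-decoding, which is exactly the worst case mentioned above.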
There is also this: https://github.com/mxmlnkn/pragzip. I did some benchmarks on some really beefy 128-core machines and was able to reach almost 20 GB/s of decompression bandwidth.
Interesting. It looks like https://github.com/zrajna/zindex became public about a year after my searches for parallel decompression came up empty and I started hacking on pigz.
The model weights (the thing being updated by the training process) stay loaded in GPU memory during training (the slow part). This could be useful for serializing the model weights to disk when checkpointing or when training completes, but that's a drop in the bucket compared to the rest of the time spent training.
1. https://developer.nvidia.com/blog/accelerating-load-times-fo...
2. https://github.com/microsoft/DirectStorage/tree/main/Samples...
3. https://github.com/microsoft/DirectStorage/blob/main/GDeflat...
Disclaimer: I work for NVIDIA, have nothing to do with this, and am not speaking for NVIDIA.