In bioinformatics there is a lot of use of blocked gzip: concatenated, separately compressed chunks which can be indexed and decompressed independently. See the BAM format/samtools.
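
For illustration, a minimal Python sketch of the idea (not the actual BGZF layout, which adds its own header fields): each chunk is compressed as its own gzip member, the members are concatenated, and any member can be decompressed on its own once you know its offset.

    import gzip, io

    # Compress each chunk as an independent gzip member and record its offset.
    chunks = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t50\tG\n"]
    blob, offsets = b"", []
    for chunk in chunks:
        offsets.append(len(blob))       # compressed offset of this member
        blob += gzip.compress(chunk)    # one complete gzip member per chunk

    # The concatenation is still a valid gzip file...
    assert gzip.decompress(blob) == b"".join(chunks)

    # ...and any member can be decompressed independently, given its offset.
    member = gzip.GzipFile(fileobj=io.BytesIO(blob[offsets[1]:]))
    assert member.read(len(chunks[1])) == chunks[1]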

Can something like that be adopted by zstd?




There is libzstd-seek[1] which implements one convention[2] and also ZRA[3] which implements its own thing (apparently?) and then there’s the splittability discussion[4]... Clearly people want seekability, clearly the Zstandard maintainers do not want to get locked into a suboptimal solution. But I have to join this question: I can haz zstd seeks plz?

[1] https://github.com/martinellimarco/libzstd-seek

[2] https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...

[3] https://github.com/zraorg/ZRA

[4] https://github.com/facebook/zstd/issues/395


Pardon my ignorance, but isn't this blocking something which could exist completely separately from / orthogonally to any particular compression algorithm? Is this a question of zstd supporting a blocked compressor, or of a blocked compressor supporting zstd?


The advantage of it being built into the compressor is that you have a standard format that works with the standard tools. If you wrap compressed blocks in a container format everyone using the result needs tools that understand the container layout.

For instance, gzip with the --rsyncable option: it helps rsync and similar processes, but anything that doesn't care can still unpack the compressed file with any old gzip decompression routine. There is a cost, in that the dictionary resets result in a small reduction in compression rates for most content, but you've not introduced a new format.

The bioinformatics people noted above don't use this despite using gzip (instead using a container format for gzip packed blocks) because the result isn't directly seekable, even though in theory you could start decompression at any point where the dictionary was reset. You could add an external index file that lists "compressed byte X is the start of a new cycle, corresponding to uncompressed byte Y", which would solve that without changing the gzip format itself, and might be how I'd have done things. But there are then issues of keeping the content and index file together, and of verifying you have the right index, as data is passed around various places.
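
A rough sketch of that index idea, using zlib full-flush points in Python (the chunk size and index layout here are made up for illustration, not any existing tool's format):

    import zlib

    records = [("chr1\t%d\tdata\n" % i).encode() for i in range(1000)]

    # Compress as raw deflate, forcing a full flush (dictionary reset) every
    # 100 records and recording (uncompressed offset, compressed offset) pairs.
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)   # raw deflate, no header
    out, index, upos = b"", [(0, 0)], 0
    for i, rec in enumerate(records, 1):
        out += comp.compress(rec)
        upos += len(rec)
        if i % 100 == 0:
            out += comp.flush(zlib.Z_FULL_FLUSH)     # reset point
            index.append((upos, len(out)))
    out += comp.flush()

    # To read from uncompressed offset U, find the nearest reset point at or
    # before U and start a fresh decompressor at its compressed offset.
    target = 5000
    u0, c0 = max(p for p in index if p[0] <= target)
    data = zlib.decompressobj(-15).decompress(out[c0:])
    print(data[target - u0:target - u0 + 20])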

The zip format resets compression at file boundaries (each input file is compressed separately) and keeps an index of where each file starts in the compressed output, hence you can seek at the file level and unpack individual files. But it doesn't index within each file, so you can't use this to seek around the compressed version of a single large file. Resetting compression at file boundaries is also why zip is particularly inefficient when compressing many small files, one of the reasons .tar.gz is preferred for source code archives even with the --rsyncable option.
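
You can see that per-entry index through Python's zipfile module, for example (a throwaway in-memory archive here, just to illustrate):

    import io, zipfile

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("a.txt", "first file\n" * 200)
        z.writestr("b.txt", "second file\n" * 200)

    with zipfile.ZipFile(buf) as z:
        for info in z.infolist():
            # Each entry records where its independently compressed data starts.
            print(info.filename, info.header_offset, info.compress_size)
        print(z.read("b.txt")[:11])   # random access at file granularity only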

So if it were possible (and practical) with the zstd output format, an indexed version of what gzip --rsyncable does could be a good compromise: a standard format that is seekable where it needs to be, with minimal reduction in compression. That assumes, of course, that zstd becomes as ubiquitous as gzip, available by default in most places; otherwise not needing different tools to read the compressed files is a moot point and the reduced compression becomes just a cost, not a compromise.


>The bioinformatics people noted above don't use this despite using gzip (instead using a container format for gzip packed blocks) because the result isn't directly seekable

Not sure if we are talking about exactly the same thing, but one of the points of using such blocks is really fast random access to a bunch of records located in some bzip-ed chunk.

This works really well if you fulfill some prerequisites (simplified): 1. the pre-compressed, usually tab/space-delimited data must be sorted (typically by chromosome and position); 2. you create an index telling you that the data for chromosome such_and_such in an interval, say 2_000_000-2_100_000, is in chunk 1234, and that this chunk is at position 987654321 in the file.
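
A toy version of that lookup in Python (a plain sorted list per chromosome; real indexes like tabix/BAI use a more elaborate binning scheme, and the numbers here are made up):

    import bisect

    # Per chromosome: sorted (interval_start, compressed_chunk_offset) pairs.
    index = {
        "chr7": [(0, 0), (1_000_000, 123_456), (2_000_000, 987_654_321)],
    }

    def chunk_offset(chrom, pos):
        """File offset of the compressed chunk covering (chrom, pos)."""
        starts = index[chrom]
        i = bisect.bisect_right(starts, (pos, float("inf"))) - 1
        return starts[i][1]

    print(chunk_offset("chr7", 2_050_000))   # -> 987654321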

As long as all you do is "give me all records and all columns from interval X", this works really fast. On the other hand it can be abused/pushed to imho absurd levels, where data files with thousands of columns are queried in that way. Once you have giant block-compressed files, say in the tens/hundreds of GB, adding new data (that would be extra columns...) to such monstrosities becomes a real PITA.


> Not sure if we are talking about exactly the same thing

I think we are. The tweak that --rsyncable makes to the compression stream reduces the effect of small changes early in the input, but still only allows for forward motion through the data when processing. If an index of those reset points were kept, it would act like the index you describe: you could randomly access the compressed output with source-block granularity and only have to decompress the blocks needed.

The original post I replied to mentioned gzip, though you mention bzip - not sure if the latter is a typo or you mean bzip2. bzip2 might make sense for that sort of data, compressed once and read many times (it is slower but, with that sort of data, likely to get far better compression). It has no option similar to gzip's --rsyncable, but pbzip2's method of chunking may have the same effect.
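
For what it's worth, pbzip2-style output really is just independent bzip2 streams concatenated, which a standard decompressor accepts; a quick Python illustration (offsets computed naively here, purely for the demo):

    import bz2

    parts = [b"block one " * 50, b"block two " * 50]
    blob = b"".join(bz2.compress(p) for p in parts)

    # A standard decompressor handles the concatenated streams as one file...
    assert bz2.decompress(blob) == b"".join(parts)

    # ...and each stream can also be decompressed on its own, given its offset.
    off = len(bz2.compress(parts[0]))
    assert bz2.decompress(blob[off:]) == parts[1]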


Sorry, indeed I made a typo:

http://www.htslib.org/doc/bgzip.html



