
Awesome post, it's great to see someone diving into the details of Zstd!

Maintainer of Zstd here, AMA




First off, I love zstd and thanks for your work - I've used it multiple times with great success.

My question is...what's up with zstd compression levels? It seems to be impossible to find documentation on what the supported compression levels are, and then there are magic numbers that make things more confusing (I think last time I checked, 0 means 'default' which translates to level 3?)

My notes from trying to trace the minimum compression level through multiple header files seemed to indicate it's MINCLEVEL...which appears to be -131072? But then it starts talking about block sizes and target lengths, and I'm not entirely sure why those would relate to a compression level.

  # define MINCLEVEL  ZSTD_minCLevel() [0]
  int ZSTD_minCLevel(void) { return (int)-ZSTD_TARGETLENGTH_MAX; } [1]
  #define ZSTD_TARGETLENGTH_MAX    ZSTD_BLOCKSIZE_MAX [2]
  #define ZSTD_BLOCKSIZE_MAX     (1<<ZSTD_BLOCKSIZELOG_MAX) [3]
  #define ZSTD_BLOCKSIZELOG_MAX  17 [3]
[0] https://github.com/facebook/zstd/blob/550410d05d7c7815b1ff41...

[1] https://github.com/facebook/zstd/blob/550410d05d7c7815b1ff41...

[2] https://github.com/facebook/zstd/blob/550410d05d7c7815b1ff41...

[3] https://github.com/facebook/zstd/blob/550410d05d7c7815b1ff41...

Basically, is there a decent reference I can be looking at here?


Thanks for the feedback! I've opened an issue to track this [0]

* Levels 1-19 are the "standard" compression levels.

* Levels 20-22 are the "ultra" levels which require --ultra to use on the CLI. They allocate a lot of memory and are very slow.

* Level 0 is the default compression level, which is 3.

* Levels < 0 are the "fast" compression levels. They achieve speed by turning off Huffman compression, and by "accelerating" compression by a factor. Level -1 has acceleration factor 1, -2 has acceleration factor 2, and so on. So the minimum supported negative compression level is -131072, since the maximum acceleration factor is our block size. But in practice, I wouldn't expect a negative level lower than -10 or -20 to be all that useful (see the sketch below for querying the supported range at runtime).

[0] https://github.com/facebook/zstd/issues/3133
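
A minimal sketch of checking that range from code via the public API (ZSTD_defaultCLevel() assumes zstd >= 1.5.0; on older releases the default is only exposed as the ZSTD_CLEVEL_DEFAULT macro):

  #include <stdio.h>
  #include <zstd.h>

  int main(void) {
      /* Runtime queries for the supported compression level range.   */
      /* Any of these values can be passed wherever a level is taken, */
      /* e.g. ZSTD_compress() or ZSTD_CCtx_setParameter().            */
      printf("min level:     %d\n", ZSTD_minCLevel());     /* -131072 */
      printf("max level:     %d\n", ZSTD_maxCLevel());     /* 22 */
      printf("default level: %d\n", ZSTD_defaultCLevel()); /* 3 */
      return 0;
  }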


We're still reserving the right to fiddle around with the meaning of our negative compression levels. We think that we may be able to offer more compression at the same speeds by completely changing our search strategies for very fast compression speeds. But, there is only so much time in the day, and we haven't had time to investigate it yet. So we don't want to lock ourselves into a particular scheme right now.


This is exactly what I was hoping for! If you just copied and pasted this into the documentation directly, that'd be more than enough. Thanks for writing it out so clearly and creating the issue.


> They allocate a lot of memory and are very slow.

Slow and takes memory to compress, uncompress, or both?


They are only slow to compress. They are as fast to decompress, or even faster sometimes. Source: I tested.


Yeah that's correct. I'll just point out that they use a larger window size, so they will use more memory to decompress, but will still be fast.
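
A small illustration on the API side (a sketch using the advanced parameters in zstd.h, not specific to the ultra levels): the window is where that decompression memory goes, and by default the decompressor refuses frames whose window exceeds ZSTD_WINDOWLOG_LIMIT_DEFAULT, so frames produced with very large windows (e.g. --long=30 on the CLI) need the cap raised:

  #include <zstd.h>

  /* Sketch: create a decompression context that accepts frames with
     windows up to 2^30 bytes. Raising the cap means the decoder may
     allocate up to ~1 GiB for the window on such frames. */
  static ZSTD_DCtx *make_big_window_dctx(void) {
      ZSTD_DCtx *dctx = ZSTD_createDCtx();
      if (dctx != NULL)
          ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 30);
      return dctx;
  }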


Both.


Ohh, don't mind if I do! I'm working on CPU libraries to improve compression performance. Does zstd benefit today from ISA extensions like AVX-512, AVX2, etc.? Do you foresee the need in the future to offload compression/decompression to an accelerator?

Another Q, what hardware platform(s) do you use to test new versions of zstd for regressions/improvements? I see the Github page shows a consumer CPU in use, but what about server CPUs pushing maximum throughput with many threads/cores?


> Does zstd benefit today from ISAs like AVX-512, AVX-2, etc.?

Zstd benefits mostly from BMI(2), it takes advantage of shlx, shrx, and bsf during entropy (de)coding. We do use SSE2 instructions in our lazy match finder to filter matches based on a 1-byte hash, in a similar way that F14 and Swiss hash tables do. We also use vector loads/stores to do our copies during decompression.

We don't currently benefit from AVX2. There are strategies to use AVX2 for Huffman & FSE compression, but they don't fit well with zstd's format, so we don't use them there. Additionally, our latest Huffman decoder is very competitive with the AVX2 decoder on real data. And a Huffman format that is fast for AVX2 is often slow for scalar decoding, so it is hard to opt into it unless you know you will only run on CPUs that support AVX2. Lastly, AVX2 entropy decoders rely on a fast gather instruction, and gather is slow on AMD machines.
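
To give a flavor of that tag-filtering trick (just a generic sketch of the F14/Swiss-table idea, not zstd's actual match finder code): compare a 1-byte tag against a row of stored tags with one SSE2 comparison and get back a bitmask of the slots worth probing further.

  #include <emmintrin.h>  /* SSE2 */
  #include <stdint.h>

  /* Returns a 16-bit mask: bit i is set iff tags[i] == tag, so only
     those slots need a full match verification. */
  static unsigned candidate_mask(const uint8_t tags[16], uint8_t tag) {
      const __m128i stored = _mm_loadu_si128((const __m128i *)tags);
      const __m128i probe  = _mm_set1_epi8((char)tag);
      return (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(stored, probe));
  }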

> Do you foresee the need in the future to offload compression/decompression to an accelerator?

Yes, there are trends in this direction. Intel QAT supports zlib hardware acceleration. The PS5 has accelerated decompression. I'm pretty sure Microsoft has also released a paper about a hardware-accelerated algorithm. Compression & decompression take up a significant portion of datacenter CPU, so it makes sense to hardware accelerate them.

> what hardware platform(s) do you use to test new versions of zstd for regressions/improvements?

We mostly test and optimize for x86-64 CPUs, on a mix of server and consumer CPUs. But, we also test on ARM devices to make sure we don't have major regressions. We've found that optimizing for x86-64 has done a good job of getting good baseline performance on other architectures, though there is certainly room left to improve.


In bioinformatics there is a lot of usage of blocked gzip: concatenated, separately compressed chunks which can be indexed and decompressed independently. See the BAM format/samtools.

Can something like that be adopted by zstd?


There is libzstd-seek[1] which implements one convention[2] and also ZRA[3] which implements its own thing (apparently?) and then there’s the splittability discussion[4]... Clearly people want seekability, clearly the Zstandard maintainers do not want to get locked into a suboptimal solution. But I have to join this question: I can haz zstd seeks plz?

[1] https://github.com/martinellimarco/libzstd-seek

[2] https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...

[3] https://github.com/zraorg/ZRA

[4] https://github.com/facebook/zstd/issues/395


Pardon my ignorance, but isn't this blocking something which could exist completely separately from / orthogonally to any particular compression algorithm? Is this a question of zstd supporting a blocked compressor, or of a blocked compressor supporting zstd?


The advantage of it being built into the compressor is that you have a standard format that works with the standard tools. If you wrap compressed blocks in a container format, everyone using the result needs tools that understand the container layout.

For instance gzip with the --rsyncable option: it helps rsync and similar processes, but anything that doesn't care can still unpack the compressed file with any old gzip decompression routine. There is a cost, in that the dictionary resets result in a small reduction in compression rates for most content, but you've not introduced a new format.

The bioinformatics people noted above don't use this despite using gzip (instead using a container format for gzip packed blocks) because the result isn't directly seekable even though in theory you could start decompression at any point that the dictionary was reset. You could add an external index file that lists “output byte X is the start of a new cycle, at input byte Y” which would solve that without changing the gzip format itself, which might be how I'd have done things, but there are then issues of keeping the content and index file together, and verifying you have the right index, as data is passed around various places.
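
Roughly what such an external index could look like (a purely hypothetical layout, just to make the idea concrete):

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical index entry: "uncompressed byte X starts a new cycle
     at compressed byte Y". */
  struct reset_point { uint64_t uncomp_off, comp_off; };

  /* Return the last reset point at or before `target`, so decompression
     can start there and only read forward within one block. Entries are
     assumed sorted by uncomp_off, with entry 0 at offset 0. */
  static const struct reset_point *
  seek_start(const struct reset_point *idx, size_t n, uint64_t target) {
      size_t lo = 0, hi = n;  /* binary search over a half-open range */
      if (n == 0) return NULL;
      while (hi - lo > 1) {
          size_t mid = lo + (hi - lo) / 2;
          if (idx[mid].uncomp_off <= target) lo = mid; else hi = mid;
      }
      return &idx[lo];
  }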

The zip format resets compression at file boundaries (each input file is compressed separately) and keeps an index of where each file starts in the compressed output, hence you can seek at the file level and unpack individual files, but it doesn't index within each file, so you can't use this to seek around the compressed version of a single large file. Resetting compression at file boundaries is why zip is particularly inefficient when compressing many small files, one of the reasons .tar.gz is preferred for source code archives even with the --rsyncable option.

So if it were possible (and practical) with the zstd output format, an indexed version of what gzip --rsyncable does could be a good compromise for a standard format that is seekable where it needs to be, with minimal reduction in compression. That assumes, of course, that zstd becomes as ubiquitous as gzip, available by default in most places; otherwise not needing different tools to read the compressed files is a moot point and the reduced compression becomes just a cost, not a compromise.


>The bioinformatics people noted above don't use this despite using gzip (instead using a container format for gzip packed blocks) because the result isn't directly seekable

Not sure if we are talking about exactly the same thing, but one of the points of using such blocks is really fast random access to a bunch of records located in some bzip-ed chunk.

This works really well if you fulfill some prerequisites (simplified):

1. The pre-compressed, usually tab/space-delimited data must be sorted (typically by chromosome and position).

2. You create an index telling you that the data for chromosome such_and_such in an interval of, say, 2_000_000-2_100_000 is in chunk 1234, and that this chunk is at position 987654321 in the file.
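
Very roughly, an index record then boils down to something like this (a made-up layout just to illustrate the idea, not the actual index format samtools uses):

  #include <stdint.h>

  /* Illustrative only: maps a genomic interval to the file offset of
     the compressed chunk that holds those (sorted) records. */
  struct interval_index_entry {
      int32_t  chrom_id;          /* chromosome such_and_such */
      uint32_t interval_start;    /* e.g. 2000000 */
      uint32_t interval_end;      /* e.g. 2100000 */
      uint64_t chunk_file_offset; /* e.g. byte 987654321 in the file */
  };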

As long as all you do is "give me all records and all columns from interval X", this works really fast. On the other hand it can be abused/pushed to, imho, absurd levels where data files with thousands of columns are queried in that way. Once you've got giant (say tens or hundreds of GB) block-compressed files, adding new data (that would be extra columns...) to such monstrosities becomes a real PITA.


> Not sure if we are talking about exactly the same thing

I think we are. The tweak that --rsyncable does to the compression stream reduces the effect of small changes early in the input, but still only allows forward motion through the data when processing. If an index of those reset points were kept, it would act like the index you describe, so you could random-access the compressed output with source-block granularity and only have to decompress the blocks needed.

The original post I replied to mentioned gzip, though you mention bzip - not sure if the latter is a typo or you mean bzip2. bzip2 might make sense for that sort of data, being compressed once and read many times (it is slower, but with that sort of data likely to get far better compression). It has no option like gzip's --rsyncable, though pbzip2's method of chunking may have the same effect.


Sorry, indeed I made a typo:

http://www.htslib.org/doc/bgzip.html


How did you manage to grok FSE/ANS? Speaking as someone who has tried and failed and would like to try again


Stare at the Wikipedia page's tANS and generic ANS diagrams and explanations for a couple of hours.


Like in-person explanations? Or do you have any links you recommend?


The English (and maybe German) Wikipedia articles and the diagrams (incl. subtext/descriptions) contained within.



