This episode was fascinating. I had heard of LZ4 but not Zstd. It spurred me to ...

pmarreck · on May 12, 2023

Probably the most underrated feature of zstd (likely because it's so unusual) is the ability to create a separate compression dictionary. This lets you build customized, highly efficient dictionaries tailored to one type of data AND compress individual elements of that data without embedding a dictionary in every compressed output.

So for example, take logfiles. You can train a dictionary on some sample log data. Then you can compress individual log rows, and each output stores only the compressed data (plus a small dictionary ID), not a copy of the dictionary. So you get very efficient compression of small pieces of data that belong to a very self-similar collection, with the option of decompressing any individual element at will. (Of course, you'd need to hold onto the original trained dictionary for both compression and decompression, for any row you want to be able to decompress in the future. And for slowly-changing types of data you might want to retrain the dictionary every so often, so that compression efficiency doesn't "drift" downward over time.)

I believe Postgres already uses this under the hood for some columnar data. It wouldn't take much to index it before compressing it and just decompress it at will. Or maybe it just got added? https://devm.io/databases/postgresql-release

fnordpiglet · on May 12, 2023

I do this to make extraordinarily small UDP packets for a low latency system. I record the raw payload then build a dictionary for the data, then share it on both sides. It reduces the packet overhead by removing the dictionary and it does a much better job than other approaches.

muragekibicho · on May 12, 2023

I saw that zstd and brotli both support creating custom dictionaries but I couldn't find any tutorials showing how to do this. Perhaps you could share code?

pmarreck · on May 12, 2023

Basically,

`zstd --train <path/to/directory/of/many/small/example/files>/*`

will output a dictionary file (named `dictionary` by default; use `-o` to choose the name), and the `-D <path/to/dictionary/file>` option, given for both compression and decompression, makes zstd use that dictionary.
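For what it's worth, here is a rough sketch of the whole round trip as shell functions. The function names and argument order are just assumptions of mine for illustration; `--train`, `-o`, `-D`, `-d`, `-f`, and `-q` are regular zstd CLI options.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are made up for illustration.

# Train a dictionary from a directory of many small sample files.
# $1 = sample directory, $2 = output dictionary path
train_dict() {
  zstd -q --train "$1"/* -o "$2"
}

# Compress one small file using the shared dictionary.
# $1 = dictionary, $2 = input file, $3 = output .zst file
compress_with_dict() {
  zstd -q -f -D "$1" "$2" -o "$3"
}

# Decompress; the same dictionary has to be supplied again.
# $1 = dictionary, $2 = input .zst file, $3 = output file
decompress_with_dict() {
  zstd -q -f -d -D "$1" "$2" -o "$3"
}
```

Something like `train_dict samples app.dict && compress_with_dict app.dict row.log row.log.zst` would be the typical use. Training tends to want a lot of samples (hundreds or more), so don't expect it to work on a handful of files.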

You can also check `man zstd` or google "zstd --train" for more details. The training directory must consist of many small files, each of which is an example artifact. If you want to split, say, a single log file into one file per line, you can use a bash script like this one (note that I just created it with ChatGPT and eyeballed it; it looks correct, but I haven't run it yet!): https://gist.github.com/pmarreck/91124e761e45d6860834eb046d6... (Also, don't forget to make it executable with `chmod +x split_file.bash` before you try to run it directly.)
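If you'd rather not write a loop at all, the standard `split` utility can probably do the one-file-per-line part. A minimal sketch follows; `split_lines` and the `line_` prefix are hypothetical names of mine, and this is not the script from the gist.

```shell
#!/usr/bin/env bash
# Sketch only: split_lines is a hypothetical helper, not the gist's script.
# Writes each line of a text file to its own file in an output directory.
# $1 = input file, $2 = output directory
split_lines() {
  mkdir -p "$2" || return 1
  # -l 1: one line per output file
  # -a 6: six-letter suffixes (line_aaaaaa, line_aaaaab, ...)
  split -l 1 -a 6 "$1" "$2/line_"
}
```

Usage would be along the lines of `split_lines app.log samples`, after which the `samples` directory can be handed to `zstd --train`.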

muragekibicho · on May 12, 2023

Thank you so much. I was trying to create a dictionary last night and your comment was sent by God. You're doing the Lord's work frfr! I followed you on GitHub!

pmarreck · on May 12, 2023

Remember that if you don't understand a particular line of code, you can have ChatGPT explain it... have fun

chasil · on May 12, 2023

There are also parallel versions of bzip2 (pbzip2), lzip (plzip), and xz (pixz).

Depending upon the data, the non-threaded versions of these utilities can have higher performance when run with some kind of dispatcher on multiple files.

The GNU xargs utility is able to do this, and the relevant features are also in busybox.
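A minimal sketch of that dispatcher pattern, assuming an `xargs` that supports `-0`, `-r`, and `-P` (GNU does; busybox does when built with those features). The `compress_all` name and the `*.log` pattern are just placeholders of mine.

```shell
#!/usr/bin/env bash
# Sketch only: compress_all is a hypothetical helper name.
# Runs one single-threaded compressor process per file, several at a time.
# $1 = directory, $2 = number of parallel jobs, $3 = compressor (defaults to xz)
compress_all() {
  find "$1" -type f -name '*.log' -print0 |
    # -0: NUL-separated names, -r: do nothing if there are no files,
    # -n 1: one file per invocation, -P: how many to run at once
    xargs -0 -r -n 1 -P "$2" "${3:-xz}"
}
```

For example, `compress_all /var/log/myapp 8` would keep up to eight `xz` processes busy, each working on a different file.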