I have a concept I have explored a bit but has yet to yield a positive response ...

waterhouse · on Aug 11, 2019

There are some compressors (zstd is the one I'm thinking of) that accept "dictionaries", which are meant to be produced from datasets similar to what you're going to compress; I would guess they contain something resembling frequency tables. https://github.com/facebook/zstd has some description but doesn't explain precisely what the dictionary contains.

terrelln · on Aug 13, 2019

Zstd's dictionaries contain two things:

1. Predefined statistics based on the training data for literals (bytes we couldn't find matches for), literal lengths, match lengths, and offset codes. These allow us to use tuned statistics without the cost of putting the tables in the headers, which saves us 100-200 bytes.

2. Content. Unstructured excerpts from the training data that are very common. This gets "prefixed" to the data before compression and decompression, to seed the compressor with some common history.

Dictionaries are very powerful tools for small data, but they stop being effective once you get to 100KB or more.
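
To make the "content" half concrete, here is a minimal sketch. It uses the preset-dictionary support in Python's standard `zlib` module rather than zstd itself, so it only shows the shared-history idea, not zstd's tuned entropy tables. The dictionary bytes and the sample message are made up for illustration.

```python
import zlib
from typing import Optional

# Hypothetical dictionary: substrings assumed to be common across messages.
# A real one would be derived from a training set, not written by hand.
DICTIONARY = (
    b'{"user_id": , "event": "page_view", "path": "/", '
    b'"referrer": "https://example.com/"}'
)

def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate `data`, optionally seeding the window with a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(level=9)
    else:
        c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inflate `blob`; the same dictionary must be supplied if one was used."""
    d = zlib.decompressobj() if zdict is None else zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

# A small message that resembles the assumed training data.
message = (
    b'{"user_id": 48213, "event": "page_view", "path": "/pricing", '
    b'"referrer": "https://example.com/"}'
)

plain = compress(message)
primed = compress(message, DICTIONARY)

print("raw:", len(message), "no dictionary:", len(plain), "with dictionary:", len(primed))
```

The gain comes from the message being small and mostly made of strings the dictionary already contains. On large inputs the compressor builds up its own history quickly, which fits the point above about dictionaries losing their edge as the data grows.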

sagebird · on Aug 13, 2019

I can imagine dictionaries as being useful for:

1. Pre-distributed dictionaries for commonly transmitted data types. For example, web browsers could ship with a shared dictionary that is generally helpful for JavaScript; servers could then negotiate and send a specially compressed version to clients that have the dictionary.

2. Streaming buffered data: e.g., before sending video, send a dictionary. This is useful because we are interested in compressing each chunk well, not just the overall file. Relatedly, a compression scheme where you need all bytes before you can decompress any is rather useless here.
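
A rough sketch of the streaming case, again using the preset dictionary in Python's standard `zlib` as a stand-in (not zstd, and not a real video codec). The dictionary is sent once, and every chunk is compressed on its own against it, so a receiver can decode any chunk without having seen the others. The chunk contents and the `wire` list standing in for the network are invented for illustration.

```python
import zlib
from typing import List

def encode_chunk(chunk: bytes, shared: bytes) -> bytes:
    """Compress one chunk on its own, seeded only by the shared dictionary."""
    c = zlib.compressobj(zdict=shared)
    return c.compress(chunk) + c.flush()

def decode_chunk(blob: bytes, shared: bytes) -> bytes:
    """Decompress one chunk; needs the dictionary but no other chunk."""
    d = zlib.decompressobj(zdict=shared)
    return d.decompress(blob) + d.flush()

# Hypothetical dictionary holding the boilerplate every chunk repeats.
shared = b"chunk-header: codec=demo width=640 height=360 seq= payload="

# Hypothetical chunks; real video frames would be binary data.
chunks: List[bytes] = [
    f"chunk-header: codec=demo width=640 height=360 seq={i} payload={'x' * (i + 1)}".encode()
    for i in range(3)
]

# "Wire" order: the dictionary first, then independently compressed chunks.
wire: List[bytes] = [shared] + [encode_chunk(c, shared) for c in chunks]

# The receiver keeps the dictionary and can decode the last chunk directly,
# even if the earlier chunks were dropped or have not arrived yet.
received_dict = wire[0]
last = decode_chunk(wire[3], received_dict)
print(last == chunks[2])
```

The trade-off is that chunks cannot reference each other, so redundancy between neighboring chunks is only captured to the extent the dictionary already contains it.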