
There are some compressors (zstd is the one I'm thinking of) that accept "dictionaries", which are meant to be produced from datasets similar to what you're going to compress; I would guess they contain something resembling frequency tables. https://github.com/facebook/zstd has some description but doesn't explain precisely what the dictionary contains.


Zstd's dictionaries contain two things:

1. Predefined statistics based on the training data for literals (bytes we couldn't find matches for), literal lengths, match lengths, and offset codes. These let us use tuned statistics without the cost of putting the tables in the frame header, which saves us 100-200 bytes.

2. Content: unstructured excerpts from the training data that are very common. This gets "prefixed" to the data before compression and decompression, to seed the compressor with some common history.

Dictionaries are very powerful tools for small data, but they stop being effective once you get to 100KB or more.
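To make that concrete, here is a minimal sketch of training and using a dictionary with the Python zstandard bindings; the sample file names and dictionary size are hypothetical, not anything zstd prescribes.

    import zstandard as zstd

    # Hypothetical training set: many small, similar records (e.g. JSON documents).
    samples = [open(f"record_{i}.json", "rb").read() for i in range(1000)]

    # Train a dictionary (here capped at 16 KB) from the samples.
    dict_data = zstd.train_dictionary(16 * 1024, samples)

    # Compress one small record with and without the dictionary to compare sizes.
    record = samples[0]
    plain = zstd.ZstdCompressor().compress(record)
    with_dict = zstd.ZstdCompressor(dict_data=dict_data).compress(record)
    print(len(record), len(plain), len(with_dict))

    # Decompression needs the same dictionary.
    out = zstd.ZstdDecompressor(dict_data=dict_data).decompress(with_dict)
    assert out == record

The gap between the two compressed sizes is largest for data much smaller than the dictionary, which is why the benefit fades once inputs reach ~100KB.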


I can imagine dictionaries being useful for:

1. Pre-distributed dictionaries for commonly transmitted data types. E.g., web browsers could ship with a shared dictionary that is generally helpful for JavaScript; servers could then negotiate and send a specially compressed version to clients that have the dictionary.

2. Streaming buffered data: e.g., before sending video, send a dictionary. This is useful because we are interested in compressing each chunk well, not just the overall file. Relatedly, a compression scheme where you need all the bytes before you can decompress any of them is rather useless here. (A rough sketch of per-chunk dictionary compression follows below.)
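As a rough illustration of the second idea, again using the Python zstandard bindings: each chunk is compressed as its own frame against a shared dictionary, so the receiver can decode any chunk as soon as it arrives. The dictionary file path and chunk contents here are placeholders.

    import zstandard as zstd

    # Assume sender and receiver already share this dictionary out of band
    # (e.g. shipped with the client, or sent once at stream start).
    dict_bytes = open("shared.dict", "rb").read()  # hypothetical path
    dict_data = zstd.ZstdCompressionDict(dict_bytes)

    cctx = zstd.ZstdCompressor(dict_data=dict_data)
    dctx = zstd.ZstdDecompressor(dict_data=dict_data)

    # Each chunk becomes an independent frame, so chunk N can be decoded
    # without waiting for the rest of the stream.
    chunks = [b"chunk one ...", b"chunk two ...", b"chunk three ..."]
    frames = [cctx.compress(c) for c in chunks]

    for frame, original in zip(frames, chunks):
        assert dctx.decompress(frame) == original

The trade-off is that independent frames give up cross-chunk history, which is exactly the loss the shared dictionary is meant to compensate for.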



