I have an idea I've explored a bit, but it has yet to yield a positive result (I have not experimented long enough to come to any conclusion):
Can you add text to a string of text to make the compressed string shorter?
I suppose you could call it “compressor hinting” and it would be specific to the compressing algorithm. The added text would be tagged through an escape sequence, so it could be removed after the decompressing stage.
My naive approach is to add randomly generated hints at randomly chosen locations and then gzip/ungzip. I haven’t had success yet. I think the potential is limited by the expressiveness of the compressor’s “instruction set”, i.e., whether it can understand generalized hints.
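For anyone who wants to try it, the naive experiment might look roughly like the sketch below. It uses Python's zlib (the same DEFLATE algorithm gzip uses) and a made-up tagging scheme: an escape byte, a length byte, then the hint. The escape byte, the helper names, and the trial count are all assumptions for illustration, and the scheme only works if the escape byte never occurs in the original text.

```python
import random
import zlib

ESC = b"\x00"  # hypothetical escape byte; assumed never to appear in the input


def add_hint(data: bytes, hint: bytes, pos: int) -> bytes:
    """Insert a tagged hint at pos, encoded as ESC + length byte + hint."""
    return data[:pos] + ESC + bytes([len(hint)]) + hint + data[pos:]


def strip_hints(data: bytes) -> bytes:
    """Remove every tagged hint, recovering the original text."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            i += 2 + data[i + 1]  # skip ESC, the length byte, and the hint
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def random_search(text: bytes, trials: int = 200, seed: int = 0):
    """Try random hints at random positions; keep the smallest result."""
    rng = random.Random(seed)
    baseline = len(zlib.compress(text, 9))
    best_size, best_candidate = baseline, None
    for _ in range(trials):
        hint = bytes(rng.randrange(256) for _ in range(rng.randrange(1, 8)))
        pos = rng.randrange(len(text) + 1)
        candidate = add_hint(text, hint, pos)
        size = len(zlib.compress(candidate, 9))
        if size < best_size:
            best_size, best_candidate = size, candidate
    return baseline, best_size, best_candidate


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog. " * 20
    baseline, best_size, _ = random_search(sample)
    print("baseline:", baseline, "best with one hint:", best_size)
```

A hint only counts as a win if the compressed size drops even after paying for the hint itself, which is a high bar for a compressor like DEFLATE that has no notion of "instructions" in its input.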
There are some compressors (zstd is the one I'm thinking of) that accept "dictionaries", which are meant to be produced from datasets similar to what you're going to compress; I would guess they contain something resembling frequency tables. https://github.com/facebook/zstd has some description but doesn't explain precisely what the dictionary contains.
1. Predefined statistics based on the training data for literals (bytes we couldn't find matches for), literal lengths, match lengths, and offset codes. These allow us to use tuned statistics without the cost of putting the tables in the headers, which saves us 100-200 bytes.
2. Content. Unstructured excerpts from the training data that are very common. This gets "prefixed" to the data before compression and decompression, to seed the compressor with some common history.
Dictionaries are very powerful tools for small data, but they stop being effective once you get to 100KB or more.
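To get a feel for the "content" part without installing anything, here is a rough sketch using zlib's preset dictionary (the `zdict` argument) from the Python standard library. This is DEFLATE, not zstd, and it has no equivalent of the predefined statistics tables, so treat it as an analogy only. The dictionary bytes and the sample message are made up for illustration.

```python
import zlib

# Hypothetical dictionary: fragments we expect to recur in small JSON messages.
ZDICT = b'{"user_id": , "status": "active", "email": "@example.com"}'


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress data with the preset dictionary seeding the match history."""
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress; the exact same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


if __name__ == "__main__":
    message = b'{"user_id": 42, "status": "active", "email": "bob@example.com"}'
    plain = zlib.compress(message, 9)
    seeded = compress_with_dict(message, ZDICT)
    assert decompress_with_dict(seeded, ZDICT) == message
    print("raw:", len(message), "no dict:", len(plain), "with dict:", len(seeded))
```

The gain comes from the message being tiny: on its own it contains almost no repetition for the compressor to exploit, so the shared history does most of the work. For large inputs the compressor builds up its own history quickly and the dictionary matters much less.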
1. Pre-distributed dictionaries for commonly transmitted data types. For example, web browsers could ship with a shared dictionary that is generally helpful for JavaScript, then servers could negotiate and send a specially compressed version to clients that have the dictionary.
2. Streaming buffered data: e.g., before sending video, send a dictionary. This is useful because we are interested in compressing each chunk well, not just the overall file. Relatedly, a compression scheme where you need all the bytes before you can decompress any of them is rather useless here.