Data compression can reduce the space required to store a single payload. That's not what's going on here. I mean, I understand what you're saying, but calling this compression seems counter-productive to understanding what's going on.
I think it is useful to consider this a form of lossless compression, not only because it technically is one, but also because a simple scheme like this is a good introduction to the various tradeoffs that compression algorithms make.
This is an instance of dictionary encoding where the terms are words and the dictionary code space is the entire machine address space (the heap).
The disadvantage of this approach is that you can only "deduplicate" words that are exactly the same; you cannot remove duplicated substrings. Another disadvantage is that the dictionary code size is 8 bytes, which means you achieve savings only for words longer than 8 bytes (plus some additional overhead for the interning index).
Compression algorithms that use dictionary encoding typically use code/symbol sizes smaller than 64 bits, often with variable-length codes produced by some kind of entropy coding, so that codes for dictionary entries that appear often get short bit sequences and less frequent ones get longer sequences.
But the idea is the same: during decompression you read each "number" from the compressed stream and use it as an index into an array that contains the target (sub)strings.
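For illustration, here is a minimal sketch of that encode/decode loop. The function names and the sample input are made up for this comment, not taken from the post:

    def encode(words):
        dictionary = []   # code -> term
        codes = {}        # term -> code
        out = []
        for w in words:
            if w not in codes:
                codes[w] = len(dictionary)
                dictionary.append(w)
            out.append(codes[w])
        return dictionary, out

    def decode(dictionary, stream):
        # Decompression: each code is just an index into the dictionary array.
        return [dictionary[c] for c in stream]

    text = "the cat sat on the mat the cat".split()
    dictionary, stream = encode(text)
    assert decode(dictionary, stream) == text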
The algorithm in the post just interprets the 64-bit dictionary code as an index into the giant array that is the machine address space.
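In Python, for instance, `sys.intern` gives roughly the same effect: duplicate strings collapse into a single object, and the 8-byte object reference (a pointer in CPython) plays the role of the dictionary code. A small illustrative snippet, not from the post:

    import sys

    words = ["internationalization", "localization", "internationalization"]
    interned = [sys.intern(w) for w in words]

    # Duplicates now reference the same heap object; the reference itself
    # acts as the "code" that points into the machine's address space.
    assert interned[0] is interned[2]
    print(id(interned[0]) == id(interned[2]))  # True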