
The entire English dictionary (being very generous on what a "word" is) is around 4MB - nothing nowadays.

Not to mention, of course, that your computer probably already has 50 copies of it somewhere if you really don't want to bundle it.



The vast majority of text objects anybody generates won't exceed 4MB though. 4MB is 4 million ASCII/UTF-8 characters; if we assume a typical English word is 4 characters for simplicity, that's 1 million words without spaces. A quick Google search for "novel that is 1 million words" says this is twice the word count of The Lord of the Rings, and roughly the combined word count of the first 3 A Song of Ice and Fire (Game of Thrones) books. Accounting for whitespace, longer common words, and inefficient encoding schemes would bring that overestimate down to, what, 150K words? That's still a 300-page or 600-page book (depending on the spacing) according to Google - still massive.
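A rough version of that arithmetic, with every figure taken as an assumption from above (4 characters per word, 250-500 words per printed page) rather than a measurement:

    # Back-of-the-envelope arithmetic for "how much text is 4MB?"
    budget_chars = 4_000_000            # 4MB of ASCII/UTF-8 is ~4 million characters
    print(budget_chars // 4)            # 1,000,000 words at 4 chars/word, ignoring spaces
    print(budget_chars // 5)            # ~800,000 words once each word gets a space
    # The 150K-word figure folds in longer words, punctuation, and less
    # efficient encodings; at 250-500 words per printed page that is:
    print(150_000 // 500, "to", 150_000 // 250, "pages")   # 300 to 600 pages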

I see it only working where there's massive pooling, like you say. An OS or a tool provides a known dictionary service: you call into it with a text string and get back an array of indices, which you can then decode with another call (something like the sketch below). That amortizes the dictionary cost across all possible uses of the service, which is a lot if it's a standard, well-known service. Another scenario is a database/cloud storage/social media platform: any single text object might be small, but the total text stored overall is massive.
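To make the interface concrete, here's a minimal sketch of such a service in Python. The tiny WORDS list stands in for a real ~4MB dictionary, and the encode/decode names are made up; a real service would also need to handle punctuation, casing, and out-of-vocabulary words:

    # Toy shared-dictionary text service: text in, array of indices out.
    WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
    INDEX = {w: i for i, w in enumerate(WORDS)}

    def encode(text: str) -> list[int]:
        """Map a text string to indices into the shared dictionary."""
        return [INDEX[w] for w in text.lower().split()]

    def decode(indices: list[int]) -> str:
        """Map indices back to text via the same shared dictionary."""
        return " ".join(WORDS[i] for i in indices)

    msg = "the quick brown fox jumps over the lazy dog"
    ids = encode(msg)                   # [0, 1, 2, 3, 4, 5, 0, 6, 7]
    assert decode(ids) == msg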


Except that the corpus doesn't need to be stored as part of the compressed message, and can be considered part of the compression algorithm. It increases the size of the decoder by ~4MB, but doesn't increase the size of each message.
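This is essentially how dictionary compression in zstd works: a shared dictionary is built once and shipped with both the encoder and decoder, and each small message is compressed against it. A sketch using the python-zstandard bindings, where the placeholder corpus stands in for the ~4MB English corpus:

    # The shared dictionary ships with the codec, not with each message.
    import zstandard as zstd

    corpus = b"the quick brown fox jumps over the lazy dog " * 100   # placeholder corpus
    shared = zstd.ZstdCompressionDict(corpus)

    msg = b"the lazy dog jumps over the quick brown fox"
    compressed = zstd.ZstdCompressor(dict_data=shared).compress(msg)
    restored = zstd.ZstdDecompressor(dict_data=shared).decompress(compressed)
    assert restored == msg
    print(len(msg), "->", len(compressed), "bytes")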



