
It’s an experimental result by Shannon: https://archive.org/details/bstj30-1-50/page/n5/mode/1up

In short, you show someone an English text cut off at an arbitrary point and ask them to predict which letter comes next. Based on how successful they are, you can calculate the information content of the text. The result from this experiment was approximately one bit per letter.

Representing it is not the concern of the experiment. I don’t think anyone has a scheme that can do this. But it’s straightforward enough in theory. You create a compressor which contains a simulated human English speaker. At each point, ask the simulation to rank all the letters that might come next, in order. Emit the rank of the actual next letter into your compressed data. To decompress, run the same procedure, but apply the ranks you read from the data stream to the simulation’s predictions. If your simulation is deterministic, this will produce output matching the compressor’s input.
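
For illustration, here’s a minimal Python sketch of that scheme, with a fixed letter ranking as a toy stand-in for the simulated English speaker (a real predictor would rank letters by their probability given the preceding text, and you’d then entropy-code the ranks rather than store them raw):

    # Toy deterministic "predictor": a fixed ranking of space plus the 26
    # lowercase letters, ignoring context. A real model would rank candidates
    # by P(next letter | preceding text).
    RANKING = " etaoinshrdlcumwfgypbvkjxqz"

    def rank_next(context):
        return RANKING

    def compress(text):
        # Emit, for each letter, its position in the predictor's ranking.
        return [rank_next(text[:i]).index(ch) for i, ch in enumerate(text)]

    def decompress(ranks):
        # Replay the same predictor and apply the stored ranks.
        out = ""
        for r in ranks:
            out += rank_next(out)[r]
        return out

    msg = "this will produce output matching the input"
    assert decompress(compress(msg)) == msg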




Say that experiment is correct. Wouldn't that imply that the information content of a single letter varies based on the possible continuations?

I.e., the string "I'v_" provides way more context than "con_": if I'm typing "I'v_" you're almost certain to get "I've", whereas "con_" could continue as "contraception" or plenty of other words.

That seems to disprove the idea that a letter is a bit.

Also, the fact that there are more than two letters indicates more than one bit, though I wouldn't even want to start guessing at the encoding scheme of the brain.


I don’t follow. Of course the probabilities change depending on context. 1 bit per letter is an average, not an exact measure for each individual letter. There are cases where the next letter is virtually guaranteed, and the information content of that letter is much less than one bit. There are cases where many different letters are plausible, and there it’s more than one bit. On average it’s about one bit.

> Also, the fact that there are more than two letters indicates more than one bit

This seems to deny the possibility of data compression, which I hope you’d reconsider, given that this message has probably been compressed and decompressed several times before it gets to you.

Anyway, it should be easy to see that the number of bits per symbol isn’t tied to the number of symbols when there’s knowledge about the structure of the data. Start with the case where there are 256 symbols. That implies eight bits per symbol. Now take this comment, encode it as ASCII, and run it through gzip. The result is less than 8 bits per symbol.
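
You can sanity-check that with a couple of lines of Python (a rough sketch; the sample here is artificially repetitive, which flatters the ratio, but any longish English text will also come in well under 8 bits per symbol):

    import zlib

    # Any ASCII text works here; this string is just a self-contained sample.
    text = ("the number of bits per symbol is not tied to the number of "
            "symbols when there is knowledge about the structure of the data. "
            ) * 40
    data = text.encode("ascii")

    compressed = zlib.compress(data, 9)
    print(8 * len(compressed) / len(data), "bits per symbol")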

For a contrived example, consider a case where a language has three symbols, A, B, and C. In this language, A appears with a frequency of 999,999,998 per billion. B and C each appear with a frequency of one in a billion. Now, take some text from this language and apply a basic run-length encoding to it. You’ll end up with something on the order of a hundred bits per billion letters (each run costs around 30 bits for its length plus 2 bits for which letter it contains, and with B and C each showing up only once per billion there are just a handful of runs), which is way less than one bit per letter.
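
Under those assumptions (B and C each once per billion, so only a handful of runs per billion letters), the back-of-the-envelope arithmetic looks like this:

    import math

    run_length_bits = math.ceil(math.log2(10**9))  # ~30 bits covers any run up to a billion
    symbol_bits = 2                                 # which of A, B, C the run contains
    runs_per_billion = 4                            # roughly two huge A-runs plus the lone B and C
    bits_per_letter = runs_per_billion * (run_length_bits + symbol_bits) / 10**9
    print(bits_per_letter)                          # on the order of 1e-7 bits per letter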


> I.e., the string "I'v_" provides way more context than "con_": if I'm typing "I'v_" you're almost certain to get "I've", whereas "con_" could continue as "contraception" or plenty of other words.

Yes the entropy of the next letter always depends on the context. One bit per letter is just an average for all kinds of contexts.

> Also, the fact that there are more than two letters indicates more than one bit

Our alphabet is simply not the most efficient way of encoding information. It takes about 5 bits to encode 26 letters, space, comma and period. Even simple algorithms do much better: a per-letter Huffman code gets down to around 4 bits, and LZ77-style compressors to around 3. Current state-of-the-art algorithms compress the English Wikipedia using a mere 0.8 bits per character: https://www.mattmahoney.net/dc/text.html
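
As a rough illustration (the sample text is arbitrary, not any particular corpus): a fixed-width code over 29 symbols needs 5 bits, while a Huffman code built from a text's own letter frequencies already needs noticeably less:

    import heapq, math
    from collections import Counter

    sample = ("our alphabet is simply not the most efficient way of encoding "
              "information. it takes about five bits to encode the letters, "
              "space, comma and period.")

    print(math.ceil(math.log2(29)), "bits for a fixed-width code over 29 symbols")

    # Build Huffman code lengths from the sample's own symbol frequencies:
    # repeatedly merge the two least frequent subtrees; each merge adds one
    # bit to the code length of every symbol in both subtrees.
    freq = Counter(sample)
    heap = [(count, i, [ch]) for i, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    depth = {ch: 0 for ch in freq}
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for ch in s1 + s2:
            depth[ch] += 1
        heapq.heappush(heap, (c1 + c2, next_id, s1 + s2))
        next_id += 1

    avg = sum(freq[ch] * depth[ch] for ch in freq) / len(sample)
    print(round(avg, 2), "bits per letter with a Huffman code")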


>I don’t think anyone has a scheme that can do this

If you substitute "token" for "letter", what you have described is exactly what a modern LLM does, out of the box. llama.cpp even has a setting to show the logits, which emits the probability of each token (sadly, only for the text it outputs, not the text it ingests - oh well).

I don't think anyone actually uses this as a text compressor for reasons of practicality. But it's no longer a theoretical thought experiment - it's possible today, on a laptop. Certainly you can experimentally verify Shannon's result, if you believe that LLMs are a sufficiently high fidelity model of English (you should - it takes multiple sentences before it's possible to sniff that text is LLM generated, a piece of information worth a single bit).
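
For instance, here's a rough sketch of that measurement (assumptions: the Hugging Face transformers and torch packages are installed, gpt2 is just an illustrative small model, and "sample.txt" stands in for whatever English text you want to score). It converts the model's next-token cross-entropy into bits per character:

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")           # illustrative small model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    text = open("sample.txt").read()                      # placeholder: any English text
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    chars = len(tok.decode(ids[0]))                        # characters actually scored

    with torch.no_grad():
        # loss is the mean next-token cross-entropy, in nats, over the scored tokens
        loss = model(ids, labels=ids).loss

    total_bits = loss.item() * (ids.shape[1] - 1) / math.log(2)
    print(total_bits / chars, "bits per character")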

Oh look, Fabrice Bellard (who else?) already did it: https://bellard.org/ts_zip/ and you may note that indeed, it achieves a compression ratio of just north of 1 bit per byte, using a very small language model.



