
It’s an experimental result by Shannon: https://archive.org/details/bstj30-1-50/page/n5/mode/1up

In short, you show someone an English text cut off at an arbitrary point and ask them to predict which letter comes next. Based on how successful they are, you can calculate the information content of the text. The result from this experiment was approximately one bit per letter.

Representing it is not the concern of the experiment. I don’t think anyone has a scheme that can do this. But it’s straightforward enough in theory. You create a compressor which contains a simulated human English speaker. At each point, ask the simulation to rank all the letters that might come next, in order. Emit the rank of the actual next letter into your compressed data. To decompress, run the same procedure, but apply the ranks you read from the data stream to the simulation’s predictions. If your simulation is deterministic, this will produce output matching the compressor’s input.
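
For illustration, here’s a minimal Python sketch of that scheme, with a fixed letter ranking as a toy stand-in for the simulated English speaker (a real predictor would rank letters by their probability given the preceding text, and you’d then entropy-code the ranks rather than store them raw):

    # Toy deterministic "predictor": a fixed ranking of space plus the 26
    # lowercase letters, ignoring context. A real model would rank candidates
    # by P(next letter | preceding text).
    RANKING = " etaoinshrdlcumwfgypbvkjxqz"

    def rank_next(context):
        return RANKING

    def compress(text):
        # Emit, for each letter, its position in the predictor's ranking.
        return [rank_next(text[:i]).index(ch) for i, ch in enumerate(text)]

    def decompress(ranks):
        # Replay the same predictor and apply the stored ranks.
        out = ""
        for r in ranks:
            out += rank_next(out)[r]
        return out

    msg = "this will produce output matching the input"
    assert decompress(compress(msg)) == msg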




Say that experiment is correct. Wouldn't that imply that the information content of a single letter varies based on the possible continuations?

I.e., the string "I'v_" provides way more context than "con_": if I'm typing "I'v_" you're almost certain to get "I've", whereas "con_" could continue as "contraception" or plenty of other words.

That seems to disprove the idea that a letter is a bit.

Also, the fact that there are more than two letters indicates more than one bit, though I wouldn't even want to start guessing at the encoding scheme of the brain.


I don’t follow. Of course the probabilities change depending on context. 1 bit per letter is an average, not an exact measure for each individual letter. There are cases where the next letter is virtually guaranteed, and the information content of that letter is much less than one bit. There are cases where many different letters are plausible, and there it’s more than one bit. On average it’s about one bit.

> Also, the fact that there are more than two letters indicates more than one bit

This seems to deny the possibility of data compression, which I hope you’d reconsider, given that this message has probably been compressed and decompressed several times before it gets to you.

Anyway, it should be easy to see that the number of bits per symbol isn’t tied to the number of symbols when there’s knowledge about the structure of the data. Start with the case where there are 256 symbols. That implies eight bits per symbol. Now take this comment, encode it as ASCII, and run it through gzip. The result is less than 8 bits per symbol.
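
You can sanity-check that with a couple of lines of Python (a rough sketch; the sample here is artificially repetitive, which flatters the ratio, but any longish English text will also come in well under 8 bits per symbol):

    import zlib

    # Any ASCII text works here; this string is just a self-contained sample.
    text = ("the number of bits per symbol is not tied to the number of "
            "symbols when there is knowledge about the structure of the data. "
            ) * 40
    data = text.encode("ascii")

    compressed = zlib.compress(data, 9)
    print(8 * len(compressed) / len(data), "bits per symbol")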

For a contrived example, consider a case where a language has three symbols, A, B, and C. In this language, A appears with a frequency of 999,999,998 per billion. B and C each appear with a frequency of one in a billion. Now, take some text from this language and apply a basic run-length encoding to it. You’ll end up with something on the order of a hundred bits per billion letters (each run costs around 30 bits for its length plus 2 bits for which letter it contains, and with B and C each showing up only once per billion there are just a handful of runs), which is way less than one bit per letter.
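
Under those assumptions (B and C each once per billion, so only a handful of runs per billion letters), the back-of-the-envelope arithmetic looks like this:

    import math

    run_length_bits = math.ceil(math.log2(10**9))  # ~30 bits covers any run up to a billion
    symbol_bits = 2                                 # which of A, B, C the run contains
    runs_per_billion = 4                            # roughly two huge A-runs plus the lone B and C
    bits_per_letter = runs_per_billion * (run_length_bits + symbol_bits) / 10**9
    print(bits_per_letter)                          # on the order of 1e-7 bits per letter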


> I.e., the string "I'v_" provides way more context than "con_": if I'm typing "I'v_" you're almost certain to get "I've", whereas "con_" could continue as "contraception" or plenty of other words.

Yes the entropy of the next letter always depends on the context. One bit per letter is just an average for all kinds of contexts.

> Also, the fact that there are more than two letters indicates more than one bit

Our alphabet is simply not the most efficient way of encoding information. It takes about 5 bits to encode 26 letters, space, comma and period. Even simple algorithms do much better: a per-letter Huffman code gets down to around 4 bits, and LZ77-style compressors to around 3. Current state-of-the-art algorithms compress the English Wikipedia using a mere 0.8 bits per character: https://www.mattmahoney.net/dc/text.html
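
As a rough illustration (the sample text is arbitrary, not any particular corpus): a fixed-width code over 29 symbols needs 5 bits, while a Huffman code built from a text's own letter frequencies already needs noticeably less:

    import heapq, math
    from collections import Counter

    sample = ("our alphabet is simply not the most efficient way of encoding "
              "information. it takes about five bits to encode the letters, "
              "space, comma and period.")

    print(math.ceil(math.log2(29)), "bits for a fixed-width code over 29 symbols")

    # Build Huffman code lengths from the sample's own symbol frequencies:
    # repeatedly merge the two least frequent subtrees; each merge adds one
    # bit to the code length of every symbol in both subtrees.
    freq = Counter(sample)
    heap = [(count, i, [ch]) for i, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    depth = {ch: 0 for ch in freq}
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for ch in s1 + s2:
            depth[ch] += 1
        heapq.heappush(heap, (c1 + c2, next_id, s1 + s2))
        next_id += 1

    avg = sum(freq[ch] * depth[ch] for ch in freq) / len(sample)
    print(round(avg, 2), "bits per letter with a Huffman code")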


>I don’t think anyone has a scheme that can do this

If you substitute "token" for "letter", what you have described is exactly what a modern LLM does, out of the box. llama.cpp even has a setting to show the logits, which emits the probability of each token (sadly, only for the text it outputs, not the text it ingests - oh well).

I don't think anyone actually uses this as a text compressor for reasons of practicality. But it's no longer a theoretical thought experiment - it's possible today, on a laptop. Certainly you can experimentally verify Shannon's result, if you believe that LLMs are a sufficiently high fidelity model of English (you should - it takes multiple sentences before it's possible to sniff that text is LLM generated, a piece of information worth a single bit).
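
For instance, here's a rough sketch of that measurement (assumptions: the Hugging Face transformers and torch packages are installed, gpt2 is just an illustrative small model, and "sample.txt" stands in for whatever English text you want to score). It converts the model's next-token cross-entropy into bits per character:

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")           # illustrative small model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    text = open("sample.txt").read()                      # placeholder: any English text
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    chars = len(tok.decode(ids[0]))                        # characters actually scored

    with torch.no_grad():
        # loss is the mean next-token cross-entropy, in nats, over the scored tokens
        loss = model(ids, labels=ids).loss

    total_bits = loss.item() * (ids.shape[1] - 1) / math.log(2)
    print(total_bits / chars, "bits per character")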

Oh look, Fabrice Bellard (who else?) already did it: https://bellard.org/ts_zip/ and you may note that indeed, it achieves a compression ratio of just north of 1 bit per byte, using a very small language model.



