
English is about one bit per letter. If you type at a very fast 120 WPM then you’re right at 10 bps. Computers just don’t represent English very efficiently, even with gzip.
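
For reference, the arithmetic behind that (a quick sketch, assuming the usual 5 characters per word and Shannon's ~1 bit per character):

    # Back-of-the-envelope typing rate in bits per second,
    # assuming ~5 characters per word and ~1 bit of entropy per character.
    wpm = 120                 # fast typist, words per minute
    chars_per_word = 5        # the usual convention for a "word"
    bits_per_char = 1.0       # Shannon's estimate for English text
    print(wpm * chars_per_word / 60 * bits_per_char)   # -> 10.0 bits per second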



Even very fast typists are unable to do stenography without a machine specialized to the task. Speech, in turn, can usually be understood at two or even three times the rate at which it is ordinarily produced. Meanwhile, I can read several times faster than I can understand speech, even at the highest speedup which I find coherent.

Ergo, 10 bits per second just doesn't hold up. It's an interesting coincidence that a reasonably fast typing speed hits that rate, but humans routinely operate on language at multiples of it.


I don’t think a difference of this magnitude meaningfully changes what the paper is talking about. They already have other human behaviors in their table with bit rates up to 5 times higher. Even if you set it at 100bps it wouldn’t change much. They’re addressing a difference of eight orders of magnitude. Making it seven instead of eight isn’t all that important.


No, but 10 bits/sec makes for better clickbait in a title. Science my ass.


> English is about one bit per letter

Where did you get that number from? How would you represent a letter using 1 bit?


It’s an experimental result by Shannon: https://archive.org/details/bstj30-1-50/page/n5/mode/1up

In short, you show someone an English text cut off at an arbitrary point and ask them to predict which letter comes next. Based on how successful they are, you can calculate the information content of the text. The result from this experiment was approximately one bit per letter.

Representing it is not the concern of the experiment. I don’t think anyone has a scheme that can do this. But it’s straightforward enough in theory. You create a compressor which contains a simulated human English speaker. At each point, ask the simulation to rank all the letters that might come next, in order. Emit the rank of the actual next letter into your compressed data. To decompress, run the same procedure, but apply the ranks you read from the data stream to the simulation’s predictions. If your simulation is deterministic, this will produce output matching the compressor’s input.
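
A minimal sketch of that procedure, with a simple adaptive letter-count model standing in for the simulated English speaker (a real attempt would need a far better model, and would entropy-code the ranks instead of storing them raw):

    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyz "

    def ranking(counts):
        # Deterministic ranking: most frequently seen letters first,
        # ties broken alphabetically. Any deterministic model works.
        return sorted(ALPHABET, key=lambda c: (-counts[c], c))

    def compress(text):
        counts, ranks = Counter(), []
        for ch in text:
            ranks.append(ranking(counts).index(ch))  # rank of the actual next letter
            counts[ch] += 1                          # model updates deterministically
        return ranks

    def decompress(ranks):
        counts, out = Counter(), []
        for r in ranks:
            ch = ranking(counts)[r]                  # same model, same prediction
            out.append(ch)
            counts[ch] += 1
        return "".join(out)

    msg = "this will produce output matching the compressor input"
    assert decompress(compress(msg)) == msg

With a model as good as a fluent reader, most of the ranks come out 0 or 1, and entropy-coding them is where the roughly one bit per letter shows up.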


Say that experiment is correct. Wouldn't that imply that the information content of a single letter varies based on the possible future permutations?

I.e., the string "I'v_" provides way more context than "con_", because you're much more likely to guess that I'm typing "I've" than, say, "contraception".

That seems to disprove the idea that a letter is a bit.

Also, the fact that there are more than two letters indicates more than one bit per letter, though I wouldn't even want to start guessing at the encoding scheme of the brain.


I don’t follow. Of course the probabilities change depending on context. 1 bit per letter is an average, not an exact measure for each individual letter. There are cases where the next letter is virtually guaranteed, and the information content of that letter is much less than one bit. There are cases where it could easily be many different possibilities and that’s more than one bit. On average it’s about one bit.

> Also, the fact that there are more than two letters indicates more than one bit per letter

This seems to deny the possibility of data compression, which I hope you’d reconsider, given that this message has probably been compressed and decompressed several times before it gets to you.

Anyway, it should be easy to see that the number of bits per symbol isn’t tied to the number of symbols when there’s knowledge about the structure of the data. Start with the case where there are 256 symbols. That implies eight bits per symbol. Now take this comment, encode it as ASCII, and run it through gzip. The result is less than 8 bits per symbol.
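
A quick sanity check of that (sketch only; repeating the text just gives gzip enough input to overcome its fixed header overhead, which flatters the number, but any long stretch of ordinary English also comes in well under 8 bits):

    import gzip

    sentence = ("Anyway, it should be easy to see that the number of bits per "
                "symbol is not tied to the number of symbols when there is "
                "knowledge about the structure of the data. ")
    raw = (sentence * 20).encode("ascii")
    packed = gzip.compress(raw)
    print(8 * len(packed) / len(raw))   # bits per character, well under 8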

For a contrived example, consider a case where a language has three symbols, A, B, and C. In this language, A appears with a frequency of 999,999,998 per billion. B and C each appear with a frequency of one in a billion. Now, take some text from this language and apply a basic run-length encoding to it. You’ll end up with something like 60 bits per billion letters on average (there are about two runs of A per billion letters, each roughly half a billion long, so around 30 bits to encode each run length plus a couple of bits for the letter that interrupts it), which is way less than one bit per letter.
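
For comparison, the actual Shannon entropy of that contrived source is in the same ballpark (quick sketch):

    from math import log2

    # Per-letter entropy of the three-symbol language described above.
    p_a = 999_999_998 / 1_000_000_000
    p_b = p_c = 1 / 1_000_000_000
    h = -sum(p * log2(p) for p in (p_a, p_b, p_c))
    print(h)                        # ~6.3e-08 bits per letter
    print(h * 1_000_000_000)        # ~63 bits per billion letters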


> I.e., the string "I'v_" provides way more context than "con_", because you're much more likely to guess that I'm typing "I've" than, say, "contraception".

Yes the entropy of the next letter always depends on the context. One bit per letter is just an average for all kinds of contexts.

> Also, the fact that there are more than two letters indicates more than one bit per letter

Our alphabet is simply not the most efficient way of encoding information. It takes about 5 bits to encode 26 letters, space, comma and period. Even simple algorithms like Huffman or LZ77 need only about 3 bits per letter. Current state-of-the-art algorithms compress the English Wikipedia using a mere 0.8 bits per character: https://www.mattmahoney.net/dc/text.html
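
The fixed-width number is just a logarithm (quick sketch):

    from math import ceil, log2

    symbols = 26 + 3                  # letters plus space, comma, period
    print(log2(symbols))              # ~4.86 bits
    print(ceil(log2(symbols)))        # 5 bits for a fixed-width code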


>I don’t think anyone has a scheme that can do this

If you substitute "token" for "letter", what you have described is exactly what a modern LLM does, out of the box. llama.cpp even has a setting, "show logits", which emits the probability of each token (sadly, only for the text it outputs, not the text it ingests - oh well).

I don't think anyone actually uses this as a text compressor for reasons of practicality. But it's no longer a theoretical thought experiment - it's possible today, on a laptop. Certainly you can experimentally verify Shannon's result, if you believe that LLMs are a sufficiently high fidelity model of English (you should - it takes multiple sentences before it's possible to sniff that text is LLM generated, a piece of information worth a single bit).

Oh look, Fabrice Bellard (who else?) already did it: https://bellard.org/ts_zip/ and you may note that indeed, it achieves a compression ratio of just north of 1 bit per byte, using a very small language model.
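
If you want to reproduce the measurement yourself, here's a rough sketch using the Hugging Face transformers API (gpt2 is just a convenient small model; this only measures the information content, whereas an actual compressor like ts_zip has to couple the model's probabilities to an entropy coder):

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    name = "gpt2"                                   # any causal LM will do
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    text = "Even very fast typists are unable to do stenography without a machine."
    ids = tok(text, return_tensors="pt").input_ids  # [1, seq]

    with torch.no_grad():
        logits = model(ids).logits                  # [1, seq, vocab]

    # -log2 p(actual token | preceding tokens), summed over the text
    # (the first token has no context, so it is not counted here).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    picked = logprobs.gather(1, ids[0, 1:, None]).squeeze(1)
    bits = -picked.sum().item() / math.log(2)
    print(bits / len(text))                         # bits per character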


In practice, it is even less. Current state-of-the-art algorithms compress the English Wikipedia using just 0.8 bits per character: https://www.mattmahoney.net/dc/text.html


Why do you think Wikipedia is an accurate representation of the English language? Is it just because it's large?

As an encyclopedia, it has an intentionally limited and factual way of describing things, which lacks a lot of important parts of language like poetry, allegory, metaphor, slang, regional dialects, and so on.

The fact that it can be compressed down so much probably just means it has a ton of repetition.


These letters are jointly distributed, and the entropy of the joint distribution of a second of "plausible" English text is much lower than the naive sum of the marginal entropies of each letter. In fact, with LLMs that report the exact probability distribution of each token, it is now possible to get a pretty decent estimate of what the entropy of larger segments of English text actually is.


What if you are typing not an English text, but a series of random letters? This gets you to 5-6 bits per letter.
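
Those figures are just the log of however many keys you're drawing from (sketch; the exact key count is a guess):

    from math import log2

    print(log2(26))            # ~4.7 bits: uniform random lowercase letters
    print(log2(26 + 10 + 10))  # ~5.5 bits: letters plus digits and ~10 punctuation keys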


I think this gets into what you consider to be “information.” Random noise is high entropy and thus high information in one sense, and zero information in another.


Well, the information used in the article is classical Shannon information, so the former. Though I suspect that the entropy of what we can actually "randomly" type is not that high.


> English is about one bit per letter.*

* when whole sentences or paragraphs are considered.


I’d say that is implied by “English.”

Entropy is a measure of the source, not the output.


What else would we consider?


The symbols aka words of the language itself?


I’m afraid I don’t understand your point.

If someone types English for a minute at 120WPM then they’ll have produced about 600 bits of information.

Are you saying we should consider the rate in a smaller window of time? Or we should consider the rate when the typist is producing a series of unrelated English words that don’t form a coherent sentence?


From the paper:

Take for example a human typist working from a hand-written manuscript. An advanced typist produces 120 words per minute. If each word is taken as 5 characters, this typing speed corresponds to 10 keystrokes a second. How many bits of information does that represent? One is tempted to count the keys on the keyboard and take the logarithm to get the entropy per character, but that is a huge overestimate. Imagine that after reading the start of this paragraph you are asked what will be the next let…

English contains orderly internal structures that make the character stream highly predictable. In fact, the entropy of English is only ∼ 1 bit per character [1]. Expert typists rely on all this redundancy: if forced to type a random character sequence, their speed drops precipitously.

[1] Shannon CE. Prediction and Entropy of Printed English. Bell System Technical Journal. 1951;30(1):50-64.


How do you measure information density of English text?


You show a bunch of English speakers some text that’s cut off, and ask them to predict the next letter. Their success at prediction tells you the information content of the text. Shannon ran this experiment and got a result of about 1 bit per letter: https://archive.org/details/bstj30-1-50/page/n5/mode/1up
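
A toy version of turning those guesses into a number (the frequencies below are made up for illustration, not Shannon's data; roughly speaking, the entropy of the guess-rank histogram upper-bounds the per-letter entropy, because a shared deterministic predictor turns the letter sequence into the rank sequence and back):

    from math import log2

    # Hypothetical results: fraction of positions where the correct next
    # letter was the subject's 1st, 2nd, 3rd, ... guess.
    rank_freq = {1: 0.79, 2: 0.08, 3: 0.04, 4: 0.03, 5: 0.06}

    h = -sum(p * log2(p) for p in rank_freq.values())
    print(h)   # about 1.1 bits per letter with these made-up numbers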


OK. When talking about language I find it's always good to be explicit about what level you're talking about, especially when you're using terms as overloaded as "information". I'm not really sure how to connect this finding to semantics.


If the text can be reproduced with one bit per letter, then the semantic information content is necessarily at most equal to N bits where N is the length of the text in letters. Presumably it will normally be much less, since there are things like synonyms and equivalent word ordering which don’t change the meaning, but this gives a solid upper bound.



