
English is about one bit per letter. If you type at a very fast 120 WPM then you’re right at 10 bps. Computers just don’t represent English very efficiently, even with gzip.
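
For reference, the arithmetic behind that (a quick sketch, assuming the usual 5 characters per word and Shannon's ~1 bit per character):

    # Back-of-the-envelope typing rate in bits per second,
    # assuming ~5 characters per word and ~1 bit of entropy per character.
    wpm = 120                 # fast typist, words per minute
    chars_per_word = 5        # the usual convention for a "word"
    bits_per_char = 1.0       # Shannon's estimate for English text
    print(wpm * chars_per_word / 60 * bits_per_char)   # -> 10.0 bits per second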



Even very fast typists are unable to do stenography without a machine specialized to the task. Speech, in turn, can usually be understood at two or even three times the rate at which it is ordinarily produced. Meanwhile, I can read several times faster than I can understand speech, even at the highest speedup which I find coherent.

Ergo, 10 bits per second just doesn't hold up. It's an interesting coincidence that a reasonably fast typing speed hits that rate, but humans routinely operate on language at multiples of it.


I don’t think a difference of this magnitude meaningfully changes what the paper is talking about. They already have other human behaviors in their table with bit rates up to 5 times higher. Even if you set it at 100bps it wouldn’t change much. They’re addressing a difference of eight orders of magnitude. Making it seven instead of eight isn’t all that important.


No, but 10 bits/sec makes for better clickbait in a title. Science my ass.


> English is about one bit per letter

Where did you get that number from? How would you represent a letter using 1 bit?


It’s an experimental result by Shannon: https://archive.org/details/bstj30-1-50/page/n5/mode/1up

In short, you show someone an English text cut off at an arbitrary point and ask them to predict which letter comes next. Based on how successful they are, you can calculate the information content of the text. The result from this experiment was approximately one bit per letter.

Representing it is not the concern of the experiment. I don’t think anyone has a scheme that can do this. But it’s straightforward enough in theory. You create a compressor which contains a simulated human English speaker. At each point, ask the simulation to rank all the letters that might come next, in order. Emit the rank of the actual next letter into your compressed data. To decompress, run the same procedure, but apply the ranks you read from the data stream to the simulation’s predictions. If your simulation is deterministic, this will produce output matching the compressor’s input.
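
A minimal sketch of that procedure, with a simple adaptive letter-count model standing in for the simulated English speaker (a real attempt would need a far better model, and would entropy-code the ranks instead of storing them raw):

    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyz "

    def ranking(counts):
        # Deterministic ranking: most frequently seen letters first,
        # ties broken alphabetically. Any deterministic model works.
        return sorted(ALPHABET, key=lambda c: (-counts[c], c))

    def compress(text):
        counts, ranks = Counter(), []
        for ch in text:
            ranks.append(ranking(counts).index(ch))  # rank of the actual next letter
            counts[ch] += 1                          # model updates deterministically
        return ranks

    def decompress(ranks):
        counts, out = Counter(), []
        for r in ranks:
            ch = ranking(counts)[r]                  # same model, same prediction
            out.append(ch)
            counts[ch] += 1
        return "".join(out)

    msg = "this will produce output matching the compressor input"
    assert decompress(compress(msg)) == msg

With a model as good as a fluent reader, most of the ranks come out 0 or 1, and entropy-coding them is where the roughly one bit per letter shows up.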


Say that experiment is correct. Wouldn't that imply that the information content of a single letter varies based on the possible future permutations?

I.e., the string "I'v_" provides way more context than "con_", because you're much more likely to guess that I'm typing "I've" than, say, "contraception".

That seems to disprove the idea that a letter is a bit.

Also, the fact that there are more than two letters indicates more than one bit per letter, though I wouldn't even want to start guessing at the encoding scheme of the brain.


I don’t follow. Of course the probabilities change depending on context. 1 bit per letter is an average, not an exact measure for each individual letter. There are cases where the next letter is virtually guaranteed, and the information content of that letter is much less than one bit. There are cases where it could easily be many different possibilities and that’s more than one bit. On average it’s about one bit.

> Also, the fact that there are more than two letters indicates more than one bit per letter

This seems to deny the possibility of data compression, which I hope you’d reconsider, given that this message has probably been compressed and decompressed several times before it gets to you.

Anyway, it should be easy to see that the number of bits per symbol isn’t tied to the number of symbols when there’s knowledge about the structure of the data. Start with the case where there are 256 symbols. That implies eight bits per symbol. Now take this comment, encode it as ASCII, and run it through gzip. The result is less than 8 bits per symbol.
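
A quick sanity check of that (sketch only; repeating the text just gives gzip enough input to overcome its fixed header overhead, which flatters the number, but any long stretch of ordinary English also comes in well under 8 bits):

    import gzip

    sentence = ("Anyway, it should be easy to see that the number of bits per "
                "symbol is not tied to the number of symbols when there is "
                "knowledge about the structure of the data. ")
    raw = (sentence * 20).encode("ascii")
    packed = gzip.compress(raw)
    print(8 * len(packed) / len(raw))   # bits per character, well under 8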

For a contrived example, consider a case where a language has three symbols, A, B, and C. In this language, A appears with a frequency of 999,999,998 per billion. B and C each appear with a frequency of one in a billion. Now, take some text from this language and apply a basic run-length encoding to it. You’ll end up with something like 60 bits per billion letters on average (there are about two runs of A per billion letters, each roughly half a billion long, so around 30 bits to encode each run length plus a couple of bits for the letter that interrupts it), which is way less than one bit per letter.
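
For comparison, the actual Shannon entropy of that contrived source is in the same ballpark (quick sketch):

    from math import log2

    # Per-letter entropy of the three-symbol language described above.
    p_a = 999_999_998 / 1_000_000_000
    p_b = p_c = 1 / 1_000_000_000
    h = -sum(p * log2(p) for p in (p_a, p_b, p_c))
    print(h)                        # ~6.3e-08 bits per letter
    print(h * 1_000_000_000)        # ~63 bits per billion letters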


> I.e., the string "I'v_" provides way more context than "con_", because you're much more likely to guess that I'm typing "I've" than, say, "contraception".

Yes the entropy of the next letter always depends on the context. One bit per letter is just an average for all kinds of contexts.

> Also, the fact that there are more than two letters indicates more than one bit per letter

Our alphabet is simply not the most efficient way of encoding information. It takes about 5 bits to encode 26 letters, space, comma and period. Even simple algorithms like Huffman or LZ77 need only about 3 bits per letter. Current state-of-the-art algorithms compress the English Wikipedia using a mere 0.8 bits per character: https://www.mattmahoney.net/dc/text.html
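
The fixed-width number is just a logarithm (quick sketch):

    from math import ceil, log2

    symbols = 26 + 3                  # letters plus space, comma, period
    print(log2(symbols))              # ~4.86 bits
    print(ceil(log2(symbols)))        # 5 bits for a fixed-width code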


>I don’t think anyone has a scheme that can do this

If you substitute "token" for "letter", what you have described is exactly what a modern LLM does, out of the box. llama.cpp even has a setting, "show logits", which emits the probability of each token (sadly, only for the text it outputs, not the text it ingests - oh well).

I don't think anyone actually uses this as a text compressor for reasons of practicality. But it's no longer a theoretical thought experiment - it's possible today, on a laptop. Certainly you can experimentally verify Shannon's result, if you believe that LLMs are a sufficiently high fidelity model of English (you should - it takes multiple sentences before it's possible to sniff that text is LLM generated, a piece of information worth a single bit).

Oh look, Fabrice Bellard (who else?) already did it: https://bellard.org/ts_zip/ and you may note that indeed, it achieves a compression ratio of just north of 1 bit per byte, using a very small language model.
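
If you want to reproduce the measurement yourself, here's a rough sketch using the Hugging Face transformers API (gpt2 is just a convenient small model; this only measures the information content, whereas an actual compressor like ts_zip has to couple the model's probabilities to an entropy coder):

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    name = "gpt2"                                   # any causal LM will do
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    text = "Even very fast typists are unable to do stenography without a machine."
    ids = tok(text, return_tensors="pt").input_ids  # [1, seq]

    with torch.no_grad():
        logits = model(ids).logits                  # [1, seq, vocab]

    # -log2 p(actual token | preceding tokens), summed over the text
    # (the first token has no context, so it is not counted here).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    picked = logprobs.gather(1, ids[0, 1:, None]).squeeze(1)
    bits = -picked.sum().item() / math.log(2)
    print(bits / len(text))                         # bits per character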


In practice, it is even less. Current state-of-the-art algorithms compress the English Wikipedia using just 0.8 bits per character: https://www.mattmahoney.net/dc/text.html


Why do you think Wikipedia is an accurate representation of the English language? Is it just because it's large?

As an encyclopedia, it has an intentionally limited and factual way of describing things, which lacks a lot of important parts of language like poetry, allegory, metaphor, slang, regional dialects, and so on.

The fact that it can be compressed down so much probably just means it has a ton of repetition.


These letters are jointly distributed, and the entropy of the joint distribution of a second of "plausible" English text is much lower than the naive sum of the marginal entropies of each letter. In fact, with LLMs that report the exact probability distribution of each token, it is now possible to get a pretty decent estimate of what the entropy of larger segments of English text actually is.


What if you are typing not an English text, but a series of random letters? This gets you to 5-6 bits per letter.
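
Those figures are just the log of however many keys you're drawing from (sketch; the exact key count is a guess):

    from math import log2

    print(log2(26))            # ~4.7 bits: uniform random lowercase letters
    print(log2(26 + 10 + 10))  # ~5.5 bits: letters plus digits and ~10 punctuation keys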


I think this gets into what you consider to be “information.” Random noise is high entropy and thus high information in one sense, and zero information in another.


Well, the information used in the article is classical Shannon information, so the former. Though I suspect that the entropy of what we can actually "randomly" type is not that high.


> English is about one bit per letter.*

* when whole sentences or paragraphs are considered.


I’d say that is implied by “English.”

Entropy is a measure of the source, not the output.


What else would we consider?


The symbols aka words of the language itself?


I’m afraid I don’t understand your point.

If someone types English for a minute at 120WPM then they’ll have produced about 600 bits of information.

Are you saying we should consider the rate in a smaller window of time? Or we should consider the rate when the typist is producing a series of unrelated English words that don’t form a coherent sentence?


From the paper:

Take for example a human typist working from a hand-written manuscript. An advanced typist produces 120 words per minute. If each word is taken as 5 characters, this typing speed corresponds to 10 keystrokes a second. How many bits of information does that represent? One is tempted to count the keys on the keyboard and take the logarithm to get the entropy per character, but that is a huge overestimate. Imagine that after reading the start of this paragraph you are asked what will be the next let…

English contains orderly internal structures that make the character stream highly predictable. In fact, the entropy of English is only ∼ 1 bit per character [1]. Expert typists rely on all this redundancy: if forced to type a random character sequence, their speed drops precipitously.

[1] Shannon CE. Prediction and Entropy of Printed English. Bell System Technical Journal. 1951;30(1):50-64.


How do you measure information density of English text?


You show a bunch of English speakers some text that’s cut off, and ask them to predict the next letter. Their success at prediction tells you the information content of the text. Shannon ran this experiment and got a result of about 1 bit per letter: https://archive.org/details/bstj30-1-50/page/n5/mode/1up
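
A toy version of turning those guesses into a number (the frequencies below are made up for illustration, not Shannon's data; roughly speaking, the entropy of the guess-rank histogram upper-bounds the per-letter entropy, because a shared deterministic predictor turns the letter sequence into the rank sequence and back):

    from math import log2

    # Hypothetical results: fraction of positions where the correct next
    # letter was the subject's 1st, 2nd, 3rd, ... guess.
    rank_freq = {1: 0.79, 2: 0.08, 3: 0.04, 4: 0.03, 5: 0.06}

    h = -sum(p * log2(p) for p in rank_freq.values())
    print(h)   # about 1.1 bits per letter with these made-up numbers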


OK. When talking about language I find it's always good to be explicit about what level you're talking about, especially when you're using terms as overloaded as "information". I'm not really sure how to connect this finding to semantics.


If the text can be reproduced with one bit per letter, then the semantic information content is necessarily at most equal to N bits where N is the length of the text in letters. Presumably it will normally be much less, since there are things like synonyms and equivalent word ordering which don’t change the meaning, but this gives a solid upper bound.



