What determines word length? Not frequency after all. (web.mit.edu)
70 points by ColinWright on May 23, 2011 | 27 comments



Uhm, basic linguistics teaches you about phonemes and morphemes. A phoneme is the smallest meaning-differentiating unit of a language, and a morpheme is the smallest meaning-carrying unit.

Now, the length of a word is pretty arbitrary: it's a function of how the phonemes in the word are spelled out, and it doesn't say much about the word's morphemes. Yes, generally a longer word can contain more morphemes, and thus more meaning, but, well, duh.

Something that would be genuinely interesting though is to compare the length of morphemes with their frequency or their complexity of meaning.


Exactly. That was my thought #1: phonetically, rough = ruf. Thought #2: I'd love to see the study repeated in my native German, which has nearly phonetic spelling and a different way of constructing words. My wife's favorite (she's American): Glove = Handschuh, a shoe for your hands.

Flipping through an English-German dictionary you'll find a lot of specific English nouns that have longer compound nouns in German.


The rough <-> ruf example is a great one. To support the researchers' hypothesis, one would have to argue either that the "ough" spelling conveys additional meaning in both written and spoken forms, or that such examples are so small a subset of the words studied that they are insignificant to the larger hypothesis.

I sometimes catch myself thinking about the etymology of words as I use them. For example, whenever I use the word conspire I picture two people breathing together. Perhaps if I didn't picture this I'd just use the word "plot" instead in such cases.


"But now a team of MIT cognitive scientists has developed an alternative notion, on the basis of new research: A word’s length reflects the amount of information it contains."

I'm trying very, very hard not to say "DOH!". I mean, isn't this the expected result?


If you assume a view in which syntax produces a predictable semantics, then yes. The number 123321 contains more information than 3. But natural languages do not have a semantics produced by a syntax in an easily predictable way. I think it isn't a novel result, but for different reasons.


"The number 123321 contains more information than 3."

This isn't true, or at least, it isn't true in most relevant ways.

Most obviously, if I need to specify which record in my database (of, say, a million records) is to be read, then "3" and "123321" contain precisely the same amount of information.
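(Quick back-of-the-envelope sketch in Python, assuming the million records are equally likely and all we're communicating is which one to read: either answer identifies exactly one record, so either way it carries the same ~20 bits.)

    import math

    # Picking one record out of a million conveys log2(1,000,000) bits,
    # regardless of whether that record happens to be number 3 or
    # number 123321.
    print(math.log2(1_000_000))  # ~19.93 bits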

But I think the same holds even in more common cases: "how many birds are at the bird feeder?"; "how many gallons did it take to fill your gas tank?"; "what's your GPA?" In each, the answer, regardless of its magnitude, conveys the same amount of information.

Now, precision is a different story. "I'm going to take a couple of weeks' vacation in July" contains less information than "I'm going to take 11 days of vacation in July". "My gas tank has a capacity of 16 gallons" contains less information than "my gas tank has a capacity of 15.9 gallons".
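(Illustrative arithmetic only; the prior and the range that "a couple of weeks" covers below are made-up assumptions, but they show why the vaguer phrase carries fewer bits.)

    import math

    # Assume the listener's prior is a uniform 1..31 days of vacation.
    prior = 31
    # Suppose "a couple of weeks" is consistent with 10..17 days (8 values),
    # while "11 days" pins it down to exactly one value.
    print(math.log2(prior / 8))  # ~1.95 bits conveyed by the vague answer
    print(math.log2(prior / 1))  # ~4.95 bits conveyed by the exact answer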


There is interplay. "There are on the order of fifty birds at the bird feeder" is more informative than "there are on the order of zero birds at the bird feeder".


But then again, string A) 111111111111111 contains less information than string B) 48611, since string A can be written as a small program: 1*15.

As in the article, it could be far more likely for a 1 to follow a 1 than for any other digit to (Markov-chain-style probability), making string A more probable and therefore lower in information.

The overarching notion to me seems to be (Kolmogorov) complexity. The size of the smallest program that can reproduce the string/word/utterance denotes the complexity of that string/word/utterance.

We use less complex words more often, not because they are shorter, but because their optimal compressed form (not the number of morphemes, but their order and complexity) is shorter than that of other words. The program to produce it in our brains is smaller and simpler.
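A rough way to see the intuition, using zlib in Python as a crude stand-in for Kolmogorov complexity (the strings and lengths below are just illustrative):

    import random
    import zlib

    random.seed(0)

    # String A: highly regular, "1" repeated many times.
    regular = b"1" * 1000
    # String B: digits with no obvious pattern, same length.
    irregular = bytes(random.choice(b"0123456789") for _ in range(1000))

    # The regular string compresses to a much shorter description, which is
    # the sense in which it "contains less information".
    print(len(zlib.compress(regular)))    # small (a couple dozen bytes)
    print(len(zlib.compress(irregular)))  # much larger (a few hundred bytes)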


Suppose then I run a lottery, drawing 5 decimal digits. We have a special deal though -- if all 5 digits match, we draw again, up to 3 more times, and award a unique prize to the ticket holder for each result. We always sell precisely 10000 tickets (you don't get to pick your ticket).

Last week, the results were 11111 11111 11111. This week, the results are 48611.

In this case, would you suggest that last week's results convey less information? They do convey more facts overall than this week's results.


To me it is clearer to say that it is not the absolute probability but the conditional probability of a word (in its typical contexts) that determines word length. Now that I think about it, it is trivial that this is optimal for efficiency. Not a very deep result.


One thing that isn't trivial is the nature of the constraints that produce the various designs of human language. Are they constrained by some sort of innate "Universal Grammar" coded for in DNA? Or by ease of learning, given general learning mechanisms (not specific to language)? Or by efficiency of use? Or by a balance of information theoretic efficiency (high bandwidth) and redundancy (high accuracy despite noise)? Or something else more significant? Maybe some form of flexibility (greater efficiency in low-noise and redundancy in high)? Or all of the above, in differing proportions for different aspects of language?

We need to measure a lot of things in a lot of different ways in a lot of different languages before the contributions of the various constraints reflected in "the design" become clear. It's not trivial that we will discover (nor has this study come close to proving) that "optimal efficiency" is the bottom line.


It may not seem deep if you're knowledgeable about information theory, but information theory is still known mainly to mathematicians, cutting-edge physicists, and computer scientists with a lot of academic study (i.e., your average four-year graduate has probably never even heard of it). It really hasn't penetrated anywhere near as far as it needs to. In the meantime there are still a lot of rather "obvious" papers to be written across numerous disciplines that amount to "Hey, information theory is cool and helpful to us."


I wonder about the causation here. Maybe short words get used more often, such that their meaning is diluted and the information contained decreases.

An interesting (counter?) example is "use" vs "usage" vs "utilization".


The example compared a declarative sentence and an idiom.

Where it concerns nouns and verbs, this study sounds like the old Norman nonsense that turned perfectly good Anglo-Saxon words into the vulgar, lower-class vocabulary. Regarding articles it is obvious, since articles are combinatory by definition, and the "finding" does not exclude "the," "la," etc. also being short simply for convenience's sake.


Seems to me that this could just be a reversal of cause and effect. Basically, short words more easily engender idioms containing them. For example, "It's getting late, wanna grab lunch?" combines a couple short words into two common phrases, whereas "It's mid-afternoon, I'm hankering for sustenance" does not.


This seems like it would also apply quite well to variable naming in programming. Naming a variable 'a' just because it's frequently used would be misguided.


Yes, but there is a tradition (at least in some languages) of naming the loop variable i, for example: for (int i = 0; i < n; i++).


This seems to bring additional evidence for the information-content hypothesis: because it is common and expected, it does not have a high information density.


No surprises there for anyone who has ever played enough Scrabble to sit down and memorise the list of two-letter words. Aa? Qi? Ne? Ae, anyone?


A first step toward relating linguistics to entropy. Read: physics meets language. Expect another decade before anyone answers why words carry meanings.


What type of information density does "God" have?

None or infinite?


It'd be pretty low, considering the cardinality of the set of attested "gods" ;)


Richard Dawkins likes to say that atheists and Christians are almost identical. They have both rejected the existence of huge numbers of alleged gods; the atheist just adds one more to that list.


That's kind of like saying Jeffrey Dahmer and I are also almost identical, since there are billions of people neither of us have killed and eaten. Bill Clinton and I are almost identical, since there are so many nations neither of us have been the president of. And so on.

In many contexts, the difference between zero and one is extremely significant.


or it's a combination of both...


The explanation seems a little sloppy. If they used Markov chains, then a chain of length 1 is the same as frequency, and a chain of length > 1 will be a slightly better fit.

They've generalized the model, and their generalization seems to work a bit better.


That seems right, and makes the write-up a bit strange. They seem to be positioning it as if they're comparing two completely unrelated hypotheses: the frequency hypothesis versus the information-content hypothesis. But as you point out, their method of measuring information content (following Shannon) is simply n-gram frequency. The difference is that they set n=2,3,4 rather than n=1.

It's also not clear that it differs from the original motivation for the frequency hypothesis: (some) people advancing the "more common words are shorter" hypothesis argue in part from something like a compression argument, that common words are short for efficiency reasons. Arguing that predictable (low-information-content) words are short instead is an interesting refinement, but not a completely different claim. Just choosing string length by individual word frequency is actually a sort of crappy compression algorithm, and this paper seems to show that English's built-in compression is better than that, taking sequence frequencies into account.
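For the curious, here is a minimal Python sketch of what "information content under an n-gram model" amounts to (toy corpus and a bare bigram estimate, purely for illustration; the actual study's corpora and estimates are much larger and more careful):

    import math
    from collections import Counter

    # Tiny toy corpus; real measurements use large n-gram corpora.
    corpus = "the cat sat on the mat the cat ate the rat".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def surprisal(prev, word):
        # -log2 P(word | prev), estimated from raw bigram counts.
        return -math.log2(bigrams[(prev, word)] / unigrams[prev])

    # Averaging this over a word's occurrences in context gives the
    # "information content" that gets correlated with word length;
    # with n = 1 it collapses back to plain word frequency.
    print(surprisal("the", "cat"))  # 1.0 bit: "cat" is predictable here
    print(surprisal("the", "rat"))  # 2.0 bits: "rat" is less predictable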



