English Letter Frequency Counts: Mayzner Revisited (norvig.com)
180 points by phenylene on Jan 5, 2013 | 26 comments



I did something similar with a 10M tweet dataset a couple of years ago:

* http://ktype.net/wiki/research:articles:progress_20110209#tw...

* http://ktype.net/wiki/research:articles:progress_20110228?s#...

I would love to redo this analysis with newer tweets but, alas, I don't know where to get a usable corpus. Any suggestions? My goal was to explore many of the concepts from Norvig's http://norvig.com/spell-correct.html using the tweet dataset to build a better one-finger keyboard and word-prediction engine for iOS.
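
The counting behind such a prediction engine is small enough to sketch here: a minimal bigram-based next-word predictor, in the spirit of spell-correct.html. The file name and tokenization are assumptions:

    import re
    from collections import Counter, defaultdict

    # Count which word follows which; assumes one tweet per line in tweets.txt.
    bigrams = defaultdict(Counter)
    with open('tweets.txt') as f:
        for line in f:
            words = re.findall(r"[a-z']+", line.lower())
            for prev, cur in zip(words, words[1:]):
                bigrams[prev][cur] += 1

    def predict(prev, k=3):
        # Return the k most likely words to follow `prev`.
        return [w for w, _ in bigrams[prev].most_common(k)]

    print(predict('good'))  # e.g. ['morning', 'luck', 'night']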


I would think tweets would be skewed, no? Slang, memes, shrtnd txt 2 avoid char lmts, http://urls/, amongst others?


No corpus is free from bias, and each will have statistical properties that reflect its original selection criteria. Perhaps Mayzner's corpus, apparently based on a sample of literature, exhibits a bias away from the abbreviated forms widely used in written communication today.

So, if you wanted to tune your text prediction software for your phone...


Precisely. I was looking for predicting informal communication patterns, not formal book/newspaper style.


Unfortunately Twitter no longer allows redistribution of tweet content (http://readwrite.com/2011/03/03/how_recent_changes_to_twitte...). Datasift and Gnip offer commercial access to the full firehose, but for a sample it wouldn't take long to build your own corpus from the official Twitter streaming or search API.
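
For the streaming route, a small collector is only a few lines. A rough sketch, assuming the tweepy library (its pre-4.0 interface) and your own registered app credentials:

    import tweepy

    class CorpusCollector(tweepy.StreamListener):
        def __init__(self, outfile, limit=100000):
            super(CorpusCollector, self).__init__()
            self.out = open(outfile, 'a')
            self.remaining = limit

        def on_status(self, status):
            # One tweet per line; newlines inside tweets flattened.
            self.out.write(status.text.replace('\n', ' ') + '\n')
            self.remaining -= 1
            return self.remaining > 0  # returning False stops the stream

    auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')  # placeholders
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
    stream = tweepy.Stream(auth, CorpusCollector('tweets.txt'))
    stream.sample()  # ~1% random sample of public tweets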


@chime: I am working on something similar and it would be great to exchange ideas if you want to chat about it.


Since the original data includes the year of publication, it would be interesting to see trends in these datasets over time, e.g. which words are becoming more or less popular, whether average word length is decreasing, and whether the variety of words in common use is increasing or decreasing.
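
With the Google Books 1-gram files (tab-separated rows of word, year, match_count, volume_count) that's a straightforward aggregation. A sketch for average word length per year; the file name is a placeholder for the real compressed shards:

    from collections import defaultdict

    length_sum = defaultdict(int)   # year -> sum of len(word) * count
    token_count = defaultdict(int)  # year -> sum of count

    with open('googlebooks-eng-1gram.tsv') as f:
        for line in f:
            word, year, count, _ = line.rstrip('\n').split('\t')
            if word.isalpha():
                length_sum[int(year)] += len(word) * int(count)
                token_count[int(year)] += int(count)

    for year in sorted(token_count):
        print(year, length_sum[year] / float(token_count[year]))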


You might be into corpus linguistics...


I thought this might help in designing a new keyboard layout that is better (from a scientific/statistical perspective) than Dvorak (https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard), which is based on research from more than 80 years ago.


But people don't write just English words. Norvig dismisses any word containing a character outside a-z, but those words still get typed, as do numbers, punctuation, and not-real-words like names.

Redesigning a keyboard should be based on what people need to type, rather than on how English words are structured.


The results seem consistent with previous data. There are many recent layouts (e.g. Colemak[1]) which attempt further optimization, but simulations show little actual difference in strain (~5%) between them. They are all better than qwerty in this regard, but optimizing any further is a clear case of diminishing returns.

It is actually much more difficult to model finger strain than the English language (in terms of n-grams). Subjective assessments vary a lot, and the quest for the optimal layout is bathed in controversy. Beyond switching away from qwerty, the most significant gains will be made by hardware solutions, like using an ergonomic keyboard (such as the very promising ErgoDox [2]). Other optimizations may come in the form of chorded keyers and better predictive technology. In this last case, the Google data may prove useful. (For a flavor of how such simulations work, see the toy scorer after the links below.)

[1] http://colemak.com/

[2] http://deskthority.net/workshop-f7/split-ergonomic-keyboard-...
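
Here is a toy effort model; the per-key costs and the reduction of "strain" to row position are made-up simplifications, and real scorers also model finger travel, alternation, and same-finger penalties:

    from collections import Counter

    def score(layout_rows, text):
        # Home row cheapest, top row next, bottom row most expensive.
        costs = {}
        for row, cost in zip(layout_rows, (1.5, 1.0, 2.0)):
            for ch in row:
                costs[ch] = cost
        freqs = Counter(c for c in text.lower() if c in costs)
        total = sum(freqs.values()) or 1
        return sum(costs[c] * n for c, n in freqs.items()) / total

    qwerty  = ('qwertyuiop', 'asdfghjkl', 'zxcvbnm')
    colemak = ('qwfpgjluy', 'arstdhneio', 'zxcvbkm')

    sample = open('corpus.txt').read()  # any representative text
    print('qwerty :', score(qwerty, sample))
    print('colemak:', score(colemak, sample))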


@gokfar: Seems like you have some background in keyboard layout optimization. Are you currently working on anything related? I am, and it would be great to chat with you about it if you're in the MV area.


On a tangential note, does anyone know what he's using to generate those bar graphs automatically from those tiny images? Nifty trick.
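
If it's the classic HTML trick, the bars are just a tiny solid-color image stretched via the width attribute, so no plotting library is needed. A guess at the approach; pix.png and the scale factor are assumptions (the example frequencies are from the article):

    # Emit an <img> whose width encodes the value; pix.png is assumed
    # to be a 1x1 (or thin) solid-color image.
    def bar(value, scale=300, img='pix.png'):
        return '<img src="%s" height="10" width="%d">' % (img, int(value * scale))

    for letter, freq in [('e', 0.1249), ('t', 0.0928), ('a', 0.0804)]:
        print('%s %s %.2f%%' % (letter, bar(freq), freq * 100))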


Mr Norvig, please share the code. I'm sure it's some interesting Lisp or Python.


Norvig's spell corrector is the nicest code I've ever read. But every analysis he makes here is just counting, there is nothing computationally interesting happening. I doubt the code is really that much to look at.
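
For what it's worth, the core counting does fit in a few lines. A sketch (not Norvig's actual code), assuming a plain-text corpus.txt:

    import re
    from collections import Counter

    text = open('corpus.txt').read().lower()
    words = re.findall('[a-z]+', text)

    # Letter and letter-bigram counts: the heart of the analysis.
    letters = Counter(c for w in words for c in w)
    bigrams = Counter(w[i:i+2] for w in words for i in range(len(w) - 1))

    print(letters.most_common(5))  # expect e, t, a, o, i near the top
    print(bigrams.most_common(5))  # expect th, he, in, er, an near the top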


It's not apparent to me what was used when calculating the frequency of bigrams and other n-grams. Was this based on the overall dataset or on the dictionary alone?

I believe it would be beneficial to see a version based on the dictionary words alone, as that would ensure no duplicate words exist to affect the n-gram counts, acting as a control group.
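
To make the distinction concrete, here is a token-weighted count versus a type-based (dictionary) count; the word list and frequencies are invented:

    from collections import Counter

    counts = {'the': 1000, 'then': 50, 'theme': 5}  # illustrative numbers

    def bigrams(word):
        return [word[i:i+2] for i in range(len(word) - 1)]

    token_based = Counter()  # weighted by how often each word occurs
    for word, n in counts.items():
        for bg in bigrams(word):
            token_based[bg] += n

    type_based = Counter()   # each dictionary entry counted once
    for word in counts:
        for bg in bigrams(word):
            type_based[bg] += 1

    print(token_based['th'], type_based['th'])  # 1055 vs. 3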


Many years ago I read an article claiming the order is etoanirsh... I still use it in hangman games.



The real order is this one...

http://en.wikipedia.org/wiki/ETAION_SHRDLU


You should read the linked article.


Service Temporarily Unavailable The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later. Apache/1.3.42 Server at norvig.com Port 80


But "forschungsgemeinschaft" is German, and it's mentioned frequently: at least 100,000 times each in the book corpus.

I don't trust that his corpus is comparable to Mayzner's original English corpus.


As you might have seen, all those words with twenty letters or more are very weird. That’s because there aren’t very many commonly used long English words.

Also, at least 100,000 times isn’t so frequent, given Google’s huge corpus.

Here is the relative frequency of “Forschungsgemeinschaft” over time: http://books.google.com/ngrams/graph?content=Forschungsgemei...

Here are results on Google Books when searching for “Forschungsgemeinschaft”: http://www.google.com/search?q=%22Forschungsgemeinschaft%22&...

Google doesn’t promise that their English corpus only contains English books, just that it contains mostly English books. Given the raw size of their corpus this algorithmic solution seems like a reasonable tradeoff to me.

The frequency of “Forschungsgemeinschaft” is also easy enough to explain: The “Deutsche Forschungsgemeinschaft” (German Research Foundation) provides lots of funding for research in Germany. It is consequently mentioned in lots of research papers, many of which are written in English.


A lot of foreign words are frequently encountered in English texts, especially in narrow domains like philosophy, biology, and medicine.

And seeing that:

http://en.wikipedia.org/wiki/Deutsche_Forschungsgemeinschaft

is a research institute, one would expect tons of mentions of this word in _English_ scientific papers.

But that is an outlier, and given the vastness of the corpus, such outliers could have been sorted out.

That said, there would have been an easy way to filter non-English books out automatically: do a statistical analysis on each book (e.g. on letter frequency) and reject the ones that stray too much from the norm -- or send them to a secondary filtering stage, e.g. by word presence or verification by a human. Done carefully, that filtering would not harm the actual results at all (e.g. by presupposing a specific letter frequency), because it would only reject extreme outliers that would indeed be non-English works.
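
A sketch of that first-stage filter; the reference frequencies are rounded English letter frequencies, and the threshold is an arbitrary placeholder that would need empirical tuning:

    from collections import Counter

    # Rounded English letter frequencies (fractions of all letters).
    ENGLISH = {'e': .127, 't': .091, 'a': .082, 'o': .075, 'i': .070,
               'n': .067, 's': .063, 'h': .061, 'r': .060, 'd': .043,
               'l': .040, 'u': .028, 'c': .028, 'm': .024, 'w': .024,
               'f': .022, 'g': .020, 'y': .020, 'p': .019, 'b': .015,
               'v': .010, 'k': .008, 'j': .002, 'x': .002, 'q': .001,
               'z': .001}

    def looks_english(text, threshold=0.01):
        # Mean absolute deviation between observed and reference frequencies;
        # the cutoff is a guess, to be tuned on known-language samples.
        letters = Counter(c for c in text.lower() if c in ENGLISH)
        total = sum(letters.values()) or 1
        dev = sum(abs(letters[c] / float(total) - ENGLISH[c]) for c in ENGLISH)
        return dev / len(ENGLISH) < threshold

    print(looks_english(open('book.txt').read()))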


It’s not a research institute; it’s a foundation that gives grants to researchers (the biggest research-grant-giving organization in Germany). As such it is also often mentioned in scientific papers. (“This study was funded in part by a grant from the …”)


How do the new results compare to Mayzner's?



