I would love to redo this analysis with newer Tweets but alas, don't know where to get a usable corpus. Any suggestions? My goal was to explore many of the concepts from norvig's http://norvig.com/spell-correct.html using the Tweet dataset to build a better one-finger-keyboard and word-prediction engine for iOS.
No corpus is immune from comparison, and each will have statistical parameters that reflect its original selection criteria.
Perhaps Mayzner's corpus, apparently based on a sample from literature, exhibits a bias away from the abbreviated forms widely used in written communication today.
So, if you wanted to tune your text prediction software for your phone...
Unfortunately Twitter no longer allows redistribution of tweet content (http://readwrite.com/2011/03/03/how_recent_changes_to_twitte...). Datasift and Gnip offer commercial access to the full firehose, but for a sample it wouldn't take long to build your own corpus from the official Twitter streaming or search API.
Since the original data includes the year of publication, it would be interesting to see trends in these datasets over time, e.g. which words are becoming more or less popular, whether average word length is shrinking over time, whether the variety of words in common use is increasing or decreasing, etc.
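A rough sketch of what that trend analysis could look like. The (year, word, count) records here are made up for illustration; the real input would come from the Google Books ngram files.

```python
from collections import defaultdict

# Hypothetical (year, word, count) records standing in for the real dataset.
records = [
    (1950, "telephone", 120), (1950, "cool", 40),
    (2000, "telephone", 60),  (2000, "cool", 150),
]

# Aggregate per-year totals so we can compare relative, not absolute, counts.
totals = defaultdict(int)
counts = defaultdict(int)
for year, word, n in records:
    totals[year] += n
    counts[(year, word)] += n

def relative_freq(word, year):
    """Share of all tokens in `year` taken by `word`."""
    return counts[(year, word)] / totals[year]

# A word is "gaining" if its share of the corpus grew between two years.
gaining = relative_freq("cool", 2000) > relative_freq("cool", 1950)
print(gaining)  # True for this toy data
```

The same per-year aggregation works for average word length or distinct-word counts: just swap what you accumulate in the loop.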
I thought this may help to design a new keyboard layout that is better (from a scientific/statistical perspective) than Dvorak (https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard), which is based on research from more than 80 years ago.
But people don't write just English words. Norvig dismisses any word with any character other than a-z, but those words still get typed. As do numbers, punctuation and not-real-words like names.
Redesigning a keyboard should be based on what people need to type, rather than on how English words are structured.
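To make that concrete, here's a quick Python sketch (toy input, nothing from the actual datasets) of what a raw keystroke tally looks like when you count what people actually type instead of filtering to a-z:

```python
from collections import Counter

# Toy sample of real-world typing: mixed case, digits, punctuation,
# and a name. Nothing here is filtered to a-z.
typed = "Meet @ 7pm? OK, bring $20 & ask for O'Brien!"

freq = Counter(typed.lower())

# Unlike a letters-only analysis, digits and punctuation show up too,
# and a layout optimizer would have to account for them.
non_letters = {ch: n for ch, n in freq.items() if not ch.isalpha()}
print(non_letters)
```

Run over a real chat or email log, the non-letter share is far from negligible, which is exactly the point above.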
The results seem consistent with previous data. There are many recent layouts (e.g. Colemak[1]) which attempt further optimization, but simulations show little actual difference in strain (~5%) between them. They are all better than qwerty in this regard, but optimizing any further is a clear case of diminishing returns.
It is actually much more difficult to model finger strain than the English language (in terms of n-grams). Subjective assessments vary a lot, and the quest for the optimal layout is bathed in controversy. Beyond switching away from qwerty, the most significant gains will be made by hardware solutions, like using an ergonomic keyboard (such as the very promising ErgoDox [2]). Other optimizations may come in the form of chorded keyers and better predictive technology. In that last case, the Google data may prove useful.
@gokfar: Seems like you have some background in keyboard layout optimization. Are you currently working on anything related? I am, and it would be great to chat with you about it if you're in the MV area.
Norvig's spell corrector is the nicest code I've ever read. But every analysis he makes here is just counting, there is nothing computationally interesting happening. I doubt the code is really that much to look at.
It's not apparent to me what was used when calculating the frequency of bigrams and other n-grams. Was this based on the overall dataset or the dictionary alone?
I believe it would be useful to see a version based on the dictionary words alone, as that would ensure repeated words don't affect the n-gram counts, acting as a control group.
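The two approaches do give quite different numbers. A toy Python sketch (invented counts, not the actual Google data) contrasting corpus-weighted bigram counts with the dictionary-only version:

```python
from collections import Counter

# Toy corpus: word -> occurrence count. In the corpus-weighted version a
# frequent word like "the" contributes its bigrams once per occurrence;
# in the dictionary-only version every distinct word counts exactly once.
corpus = {"the": 100, "then": 5, "cat": 10}

def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

weighted = Counter()
dictionary_only = Counter()
for word, n in corpus.items():
    for bg in bigrams(word):
        weighted[bg] += n         # weighted by how often the word occurs
        dictionary_only[bg] += 1  # each distinct word contributes once

print(weighted["th"], dictionary_only["th"])  # 105 vs 2
```

For text prediction you almost certainly want the weighted version; the dictionary-only counts are the "control group" described above.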
As you might have seen, all those words with twenty letters or more are very weird. That’s because there aren’t very many commonly used long English words.
Also, at least 100,000 times isn’t so frequent, given Google’s huge corpus.
Google doesn’t promise that their English corpus only contains English books, just that it contains mostly English books. Given the raw size of their corpus this algorithmic solution seems like a reasonable tradeoff to me.
The frequency of “Forschungsgemeinschaft” is also easy enough to explain: The “Deutsche Forschungsgemeinschaft” (German Research Foundation) provides lots of funding for research in Germany. It is consequently mentioned in lots of research papers, many of which are written in English.
Since it is a research institute, one would expect tons of mentions of this word in _English_ scientific papers.
But that is an outlier, and given the vastness of the corpus those would have been sorted out.
That said, there would have been an easy way to filter non-English books out automatically: do a statistical analysis on each book (e.g. on letter frequency) and reject the ones that stray too far from the norm -- or send them to a secondary filtering stage, e.g. by word presence or verification by a human. Done carefully, that filtering would not harm the actual results at all (e.g. by presupposing a specific letter frequency), because it would only reject extreme outliers that would indeed be non-English works.
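A minimal sketch of that first-stage filter, assuming rough reference frequencies for a handful of common English letters (a real filter would cover all 26 and use a proper statistical test rather than a raw distance):

```python
from collections import Counter
import math

# Approximate English letter frequencies for a few common letters.
ENGLISH = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
           "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060}

def letter_profile(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: counts.get(c, 0) / total for c in ENGLISH}

def distance(text):
    """Euclidean distance between a text's letter profile and English norms."""
    prof = letter_profile(text)
    return math.sqrt(sum((prof[c] - f) ** 2 for c, f in ENGLISH.items()))

english_text = "the quick brown fox jumps over the lazy dog"
german_text = "zwoelf grosse boxkaempfer jagen viktor quer"

# The non-English sample strays further from the reference profile,
# so a threshold on this distance would flag it for a second look.
print(distance(german_text) > distance(english_text))  # True
```

Books past some distance threshold get rejected or escalated; everything near the norm passes through untouched, which is why the filter wouldn't distort the frequency counts it protects.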
It’s not a research institute. It gives grants to researchers (it’s the biggest organization in Germany giving research grants), it’s a foundation. As such it is also often mentioned in scientific papers. (“This study was funded in part by a grant from the …”)
* http://ktype.net/wiki/research:articles:progress_20110209#tw...
* http://ktype.net/wiki/research:articles:progress_20110228?s#...