A classic article and a fantastic introduction to data-intensive natural language processing. I remember implementing Norvig's spelling corrector to improve coverage in Zend_Search_Lucene at one of my old jobs, back before we had moved to Solr. The cool part is that you don't have to use a general dictionary; you can use one that only includes the set of terms on your site -- thereby making your spelling corrector specifically targeted at your domain language.
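Along those lines, here's a minimal sketch of Norvig's approach fed with a domain-specific vocabulary instead of a general dictionary; `site_terms` is a made-up stand-in for whatever term list you'd extract from your own site:

from collections import Counter

# Hypothetical stand-in for the terms harvested from your own site.
site_terms = ["batman", "spelling", "corrector", "lucene", "solr"]
WORDS = Counter(w.lower() for w in site_terms)

def edits1(word):
    "All strings one edit (delete, transpose, replace, insert) away from word."
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    "Most frequent known term within one edit; fall back to the input."
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct("batmann"))  # 'batman'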
Interestingly, this does not seem to be how Google does their spelling correction. Everything I've read implies that Google looks for human corrections in search queries and then extrapolates those into suggestions. In other words, "people who searched for 'speling' often clicked no results, but then searched for 'spelling' instead."
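I don't know Google's actual pipeline, but the basic log-mining idea described here could look roughly like this (the log format, field names, and support threshold are all assumptions for illustration):

from collections import Counter, defaultdict

# Hypothetical log entries: (user_id, query, result_count), in time order.
def mine_corrections(log, min_support=3):
    "Pair a zero-result query with the next query from the same user that did return results."
    by_user = defaultdict(list)
    for user, query, results in log:
        by_user[user].append((query, results))
    pair_counts = Counter()
    for entries in by_user.values():
        for (q1, r1), (q2, r2) in zip(entries, entries[1:]):
            if r1 == 0 and r2 > 0 and q1 != q2:
                pair_counts[(q1, q2)] += 1
    # Keep only pairs seen often enough to be plausible corrections.
    return {bad: good for (bad, good), n in pair_counts.items() if n >= min_support}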
An older treatment of the problem, which I personally find to be clearer:
Kernighan, M., Church, K., and Gale, W. (1990). "A Spelling Correction Program Based on a Noisy Channel Model." COLING, Helsinki, Finland.
http://acl.ldc.upenn.edu/C/C90/C90-2036.pdf
Hey, just wanted to say thanks for this. It's very helpful to my projects. I had collected a number of historical papers on spell checking and I thought I had most all of Kernighan's Bell Lab papers, but I missed this one. And it is indeed very clear. Cheers.
While we are on the topic, can someone shed some light on why the Mac OS X spelling corrector is so inferior to the one used in Microsoft products? I'm a big fan of Apple overall, but I find myself turning to MS Word to write what I really care about. Whether in English or Spanish, Microsoft's corrector is light years ahead of Apple's.
The one bit of sugar in Python that always throws me off. Since the innermost part (the expression) is written first in the comprehension, my intuition says each for clause I add should move outward to the next higher-level loop. Nope!
It's not necessarily the inner argument that's written first:
>>> [a for a in range(3) for b in range(2)]
[0, 0, 1, 1, 2, 2]
It's just that on every iteration of the innermost loop, the expression at the front is evaluated and added to the result.
Once you keep in mind that the for clauses appear in the same order as in normal nested for loops, it's not so tricky. You can even indent them that way to make it more obvious.
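For what it's worth, the comprehension above unrolls to nested loops with the for clauses in the same order, and indenting the clauses mirrors that structure:

# Equivalent nested loops for [a for a in range(3) for b in range(2)]:
result = []
for a in range(3):
    for b in range(2):
        result.append(a)
# result == [0, 0, 1, 1, 2, 2]

# Indenting the clauses makes the ordering explicit:
result = [a
          for a in range(3)
          for b in range(2)]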
Yes. I had to evolve a very crude CMS the company was using, where all content contained a text field with search terms. Because the interface and the database didn't enforce any normalization, editors generated many variations of the same terms (e.g., "batman", "the batman", "batman the dark knight"). This caused obvious problems for information architecture.
This was solved with an algorithm similar to the one described in the OP, like a spelling corrector: for each piece of content in the CMS (e.g., an article), take its search terms and, using the entire database of search terms as a dictionary (adjusted for frequency), spit out the likely candidates as normalized "tags". That is, for a blog post whose search terms were a string like "batman, batman the dark knight rises, batman dc comics", it would be normalized to an array like ['batman', 'the dark knight rises', 'd.c. comics'].
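Not the actual code, but a simplified sketch of that kind of normalization, using difflib string similarity instead of a full frequency-weighted edit model; the term list and the 0.6 cutoff are illustrative assumptions:

import difflib

# Hypothetical dictionary of known terms taken from the whole database;
# a fuller version would also weight candidates by frequency, as described above.
known_terms = ["batman", "the dark knight rises", "d.c. comics", "superman"]

def normalize(raw):
    tags = []
    for term in (t.strip().lower() for t in raw.split(",")):
        # Keep the closest known term, or fall back to the raw term.
        matches = difflib.get_close_matches(term, known_terms, n=1, cutoff=0.6)
        tags.append(matches[0] if matches else term)
    return tags

print(normalize("batman, batman the dark knight rises, batman dc comics"))
# ['batman', 'the dark knight rises', 'd.c. comics']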
I also used the same approach on a previous project, to normalize addresses in a real estate database (Levenshtein distance). Yes, it's that useful :)
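And for reference, a minimal Levenshtein distance (the metric mentioned above), using the standard dynamic-programming recurrence:

def levenshtein(a, b):
    "Minimum number of single-character edits turning a into b."
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("123 Main Stret", "123 Main Street"))  # 1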
http://news.ycombinator.com/item?id=10967 (2005 days ago!)
http://news.ycombinator.com/item?id=42587
http://news.ycombinator.com/item?id=665544
http://news.ycombinator.com/item?id=2034981