How to Write a Spelling Corrector (norvig.com)
160 points by ashishgandhi on Oct 5, 2012 | 24 comments




Also submitted yesterday. That would explain the ? at the end of the URL.

http://news.ycombinator.com/item?id=4609321

Not complaining though; this article is definitely worth a read for those that missed it the first few times around.


A classic article and a fantastic introduction to data-intensive natural language processing. I remember implementing Norvig's spelling corrector to improve coverage in Zend_Search_Lucene back before we had moved to Solr at one of my old jobs. The cool part is that you don't have to use a general dictionary; you can use one that only includes the set of terms on your site -- thereby making your spelling corrector specifically targeted at your domain's language.
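
For instance, a minimal sketch of that idea (site_documents() here is a hypothetical iterable over the text of your own pages or records):

    import re
    from collections import Counter

    def words(text):
        return re.findall(r'[a-z]+', text.lower())

    # Train on your own content instead of a general corpus,
    # so suggestions stay inside your domain's vocabulary.
    WORDS = Counter()
    for doc in site_documents():
        WORDS.update(words(doc))

The rest of the corrector can then rank candidates against this domain-specific frequency table instead of a general one.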


Interestingly, this does not seem to be how Google does their spelling correction. Everything I've read implies that Google looks for human corrections in search queries and then extrapolates those into suggestions. In other words, "People who searched for 'speling' often click no results, but search for 'spelling' instead."
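
A toy sketch of that reformulation-mining idea (the log format and helper names here are entirely hypothetical):

    from collections import Counter, defaultdict

    # Hypothetical: reformulations is an iterable of (first_query, followup_query)
    # pairs mined from a search log, e.g. ("speling", "spelling") when a user
    # clicked nothing and immediately searched again.
    def build_rewrites(reformulations):
        rewrites = defaultdict(Counter)
        for first, second in reformulations:
            if first != second:
                rewrites[first][second] += 1
        return rewrites

    def suggest(rewrites, query):
        # "Did you mean": the correction people most often typed after this query.
        corrections = rewrites.get(query)
        return corrections.most_common(1)[0][0] if corrections else None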


An older treatment of the problem, which I personally find to be clearer:

Kernighan, M., Church, K., and Gale, W. (1990). "A Spelling Correction Program Based on a Noisy Channel Model," COLING, Helsinki, Finland. http://acl.ldc.upenn.edu/C/C90/C90-2036.pdf


Hey, just wanted to say thanks for this. It's very helpful to my projects. I had collected a number of historical papers on spell checking and I thought I had nearly all of Kernighan's Bell Labs papers, but I missed this one. And it is indeed very clear. Cheers.


While we are on the topic, can someone shed some light on why the Mac OS X spelling corrector is so inferior to the one used in Microsoft products? I'm a big fan of Apple overall, but I find myself turning to MS Word to write what I really care about. Whether in English or Spanish, Microsoft's corrector is light years ahead of Apple's.


The best I've used is Google's context-sensitive spell check in Docs. It only returns one answer and, in my experience, is almost always right.


Could someone please explain how this statement works:

> set(e2 for e1 in edits1(word) for e2 in edits1(e1))

I understand how you can say something like "f(x) for x in edits1(word)", but the way it's used above with multiple fors is making my brain hurt.


Something roughly equivalent to:

  s = set()
  for e1 in edits1(word):
      for e2 in edits1(e1):
          s.add(e2)


The one bit of sugar in Python that always throws me off. Since the expression (the innermost loop body) is written first in the comprehension, my intuition says the for clauses should read from the innermost loop outward. Nope!


It's not necessarily the innermost loop's variable that's written first:

    >>> [a for a in range(3) for b in range(2)]
    [0, 0, 1, 1, 2, 2]

It's just that the leading expression is evaluated and added to the result on every iteration of the innermost loop.

Once you keep in mind that the for clauses appear in the same order as in normal nested for loops, it's not so tricky. You can even indent them that way to make it more obvious.
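
For example, the expression from the article could be laid out to mirror the nested loops:

    set(e2
        for e1 in edits1(word)
        for e2 in edits1(e1))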



Another classic is his article on solving Sudoku puzzles.

http://norvig.com/sudoku.html

Though languages with built-in constraint systems have even nicer solutions (e.g., Oz).


This article was very useful for me once. I used a similar approach to normalize a database of search terms ('tags') for a big site.


That's very interesting; can you provide any more detail?


Yes. I had to evolve a very crude CMS the company was using, where all content contained a text field with search terms. Because the interface and the database didn't enforce any normalization, editors generated many variations of the same terms (e.g., "batman", "the batman", "batman the dark knight"). This caused obvious problems for information architecture.

This was solved with an algorithm similar to the one described in the OP, like a spelling corrector: for each piece of content in the CMS (e.g., an article), take its search terms and, using the entire database of search terms as a dictionary (adjusted for frequency), spit out the likely candidates as normalized "tags". That is, a blog post whose search terms were a string like "batman, batman the dark knight rises, batman dc comics" would be normalized to an array like ['batman', 'the dark knight rises', 'd.c. comics'].

I also used the same approach (Levenshtein distance) on a previous project, to normalize addresses in a real estate database. Yes, it's that useful :)
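
Roughly, the per-tag matching step might look like this (the tag counts, cutoff, and helper names are made up for illustration; a real pipeline would also split multi-term strings first):

    from collections import Counter

    # Hypothetical frequency-weighted dictionary built from terms already in the database.
    TAGS = Counter({'batman': 312, 'the dark knight rises': 87, 'dc comics': 45})

    def levenshtein(a, b):
        # Plain dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                    # deletion
                               cur[j - 1] + 1,                 # insertion
                               prev[j - 1] + (ca != cb)))      # substitution
            prev = cur
        return prev[-1]

    def normalize(raw_tag, max_dist=2):
        # Keep known tags within a small edit distance, then prefer the most frequent one.
        candidates = [(tag, n) for tag, n in TAGS.items() if levenshtein(raw_tag, tag) <= max_dist]
        return max(candidates, key=lambda c: c[1])[0] if candidates else raw_tag

    normalize('bat man')   # -> 'batman'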


In order to find all the words from a large collection that are within K distance of the searched word, you can use a trie:

http://blog.vjeux.com/2011/c/c-fuzzy-search-with-trie.html
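
A rough Python sketch of that idea: keep one row of the Levenshtein matrix per trie node, and prune any branch whose best possible cost already exceeds the allowed distance (max_cost below):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.word = None  # set on the node that ends a dictionary word

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def search(root, word, max_cost):
        results = []
        first_row = list(range(len(word) + 1))

        def recurse(node, ch, previous_row):
            # Compute the next Levenshtein-matrix row for this trie edge.
            row = [previous_row[0] + 1]
            for col in range(1, len(word) + 1):
                row.append(min(row[col - 1] + 1,                          # insertion
                               previous_row[col] + 1,                     # deletion
                               previous_row[col - 1] + (word[col - 1] != ch)))  # substitution
            if node.word is not None and row[-1] <= max_cost:
                results.append((node.word, row[-1]))
            if min(row) <= max_cost:                                      # prune hopeless branches
                for next_ch, child in node.children.items():
                    recurse(child, next_ch, row)

        for ch, child in root.children.items():
            recurse(child, ch, first_row)
        return results

    root = TrieNode()
    for w in ['spelling', 'spewing', 'spiling']:
        insert(root, w)
    search(root, 'speling', 1)   # -> all dictionary words within edit distance 1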



I found this article and was reading it just yesterday. What a coincidence! It's a great resource.


Can someone tell Evernote? Their corrections are notoriously bad.


Spelling error, second paragraph. Irony.


Ah, but every word in that paragraph is a correctly spelled word. The paragraph would sale right through the spelling checker.


Ha. Funny.



