A classic article and a fantastic introduction to data-intensive natural language processing. I remember implementing Norvig's spelling corrector to improve coverage in Zend_Search_Lucene at one of my old jobs, back before we had moved to Solr. The cool part is that you don't have to use a general dictionary; you can use one that only includes the set of terms on your site -- thereby making your spelling corrector specifically targeted at your domain language.
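Along those lines, here's a minimal sketch of Norvig's approach fed with a domain-specific vocabulary instead of a general dictionary; `site_terms` is a made-up stand-in for whatever term list you'd extract from your own site:

from collections import Counter

# Hypothetical stand-in for the terms harvested from your own site.
site_terms = ["batman", "spelling", "corrector", "lucene", "solr"]
WORDS = Counter(w.lower() for w in site_terms)

def edits1(word):
    "All strings one edit (delete, transpose, replace, insert) away from word."
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    "Most frequent known term within one edit; fall back to the input."
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct("batmann"))  # 'batman'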
Interestingly, this does not seem to be how Google does their spelling correction. Everything I've read implies that Google looks for human corrections in search queries and then extrapolates those into suggestions. In other words, "people who searched for 'speling' often clicked no results, but then searched for 'spelling' instead."
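I don't know Google's actual pipeline, but the basic log-mining idea described here could look roughly like this (the log format, field names, and support threshold are all assumptions for illustration):

from collections import Counter, defaultdict

# Hypothetical log entries: (user_id, query, result_count), in time order.
def mine_corrections(log, min_support=3):
    "Pair a zero-result query with the next query from the same user that did return results."
    by_user = defaultdict(list)
    for user, query, results in log:
        by_user[user].append((query, results))
    pair_counts = Counter()
    for entries in by_user.values():
        for (q1, r1), (q2, r2) in zip(entries, entries[1:]):
            if r1 == 0 and r2 > 0 and q1 != q2:
                pair_counts[(q1, q2)] += 1
    # Keep only pairs seen often enough to be plausible corrections.
    return {bad: good for (bad, good), n in pair_counts.items() if n >= min_support}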
An older treatment of the problem, which I personally find to be clearer:
Kernighan, M., Church, K., and Gale, W. (1990). "A Spelling Correction Program Based on a Noisy Channel Model." COLING, Helsinki, Finland.
http://acl.ldc.upenn.edu/C/C90/C90-2036.pdf
Hey, just wanted to say thanks for this. It's very helpful to my projects. I had collected a number of historical papers on spell checking and I thought I had most all of Kernighan's Bell Lab papers, but I missed this one. And it is indeed very clear. Cheers.
While we are on the topic, can someone shed some light on why the Mac OS X spelling corrector is so inferior to the one used in Microsoft products? I'm a big fan of Apple overall, but I find myself turning to MS Word to write what I really care about. Whether in English or Spanish, Microsoft's corrector is light years ahead of Apple's.
The one bit of sugar in Python that always throws me off. Since the innermost part (the expression) is written first in the comprehension, my intuition says each for clause I add should move outward to the next higher-level loop. Nope!
It's not necessarily the inner argument that's written first:
>>> [a for a in range(3) for b in range(2)]
[0, 0, 1, 1, 2, 2]
It's just that on every iteration of the innermost loop, the expression at the front is evaluated and added to the result.
Once you keep in mind that the for clauses appear in the same order as in normal nested for loops, it's not so tricky. You can even indent them that way to make it more obvious.
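For what it's worth, the comprehension above unrolls to nested loops with the for clauses in the same order, and indenting the clauses mirrors that structure:

# Equivalent nested loops for [a for a in range(3) for b in range(2)]:
result = []
for a in range(3):
    for b in range(2):
        result.append(a)
# result == [0, 0, 1, 1, 2, 2]

# Indenting the clauses makes the ordering explicit:
result = [a
          for a in range(3)
          for b in range(2)]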
Yes. I had to evolve a very crude CMS the company was using, where all content contained a text field with search terms. Because the interface and the database didn't enforce any normalization, editors generated many variations of the same terms (e.g., "batman", "the batman", "batman the dark knight"). This caused obvious problems for information architecture.
This was solved with an algorithm similar to the one described in the OP, like a spelling corrector: for each piece of content in the CMS (e.g., an article), take its search terms and, using the entire database of search terms as a dictionary (adjusted for frequency), spit out the likely candidates as normalized "tags". That is, for a blog post whose search terms were a string like "batman, batman the dark knight rises, batman dc comics", it would be normalized to an array like ['batman', 'the dark knight rises', 'd.c. comics'].
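Not the actual code, but a simplified sketch of that kind of normalization, using difflib string similarity instead of a full frequency-weighted edit model; the term list and the 0.6 cutoff are illustrative assumptions:

import difflib

# Hypothetical dictionary of known terms taken from the whole database;
# a fuller version would also weight candidates by frequency, as described above.
known_terms = ["batman", "the dark knight rises", "d.c. comics", "superman"]

def normalize(raw):
    tags = []
    for term in (t.strip().lower() for t in raw.split(",")):
        # Keep the closest known term, or fall back to the raw term.
        matches = difflib.get_close_matches(term, known_terms, n=1, cutoff=0.6)
        tags.append(matches[0] if matches else term)
    return tags

print(normalize("batman, batman the dark knight rises, batman dc comics"))
# ['batman', 'the dark knight rises', 'd.c. comics']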
I also used the same approach on a previous project, to normalize addresses in a real estate database (Levenshtein distance). Yes, it's that useful :)
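And for reference, a minimal Levenshtein distance (the metric mentioned above), using the standard dynamic-programming recurrence:

def levenshtein(a, b):
    "Minimum number of single-character edits turning a into b."
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("123 Main Stret", "123 Main Street"))  # 1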
http://news.ycombinator.com/item?id=10967 (2005 days ago!)
http://news.ycombinator.com/item?id=42587
http://news.ycombinator.com/item?id=665544
http://news.ycombinator.com/item?id=2034981