A Spellchecker Used to Be a Major Feat of Software Engineering (dadgum.com)
94 points by tchalla on Oct 11, 2012 | 70 comments



I wrote a spellchecker in the '80s for the Commodore 64. Step 1 was writing an efficient data transfer algorithm to move the bits off the disk so that the 1541 drive could hit the sector interleave (it would miss it using its built-in data transfer algorithm). This sped up IO by 500%. But for reference, that was still only about 25 kbps, or a bit slower than a 28.8kbps modem. Step 2 was coming up with a data compression algorithm that exploited the distribution of the top 30 or so 2/3-grams in English. I had a dictionary of about 80K words.

I can confirm this is on the order of several months worth of work :)


This seems fascinating. Can you share a more detailed description?

Like, what kind of data compression algorithm did you choose at that time, and what were the alternatives?

How did the techniques you developed at that time help you later, and what significance/standing do they have in today's programming world?

How did your spell checker turn out in terms of performance and memory?

Were there better implementations around at that time? Which one was widely accepted as the best one?

Thanks in advance!


Happy to. I was 15 at the time, so I had limited knowledge of algorithms. So, I invented my own :)

IIRC I packed each character into 6 bits, using the first 26 values for A-Z and the rest of the values for the common 2-3 grams that I found in the back of a dictionary in the local library. I had some other scheme for building words from stem + suffixes (e.g. flood, flooded).
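
Roughly the shape of the packing, sketched in modern Python - the n-gram table here is made up for illustration (the real one came from frequency counts), and a 6-bit alphabet leaves room for up to 38 such codes after A-Z:

    # Sketch: pack words as 6-bit codes, where 0-25 are A-Z and the remaining
    # codes stand for common 2/3-grams. The table below is illustrative only.
    NGRAMS = ['TH', 'ER', 'ON', 'AN', 'ING', 'ED', 'ENT', 'ION']  # up to 38 entries fit

    def tokenize(word):
        """Greedily replace known n-grams, then single letters, with 6-bit codes."""
        word = word.upper()
        codes, i = [], 0
        while i < len(word):
            for j, ng in enumerate(NGRAMS):
                if word.startswith(ng, i):
                    codes.append(26 + j)          # n-gram codes start after A-Z
                    i += len(ng)
                    break
            else:
                codes.append(ord(word[i]) - ord('A'))
                i += 1
        return codes

    def pack(codes):
        """Pack a list of 6-bit codes into bytes."""
        bits = ''.join(format(c, '06b') for c in codes)
        bits += '0' * (-len(bits) % 8)
        return bytes(int(bits[k:k + 8], 2) for k in range(0, len(bits), 8))

    print(pack(tokenize('flooding')))   # 5 bytes instead of 8 ASCII characters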

What I learned the most from that experience was mostly around reverse-engineering and perseverance. I had to figure out what the 1541 drive did by disassembling its ROM and following the boot sequence. From comparing it to a 2040 drive (an older dual-drive variant that was much more expensive - something like $2500 in 1981 dollars) I knew quite a bit about the sector interleave spacing - the disk could do sustained 2-sector reads per revolution. So I knew there was room for improvement if I could just get the transfer rate between the disk and the C-64 to go faster. I did that by improving how it transferred bits down the wire. To cut costs, Commodore had used a serial interface vs. the parallel IEEE-488 interface used in the 2040 / Commodore PETs of the previous generation. They had a super inefficient algorithm where they did a handshake for every bit that went down the wire. This was due to the VIC chip in the C-64 stealing cycles from the 6510 CPU whenever it needed to read from shared RAM. Since they weren't clever enough to figure out when the VIC chip would steal cycles, they did a handshake for every bit.

I improved this by asynchronously transferring 2 bits at a time using the clock and the data lines for ~3 bytes. I could read a register on the VIC chip that would tell me what scan line was being processed and from that I could tell when the VIC chip would steal the bus. Also on the vertical flyback I would be able to transfer bits with impunity.

This speed boost made it possible for me to hit the interleave on the disk and get the transfer back up to ~25Kbps.

These were pretty low level hacks (that worked quite well) to get things to go fast. Not sure how any of that transfers to the things we do today. One thing's for sure, I don't think I'll ever understand a modern computer as well as I understood those computers of my childhood.

The spell checker was completely IO bound. The 1541 stored ~170KB of data. The words on disk were sorted, and I generated a unique sorted list of words in memory. It was, at the time, the fastest spell checker for the Commodore 64, and it was packaged with a popular word processor of the era called WordPro 64. Most of the other spell checkers had much smaller dictionaries and took a LONG time to read the disk. I could read 170KB in something like 70s, which, using my data transfer improvements, was 5x faster than any other spell checker at the time. I'm pretty sure I had more words in my dictionary too. But things are a bit fuzzy - that was nearly 30 years ago!


One of the things that came to mind while reading the article is that spellcheck was a separate operation (F7 by default in WordPerfect), so the problem wasn't quite as difficult as autocorrect and auto-underlining have conditioned me to think it is.

Or, to put it another way, waiting a few seconds for access to a disk during spell check wasn't as big a deal as it would be today. Particularly since so much of what I type is now intended for instant publication. Twenty to thirty years ago, most of what got spellchecked were items that were to be printed...and mailed using snails.


The C=64 had 64KB of RAM. The 1541 disk could store 170KB on a side. Since the serial link between the two was the bottleneck, and both the computer and disk drive had the same 6502 CPU, it seems like the optimal performance would be achieved by having the computer send the document to the disk drive, and having the disk drive do all the actual work.

Implementing a spell-checker this way would probably take many more months of work. :-)


The problem was that the 6504 in the disk drive only had something like 2KB of RAM. It was better to transfer the data to the C-64.


A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python ... That's progress.

That's progress in terms of cheap RAM and cheap CPU cycles. In terms of software architecture, that's not progress -- that's brute force.

Assuming that code for efficiently indexing and compressing this kind of dictionary was written 25 years ago, there doesn't seem to be any good reason not to reuse it. The fact that there's impedance to doing so means that our languages and our tools still need improving.

(25 years ago I'd have predicted that in 2012 some kind of AI-driven optimizer could have figured out the correct data structures for this problem and automatically converted the naive hash table lookup into a more efficient structure)

Also, jeesh, use mmap() and a binary search or something.
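
For what it's worth, a minimal sketch of the mmap-plus-binary-search idea in Python (assumes the word list is sorted in plain byte order, which /usr/share/dict/words roughly is - close enough for illustration):

    # Binary search over a sorted, newline-delimited word file via mmap,
    # without reading the whole file into memory. Illustrative only.
    import mmap

    def in_dict(path, word):
        target = word.encode()
        with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            lo, hi = 0, len(m)          # invariant: lo is always a line start
            while lo < hi:
                mid = (lo + hi) // 2
                nl = m.rfind(b'\n', lo, mid)
                start = nl + 1 if nl != -1 else lo   # start of the line containing mid
                end = m.find(b'\n', start, hi)
                if end == -1:
                    end = hi
                line = m[start:end]
                if line == target:
                    return True
                if line < target:
                    lo = end + 1
                else:
                    hi = start
            return False

    print(in_dict('/usr/share/dict/words', 'spell'))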


> Assuming that code for efficiently indexing and compressing this kind of dictionary was written 25 years ago, there doesn't seem to be any good reason not to reuse it.

What if the efficient code is extremely "clever" and hard to understand from reading the code (and thus hard to fix bugs, add new features, etc.)? Doesn't it make sense to revert to the simpler and technically less efficient version if you know that the hardware will be more than enough?

I suppose you could still argue that the more efficient version is better, at least once it is mature and packaged in some library or framework.


What makes sense is creating a spell checker library, and then having other programs use that instead of implementing spell check themselves.

Granted, if your program is such that a naive, 3-5 line Python implementation is sufficient, then it is probably not worth the effort to use a library. But if my web browser, word processor, e-mail client, and every other large program that has text editing implement spell check themselves, I doubt I would get performance comparable to what I'd get if they all used a library designed specifically for spell check and optimized by the many projects that have used it over its history. On that subject, I would be interested to know whether most of the large programs I mentioned actually implement spell check themselves.


No


Well, nothing prevents this from being 3-5 lines of Python with more suitable data structures:

    $ pip install dawg
and then

    import dawg
    words = open('/usr/share/dict/words', 'r').read().splitlines()
    d = dawg.DAWG(words)
(this example actually works)


Let's not store the whole file in memory at once!


I think the point of using data structures like DAWG is to reduce memory consumption to the point where it is feasible to store the whole dataset in memory.

Practical DAWG application would be the following anyway:

    import dawg
    d = dawg.DAWG().load('words.dawg')
because DAWG minimization may require a lot of memory.


Ok.

    import dawg
    with open('/usr/share/dict/words','r') as words:
        d = dawg.DAWG(w.strip() for w in words)


a minor remark: with the current `dawg` module implementation this should be

    import dawg
    with open('/usr/share/dict/words','r') as f:
        words = (w.strip() for w in f)
        d = dawg.DAWG(words, input_is_sorted=True)


I assume GP meant, don't store all the words in memory at once.


There is an AI-driven optimizer in play[1], but what gets optimised has changed. What is far more important now is optimising programmer time, time to market, testability, etc. Using less CPU and RAM is largely irrelevant until the usage is actually noticeable by the user - taking 50 milliseconds instead of 70 milliseconds to do a spell check is not something any user can even detect. It would be a huge waste of programmer time and tooling budget to focus on that instead of all the other things that need to be developed and made available to users.

[1] The optimizer is a human brain, or a collection of them if this isn't a lone wolf project


The advent of mobile computing has finally dispelled the myth that processor speed solves all known problems. Maximizing idle time still matters.


Time to market and developer productivity are still at the forefront. This is why you repeatedly see apps being released, then user outcry, and then later revisions trying to address those concerns (battery, CPU, memory, responsiveness, etc.). I doubt anyone actually maximizes idle time; rather, they decrease user complaints and improve acceptance. There is no need to maximize it if users don't notice.


In line with what baddox says, if the tradeoffs that were made 25 years ago don't match what you're willing to trade off today, it is normal to rewrite it. I18n support, better external library support, or system integration could be among the upsides of rewriting it.


I wonder if someday in 30 years people will look back in the same way and say: wow, you know there was once a time when search engines were major feats of engineering - and Decentralized Networks were deemed impractical!

They, just like us, were resource-limited. The modern constraints have merely shifted. They came up with amazing tricks to get around physical bottlenecks - just as we do now. We tend to call our current foci engineering instead of 'bags of tricks/hacks'. But when technology progresses till we take things for granted that were once hard, we young ones look back with a mix of disbelief and awe that people used to go through so much trouble for something that is now so trivial.


I think search engines will always be major feats of engineering - the big difference between a spell-checker and a search engine is in the former the problem space rarely evolves.

An English dictionary, for example, only has a small number of words added each year. A search engine is a different thing entirely: not only are new items added all the time, but the format of those items evolves. Today search engines need to deal with non-textual formats such as videos, images, and audio. This is a very difficult problem.

The fact that there are now APIs and libraries that let developers use things like latent semantic mapping without having to roll their own algorithms and solutions can certainly make writing an intelligent textual search engine a lot easier than it used to be.

But the key difference here is that what we demand from a 'search engine' (and by this I mean Google, or Bing) ten years from now will be much greater than what we demand now. We'll have new media to search, new types of queries, new expectations. But a spell checker will always be a spell checker.

At least, that's my opinion. I'm sure someone will disagree!


Sure, I'll disagree :) When you have enough memory bandwidth to compute the relationships of all pages in just a few seconds, then you can re-index the entire corpus according to whatever criteria you want (PageRank or face detection or whatever) for each query. So you have some spiders that add content to the database, and re-index the database for every query.


Good point - I don't know if that solves the problem of getting different forms of media into the system in the first place, though. I think as long as people create new types of media there will be interesting problems in searching them. Searching video based on text queries, for example, is a much harder problem than simple text search!


That assumes that memory bandwidth grows faster than sqrt(#pages)...


I would like to know some of the details of the algorithmic tricks that were employed to solve this problem. As I was reading, I was thinking of a tree structure encoding the dictionary with each letter (of a valid word) being a node. If the next letter isn't found, or if you reach a leaf with letters left in the candidate word, then you've misspelled something.

Does anyone have an idea of how much compression could be achieved through such a structure?
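For concreteness, the structure I have in mind looks something like this in Python (toy sketch, nested dicts as nodes):

    # A toy trie: each node is a dict from letter -> child node; the special
    # key '$' marks the end of a valid word.
    def build(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node['$'] = True
        return root

    def check(root, word):
        node = root
        for ch in word:
            if ch not in node:
                return False     # fell off the tree: misspelled
            node = node[ch]
        return '$' in node       # must end exactly on a word boundary

    t = build(['cat', 'cats', 'dog'])
    print(check(t, 'cats'), check(t, 'ca'))   # True False

Shared prefixes are stored once, though the per-node overhead of a structure like this in a high-level language can easily outweigh the savings.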


You're describing a trie, http://en.wikipedia.org/wiki/Trie


Multiple software engineers independently arriving at the same obvious solution? Sounds like something somebody would try to patent today...


Thankfully many of the novel solutions that were created in the early days weren't created by people that just wanted a buck.


Thanks for the link!

I guess he did say that in the article, but "trie" wasn't part of my algorithmic vocabulary...until now.


Taking that to the next level, you can improve that data structure to allow sharing of the later similarities in words after they've diverged from their common prefixes. That's particularly helpful for cutting down on redundant storage of suffixes like -ing, -ed, -ly, etc.

The directed acyclic word graph (DAWG) is a popular approach to that: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph


It's not quite the same thing but this problem strongly reminded me of Chapter 1 of Programming Pearls: http://www.cs.bell-labs.com/cm/cs/pearls/cto.html


Tries are the simplest method for dictionaries. However, a trie limits the search to forward-only (prefix) matching. For more sophisticated dictionary searches there is the DAWG:

http://en.wikipedia.org/wiki/Directed_acyclic_word_graph

http://www.pathcom.com/~vadco/cwg.html (a highly compressed DAWG derivative)


Does a DAWG really provide more sophisticated searches than a trie, given that it is basically a minimized automaton built from the trie?


The compression of a trie may not be the best because of the node storage requirement. I've benchmarked the memory usage of some Python trie libraries here: http://kmike.ru/python-data-structures/

I also tried to compress "/usr/share/dict/words" with https://github.com/kmike/DAWG and https://github.com/kmike/marisa-trie Python libraries:

    import dawg
    import marisa_trie

    words = open('/usr/share/dict/words', 'r').read().splitlines()

    dawg.DAWG(words).save('words.dawg')
    marisa_trie.Trie(words).save('words.trie')
The result is a bit surprising:

    1220612 words.dawg
    743128  words.trie
Please note that MARISA-trie is not a classic trie; it is a smart & crazy recursive trie (something like a DAWG-trie hybrid). By the way, I was expecting DAWG to perform much better; for my data (5 million Russian words) the DAWG compression was much more impressive.


Good guess! That's the trie that he mentions. I would have also approached it that way.


Given that the set of valid words in a document is (by definition) a subset of the words in the dictionary, I'd imagine you're better off holding the document in memory, building a data structure from that, and then scanning the dictionary on disk to find which of the document's words aren't in it.
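
As a rough sketch of what I mean (illustrative only - stream the dictionary once rather than loading it):

    # Hold the document's unique words in memory, stream the dictionary from
    # disk once, and whatever is left over is misspelled.
    import re

    def misspelled(document_text, dict_path='/usr/share/dict/words'):
        unknown = set(re.findall(r"[a-z]+", document_text.lower()))
        with open(dict_path) as f:
            for line in f:
                unknown.discard(line.strip().lower())
                if not unknown:
                    break
        return unknown

    print(misspelled("The quick brown fox jumpd over the lazy dog"))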


It is called a Trie or a Prefix Tree, and has been (re)invented many times.


That's optimal for 100% accuracy. You can get much lower memory usage using Bloom filters (basically, hashing words into a small range of hashcodes, expecting collisions, and storing a mask of all the hashcodes that correspond to words), at the expense of false positives (thinking a misspelled word is correct). This was a common solution in the 80s.
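
A minimal sketch of the idea (the table size and hash count here are arbitrary, just to show the mechanics):

    # Minimal Bloom filter: K hashes set K bits per word; lookup answers
    # "definitely absent" or "probably present" (false positives possible).
    import hashlib

    SIZE, K = 1 << 20, 4          # ~1M bits (128KB) and 4 hash functions, arbitrary
    bits = bytearray(SIZE // 8)

    def _positions(word):
        for i in range(K):
            h = hashlib.md5(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(h[:4], 'big') % SIZE

    def add(word):
        for p in _positions(word):
            bits[p // 8] |= 1 << (p % 8)

    def probably_correct(word):
        return all(bits[p // 8] & (1 << (p % 8)) for p in _positions(word))

    for w in open('/usr/share/dict/words'):
        add(w.strip().lower())
    print(probably_correct('spell'), probably_correct('spel'))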


Actually I tend to disagree. Yeah, you can do that for a hobby project.

I'm working in OCR / ICR here. We often depend on dictionaries to fix recognition errors. Say you want to correct names, for example, by creating a more or less complete list of all names (we're assuming you know all the names in advance, for example for customers). The dictionary is huge.

"Looking up a word in this hash table dictionary is a trivial expression, one built into the language. And that's it."

Well, no. A spellchecker that just answers "I don't know that word" would be crap. You need to find similar words. Which isn't part of any standard library I am familiar with. In addition it's still a challenge to find all candidates that have a levenshtein distance (or LD or .. whatever metric you like) of N from the input, for N > 2 or 3.

Think addresses. You have a list of all streets, which tend to be rather long (Germany: "Strasse" = street; the pattern "<name> Strasse" is common, so let's assume you expect 12-14+ chars). The longer the expected input, the higher I'd set the tolerance for errors. Finding all matches with 3+ substitutions/deletions/additions over a large dataset, in a small amount of time, is still interesting and a challenge.
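
For reference, the brute-force baseline is just the classic dynamic-programming distance run against every entry, which is exactly what doesn't scale to these dictionary sizes (sketch, with made-up street names):

    # Brute force: compute Levenshtein distance to every dictionary entry.
    # Fine for small lists, hopeless for huge street/name dictionaries.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def candidates(word, dictionary, max_dist=3):
        return [w for w in dictionary if levenshtein(word, w) <= max_dist]

    print(candidates('Hauptstrase', ['Hauptstrasse', 'Bahnhofstrasse'], 2))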

In fact, we partner with one company that provides nothing but exactly this, as their single (well-known, respected) product.


You should probably also go Bayesian on word frequencies for the suggestion ordering. Maybe also on error types (missing letter, repeated letter, nearby letter on a standard keyboard).


Your comment and the reply to it are good if we're talking spell checker software for stuff people type.

What I have to deal with as well are recognition errors. From trivial replacements (I/l/1,C/G,i/j,P/R etc.) to 'engine said that there's some character here according to the character segmentation results, but it has no damn clue what character it could be' that's a different look at the same problem.

Also, for typed content I see quite a lot of shortcuts. SwiftKey for Android, nice as it is, seems to be too braindead to correct words that _start_ with the wrong character (if I type vharacter, with just one single slip at the start, it fails miserably and looks baaad imho). Assumptions like 'just check adjacent characters on the keyboard (special points for assuming a keyboard layout)' or 'ah, the user is certainly starting the jnput correctly and just might misspress a letter in the middle of a word' are nice models to make the developer's life easier, but they don't work for a good number of cases as well..


I used soundex and metaphone to correct spelling mistakes for a project. My assumption: that most spelling mistakes are people trying to sound out the spelling and not simple typos. For the limited case I was solving (mostly Biblical names) it worked quite well.
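For anyone curious, a simplified Soundex looks roughly like this (this sketch leaves out the real algorithm's edge cases around H/W between same-coded consonants):

    # Simplified Soundex: keep the first letter, map remaining consonants to
    # digit classes, collapse runs, drop vowels, pad/truncate to 4 characters.
    CODES = {c: d for d, letters in enumerate(
        ['BFPV', 'CGJKQSXZ', 'DT', 'L', 'MN', 'R'], 1) for c in letters}

    def soundex(word):
        word = word.upper()
        out, prev = word[0], CODES.get(word[0], 0)
        for c in word[1:]:
            code = CODES.get(c, 0)
            if code and code != prev:
                out += str(code)
            prev = code
        return (out + '000')[:4]

    print(soundex('Isaiah'), soundex('Isaia'))   # same key for both spellings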


"That's progress."

Yes, but not in software. It's the hardware that's improved. So the sw can be sloppy and inefficient and most people won't notice.

Hardware has improved. Software has not. This is hard for some folks to accept. But it is a plain truth. Take away the added power of new hardware and what's left? You wouldn't want to know.

It is easier to do the same basic computing tasks as in 1984, with less effort, but it's not easier to exceed what you could do in 1984 because you have better software. It's because you have better hardware and can afford to be sloppy. 2-3 lines of perl/Python but how many lines are in the binary or the libraries? (e.g. what if you discovered there was gratuitous use of backtracking in _all_ perl's regex, even for the simplest matching; who would care?) There is no Moore's Law for software. To put it another way, most of the computing tasks being done today are the same ones (albeit on a different scale) and use the same methods (or even lazier ones) as in 1984. Hence the example of spellcheck. It's one of many tasks performed in 1984 and today and it will be performed in the future.

It's easy to mistake advances in hardware (more power) for advances in software (novel and more efficient approaches to problems).


Trying to look at how software performance has improved in the way you are looking at it doesn't really work. There is no point putting time into optimising things that run fast enough. There are many algorithms that have been improved dramatically. People use this kind of matching for DNA; nobody is busy creating slower algorithms - they are only getting faster (e.g. www-igm.univ-mlv.fr/~lecroq/articles/cpm2009fl.pdf). Prime factorisation, optimisation, etc. are all being actively researched.

In other areas people are doing things which weren't even possible on old hardware, think about 3D graphics. Real time lighting approximations are doing really clever stuff, just look at the latest unreal engine demo. If you put these into old hardware you would get the same quality render in less time. Modern ray tracing engines are using more sophisticated algorithms, giving better results in less time.

Beyond the algorithmic work, optimising for modern hardware is different from optimising for old hardware. An optimal program for old hardware isn't optimal for modern hardware and vice versa. Now, if a program needs data from RAM, you have to wait hundreds of CPU cycles. Even accessing the L3 cache, which sits on the processor chip itself, takes about 75 cycles on modern CPUs. So you can recalculate things and find your program runs faster.

Similar things happen with branches. I was looking at some GPU shaders recently and there was a simple raytracing loop (parallax occlusion mapping); the obvious way to do it is to stop when you intersect the surface. Instead, it is actually faster to always run the loop for 10 steps, because this removes the branch. So you do twice as many loop iterations on average, but your code runs faster.

Of course you can afford to be sloppy as well. People are 'sloppy' so that their code is faster to write, easier to read, and less likely to have bugs (3 lines of Python is pretty likely to be easy to read and bug-free). I would say software is advancing in many different ways, which hardware has enabled.


The numerous references to hardware in your comment only serve to illustrate my point.

Of course there have been some new algos and improvements to existing ones since 1984. But nothing approaching a Moore's Law. The dramatic changes the blog post highlighted are due to hardware, not software.

Take away the hardware advances (hold the hardware as a constant), then look at software as the variable. Then we can have a meaningful discussion of software improvements over time. IOW, run new software on old hardware.

To measure advances in hardware, hold software as the constant. IOW, run old software on new hardware.

If both are variables, it's difficult to assess how much software has improved on its own, without the benefit of new hardware.

Linus Torvalds' comments on the future of RISC, CPU instructions, and compatibility versus "cool new features" in the /. interview were interesting.


Just because a problem has been "solved" for one platform (Fast PC's w/ Gigabytes of ram/disk) doesn't mean it is solved for all platforms (Web apps for phones/tablets, etc).

John Resig (of jQuery and Khan Academy fame) had a really interesting thread on creating a fast-loading (i.e. small, efficient) dictionary for a JavaScript app. http://ejohn.org/blog/revised-javascript-dictionary-search/

The comments have some really clever ideas for encoding the dictionary and Resig weighs in on his favorite solution with informative data. There's an interesting "Succinct Trie" data structure that turns a 620K dictionary into a 220K string that doesn't need to be decoded to be searched. http://stevehanov.ca/blog/index.php?id=120


LOL, the URL is: http://prog21.dadgum.com/29.html?repost

That might explain why it's so familiar.



> For reference, on my MacBook, the standard dictionary in /usr/share/dict/words is 2,486,813 bytes

The Polish dictionary is about 3,500,000 /words/; many more are generated by the affix file. From some archaeology I did a while ago: around 1997 there were custom scripts written by the creators/maintainers of the Polish ispell dictionary because, as the files note, using the standard toolset would have required over 1 GB of RAM (this was in 1997, mind you).


"Writing a spellchecker in the mid-1980s was a hard problem. Programmers came up with some impressive data compression methods in response to the spellchecker challenge."

It's still a hard problem. Most working programmers today couldn't write a good spellchecker if their life depended on it.

Sure, your average coder can load the dictionary file into RAM in a hash table (built by someone else), and a few of those could maybe use a library implementing a Trie or a DAWG (built by someone else)...and a percentage of those might even understand what they're doing well enough to know how to use those tools to make things better. But once you've got the data into memory, what do you do with it? About the most anyone can do is point to Peter Norvig's blog post.

Sadly, very few people understand the theory behind it well enough to make improvements. So for everyone else, it's a major feat of software engineering -- on a bigger computer.


Sure, it's still a hard problem, like carpentry is a hard problem. I've never had to do it because it's been solved for 20 years, but I'm fairly sure I'd get great at it if it were my day job.


loading data into a hash table is a solved problem. coming up with clever ways of solving problems that are beyond the scale of today's computers is the challenging, interesting part.


Writing a spellchecker is still a major feat of software engineering. Loading up some words into a hash table does not a spellchecker make.


I wouldn't call it /major/, but a good one is still more than an afternoon's worth of mucking around with an in-memory hash table.


It's pretty serious, to hear Peter Norvig tell it: http://norvig.com/spell-correct.html. Also, it's one thing to do it, and another thing to make it fast.

Seems like it's one of those 80/20 things, to butcher that expression, where 80% of the effort goes into the last 20% of speed/quality. :)
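
The core of it is surprisingly small - roughly this, following the shape of Norvig's post, with the frequency-based ranking left out:

    # Generate every string one edit away and keep whichever candidates are
    # actual words. (Norvig ranks candidates by corpus frequency; this sketch
    # just returns them sorted.)
    import string

    WORDS = set(w.strip().lower() for w in open('/usr/share/dict/words'))

    def edits1(word):
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase]
        inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
        return set(deletes + transposes + replaces + inserts)

    def suggest(word):
        return sorted(edits1(word) & WORDS)

    print(suggest('speling'))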


What's major?

I'm pretty sure several man-years of work (if not decades) have gone into the MS Word spell checker.


Wow 256K, the luxury!

Apple //e, 64K memory (possibly 128K if you had the memory expansion card) (The Apple ][+ only had 48K)

You start with the 143K floppy containing the word processor in Drive 1 (S6,D1) and the floppy with your document in Drive 2 (S6,D2). Edit away, remember to save occasionally.

When you're done with your rough draft, you save your file, exit the word processor, and swap in the floppy containing the spell check program. Point it at your file and swap in the floppy with the dictionary. Work through the file and approve/reject its suggestions.

Exit the spell check and swap in the word processor floppy...

Repeat as often as necessary.


We have all this power and we use it in quite dull ways.

Not using a dictionary seems like more fun:

(http://www.spellingsociety.org/journals/j20/spellchecking.ph...)

> Morris, Robert & Cherry, Lorinda L, 'Computer detection of typographical errors', IEEE Trans Professional Communication, vol. PC-18, no.1, pp54-64, March 1975.


Wow, this article has no research and is just half-baked speculation. "there were some very clever data structures" yeah? Like what? And the proposed solution? Load the whole thing into a hash? Did you forget that spellchecker offer suggested corrections? How do you expect to implement that?


I'm no expert on data structures or algorithms but couldn't you just simply use bloom filters?


I don't think a bloom filter would be very useful for a spell checker. Bloom filters reduce disk reads when the majority of keys searched for are not in the dictionary. But in the context of a word processor, the majority of words tend to be spelled correctly, so very few would be rejected by the filter.


Bloom filters were in fact used.


If anyone is looking to refresh their programming chops for a problem like this, here [1] is a fairly straightforward, simple implementation of a spell-checker. Provided for a Python course at Harvey Mudd (the class I TA for uses this new curriculum), the task is to memoize a recursive string-distance algorithm. The results are actually quite lovely.

[1]: https://www.cs.hmc.edu/twiki/bin/view/ModularCS1/SPAM!

Note: copy the entire URL, the `!` breaks the hyperlinking.


Interesting, the README file in /usr/share/dict contains this:

--------------------------------------------------------

# $NetBSD: README,v 1.2 1997/03/26 07:14:32 mikel Exp $

# @(#)README 8.1 (Berkeley) 6/5/93

WEB ---- (introduction provided by jaw@riacs)

-------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth. The 1934 copyright has elapsed, according to the supplier. The supplemental 'web2a' list contains hyphenated terms as well as assorted noun and adverbial phrases. The wordlist makes a dandy 'grep' victim.

--------------------------------------------------------


A good spell checker is still a major feat of software engineering.


Large memory sizes did a similar thing to visibility determination in computer graphics. Determining which parts of multiple intersecting 3D shapes are visible used to be tricky and dependent on the shape types - triangle vs. triangle was different from triangle vs. sphere, tube vs. sphere, and so on. Now you can just use z-buffering, which determines visibility on a per-pixel basis, not to mention that all of this is implemented for you in hardware.
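
The per-pixel test itself is almost trivially simple once you can afford a depth value per pixel (toy sketch):

    # Toy z-buffer: one depth value per pixel; a fragment wins the pixel only
    # if it is closer than whatever has been drawn there so far.
    W, H = 320, 200
    depth = [[float('inf')] * W for _ in range(H)]
    color = [[0] * W for _ in range(H)]

    def plot(x, y, z, c):
        if z < depth[y][x]:        # closer than anything drawn so far?
            depth[y][x] = z
            color[y][x] = c

    # draw order no longer matters: the nearer fragment wins either way
    plot(10, 10, 5.0, 1)
    plot(10, 10, 2.0, 2)
    print(color[10][10])           # 2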


A better title: "An English-only spellchecker with a fixed, static vocabulary used to be a major feat of software engineering".

(because other cases still pretty much are)


Fascinating! I look forward to the day when today's data collection methods are similarly looked upon as trivial by future standards. Dealing with PDF documents that one cannot scrape, yet needs the data from, gets to be ridiculous!


Grate spellcheckers continue two bee a major feet of software engineering.



