
A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python ... That's progress.
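For reference, the naive version being alluded to might look roughly like this (a sketch; the article doesn't show the actual code, and a Python set is a hash table under the hood):

    # Load the word list into a hash table (a Python set) and test membership.
    words = set()
    with open('/usr/share/dict/words') as f:
        for line in f:
            words.add(line.strip())
    print('zymurgy' in words)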

That's progress in terms of cheap RAM and cheap CPU cycles. In terms of software architecture, that's not progress -- that's brute force.

Assuming that code for efficiently indexing and compressing this kind of dictionary was written 25 years ago, there doesn't seem to be any good reason not to reuse it. The fact that there's impedance to doing so means that our languages and our tools still need improving.

(25 years ago I'd have predicted that in 2012 some kind of AI-driven optimizer could have figured out the correct data structures for this problem and automatically converted the naive hash table lookup into a more efficient structure)

Also, jeesh, use mmap() and a binary search or something.
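For the curious, a rough sketch of that mmap-plus-binary-search idea, assuming the word file is newline-delimited and sorted in plain byte order (real dict files are often sorted case-insensitively, so some normalization may be needed first):

    import mmap

    def open_words(path='/usr/share/dict/words'):
        # Map the file read-only; the OS pages it in on demand.
        f = open(path, 'rb')
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def contains(mm, word):
        # Binary search over the sorted, newline-delimited lines.
        target = word.encode()
        lo, hi = 0, len(mm)
        while lo < hi:
            mid = (lo + hi) // 2
            nl = mm.rfind(b'\n', lo, mid)
            start = nl + 1 if nl != -1 else lo  # start of the line containing mid
            end = mm.find(b'\n', mid)
            if end == -1:
                end = len(mm)
            line = mm[start:end]
            if line == target:
                return True
            elif line < target:
                lo = end + 1
            else:
                hi = start
        return False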




> Assuming that code for efficiently indexing and compressing this kind of dictionary was written 25 years ago, there doesn't seem to be any good reason not to reuse it.

What if the efficient code is extremely "clever" and hard to understand from reading the code (and thus hard to fix bugs, add new features, etc.)? Doesn't it make sense to revert to the simpler and technically less efficient version if you know that the hardware will be more than enough?

I suppose you could still argue that the more efficient version is better, at least once it is mature and packaged in some library or framework.


What makes sense is creating a spell checker library, and then having other programs use that instead of implementing spell check themselves.

Granted, if your program is such that a naive, 3-5 line Python implementation is sufficient, then it is probably not worth the effort to use a library. But if my web browser, word processor, e-mail client, and every other large program that has text editing each implement spell check themselves, I doubt I would get performance comparable to them all using a library designed specifically for spell check and optimized by the many projects that have used it over its history. On that subject, I would be interested to know whether the large programs I mentioned actually implement spell check themselves.
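For what it's worth, calling a dedicated spell-check library is barely more code than the naive version. A sketch using the pyenchant binding (my pick purely for illustration; the thread doesn't name a specific library):

    import enchant  # pip install pyenchant; binds the Enchant spell-checking library

    d = enchant.Dict("en_US")
    d.check("spelling")   # True
    d.check("speling")    # False
    d.suggest("speling")  # ranked suggestions, e.g. ['spelling', ...]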


No


Well, nothing prevents this from being 3-5 lines of Python with more suitable data structures:

    $ pip install dawg
and then

    import dawg
    words = open('/usr/share/dict/words', 'r').read().splitlines()
    d = dawg.DAWG(words)
(this example actually works)
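Lookups against the built structure then read like a set test (if I recall the dawg module's API correctly, built DAWGs support the `in` operator):

    'zymurgy' in d  # membership test without keeping a hash table of full strings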


Let's not store the whole file in memory at once!


I think the point of using data structures like DAWG is to reduce the memory consumption to the point where it is feasible to store the whole dataset in memory.

A practical DAWG application would look like the following anyway:

    import dawg
    d = dawg.DAWG().load('words.dawg')
because DAWG minimization may require a lot of memory.
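The one-time, offline build that produces words.dawg would just be the earlier construction plus a save, assuming the module provides a save() counterpart to load():

    # Run once, on a machine with enough memory for the minimization step.
    d = dawg.DAWG(words)
    d.save('words.dawg')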


Ok.

    import dawg
    with open('/usr/share/dict/words','r') as words:
        d = dawg.DAWG(w.strip() for w in words)


a minor remark: with the current `dawg` module implementation this should be

    import dawg
    with open('/usr/share/dict/words','r') as f:
        words = (w.strip() for w in f)
        d = dawg.DAWG(words, input_is_sorted=True)


I assume GP meant: don't store all the words in memory at once.


There is an AI-driven optimizer in play [1], but what has changed is what gets optimised. What is far more important now is optimising programmer time, time to market, testability, etc. Using less CPU and RAM is largely irrelevant until the usage is actually noticeable by the user - taking 50 milliseconds instead of 70 milliseconds to do a spell check is not something any user can even detect. It would be a huge waste of programmer time and tooling budget to focus on that instead of all the other things that need to be developed and made available to users.

[1] The optimizer is a human brain, or a collection of them if this isn't a lone wolf project


The advent of mobile computing has finally dispelled the myth that processor speed solves all known problems. Maximizing idle time still matters.


Time to market and developer productivity are still at the forefront. This is why you repeatedly see apps being released, user outcry, and then later revisions trying to address those concerns (battery, CPU, memory, responsiveness, etc.). I doubt anyone actually maximizes idle time; rather, they decrease user complaints and improve acceptance. There is no need to maximize it if users don't notice.


In line with what baddox says, if the tradeoffs that were made 25 years ago don't match what you're willing to trade off today, it is normal to rewrite it. I18n support, better external library support, or system integration could be any of the upsides of rewriting it.



