
A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python ... That's progress.
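For reference, the naive version being alluded to might look roughly like this (a sketch; the article doesn't show the actual code, and a Python set is a hash table under the hood):

    # Load the word list into a hash table (a Python set) and test membership.
    words = set()
    with open('/usr/share/dict/words') as f:
        for line in f:
            words.add(line.strip())
    print('zymurgy' in words)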

That's progress in terms of cheap RAM and cheap CPU cycles. In terms of software architecture, that's not progress -- that's brute force.

Assuming that code for efficiently indexing and compressing this kind of dictionary was written 25 years ago, there doesn't seem to be any good reason not to reuse it. The fact that there's impedance to doing so means that our languages and our tools still need improving.

(25 years ago I'd have predicted that in 2012 some kind of AI-driven optimizer could have figured out the correct data structures for this problem and automatically converted the naive hash table lookup into a more efficient structure)

Also, jeesh, use mmap() and a binary search or something.
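For the curious, a rough sketch of that mmap-plus-binary-search idea, assuming the word file is newline-delimited and sorted in plain byte order (real dict files are often sorted case-insensitively, so some normalization may be needed first):

    import mmap

    def open_words(path='/usr/share/dict/words'):
        # Map the file read-only; the OS pages it in on demand.
        f = open(path, 'rb')
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def contains(mm, word):
        # Binary search over the sorted, newline-delimited lines.
        target = word.encode()
        lo, hi = 0, len(mm)
        while lo < hi:
            mid = (lo + hi) // 2
            nl = mm.rfind(b'\n', lo, mid)
            start = nl + 1 if nl != -1 else lo  # start of the line containing mid
            end = mm.find(b'\n', mid)
            if end == -1:
                end = len(mm)
            line = mm[start:end]
            if line == target:
                return True
            elif line < target:
                lo = end + 1
            else:
                hi = start
        return False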




> Assuming that code for efficiently indexing and compressing this kind of dictionary was written 25 years ago, there doesn't seem to be any good reason not to reuse it.

What if the efficient code is extremely "clever" and hard to understand from reading the code (and thus hard to fix bugs, add new features, etc.)? Doesn't it make sense to revert to the simpler and technically less efficient version if you know that the hardware will be more than enough?

I suppose you could still argue that the more efficient version is better, at least once it is mature and packaged in some library or framework.


What makes sense is creating a spell checker library, and then having other programs use that instead of implementing spell check themselves.

Granted, if your program is such that a naive, 3-5 line Python implementation is sufficient, then it is probably not worth the effort to use a library. But if my web browser, word processor, e-mail client, and every other large program that has text editing each implement spell check themselves, I doubt I would get performance comparable to them all using a library designed specifically for spell check and optimized by the many projects that have used it over its history. On that subject, I would be interested to know whether the large programs I mentioned actually implement spell check themselves.
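For what it's worth, calling a dedicated spell-check library is barely more code than the naive version. A sketch using the pyenchant binding (my pick purely for illustration; the thread doesn't name a specific library):

    import enchant  # pip install pyenchant; binds the Enchant spell-checking library

    d = enchant.Dict("en_US")
    d.check("spelling")   # True
    d.check("speling")    # False
    d.suggest("speling")  # ranked suggestions, e.g. ['spelling', ...]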


No


Well, nothing prevents this from being 3-5 lines of Python with more suitable data structures:

    $ pip install dawg
and then

    import dawg
    words = open('/usr/share/dict/words', 'r').read().splitlines()
    d = dawg.DAWG(words)
(this example actually works)
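Lookups against the built structure then read like a set test (if I recall the dawg module's API correctly, built DAWGs support the `in` operator):

    'zymurgy' in d  # membership test without keeping a hash table of full strings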


Let's not store the whole file in memory at once!


I think the point of using data structures like DAWG is to reduce the memory consumption to the point where it is feasible to store the whole dataset in memory.

A practical DAWG application would look like the following anyway:

    import dawg
    d = dawg.DAWG().load('words.dawg')
because DAWG minimization may require a lot of memory.
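The one-time, offline build that produces words.dawg would just be the earlier construction plus a save, assuming the module provides a save() counterpart to load():

    # Run once, on a machine with enough memory for the minimization step.
    d = dawg.DAWG(words)
    d.save('words.dawg')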


Ok.

    import dawg
    with open('/usr/share/dict/words','r') as words:
        d = dawg.DAWG(w.strip() for w in words)


a minor remark: with the current `dawg` module implementation this should be

    import dawg
    with open('/usr/share/dict/words','r') as f:
        words = (w.strip() for w in f)
        d = dawg.DAWG(words, input_is_sorted=True)


I assume GP meant: don't store all the words in memory at once.


There is an AI-driven optimizer in play [1], but what has changed is what gets optimised. What is far more important now is optimising programmer time, time to market, testability, etc. Using less CPU and RAM is largely irrelevant until the usage is actually noticeable by the user - taking 50 milliseconds instead of 70 milliseconds to do a spell check is not something any user can even detect. It would be a huge waste of programmer time and tooling budget to focus on that instead of all the other things that need to be developed and made available to users.

[1] The optimizer is a human brain, or a collection of them if this isn't a lone wolf project


The advent of mobile computing has finally dispelled the myth that processor speed solves all known problems. Maximizing idle time still matters.


Time to market and developer productivity are still at the forefront. This is why you repeatedly see apps being released, user outcry, and then later revisions trying to address those concerns (battery, CPU, memory, responsiveness, etc.). I doubt anyone actually maximizes idle time; rather, they decrease user complaints and improve acceptance. There is no need to maximize it if users don't notice.


In line with what baddox says, if the tradeoffs that were made 25 years ago don't match what you're willing to trade off today, it is normal to rewrite it. I18n support, better external library support, or system integration could be any of the upsides of rewriting it.



