If I understand the task correctly, it is to map a word into another word that belongs to the set of words that have the same letter frequency distribution.
So a possible solution is to define a representation for a letter frequency distribution, and use an associative data type for the mapping.
Since the problem does not concern itself with prefixes, there is no need for a prefix tree - in Python, a simple dict would do.
A few words about style: when you write code, you have to come up with names for variables, functions, and other things. In this case, it seems you wrote down the first thing that came to mind, which is often verbose because you're still in problem-exploration mode and things aren't clear yet. That's fine, but consider renaming later on, so that instead of a first-thing-I-thought-of name you have a name that gives the reader a sense of clarity. The process of doing that usually leads to a better understanding of the problem, and perhaps better code. With experience, you'll find that the need to rename diminishes.
As a rule of thumb, the length of a name should correlate with its scope: names with indefinite scope tend to be long, while names with limited scope tend to be shorter.
Finally, I encourage you to write more posts, and hope my critique is helpful.
> Since the problem does not concern itself with prefixes, there is no need for a prefix tree
This is a very important point, and many others in this thread have missed the fundamental point with using a trie. A real world example would immediately showcase its benefit.
On old feature phones, you had to press the number keys multiple times to get a letter. Some phones improved on this with something called predictive typing, which basically goes like this:
1. Each number can be mapped to one of 3-4 possible letters
2. Based on the current sequence of numbers typed in (i.e. the prefix up to this point), there can be zero or more matching words in the dictionary.
3. Using a trie data structure, you basically have a few possible leaf nodes (only dictionary words can be leaf nodes in the predictive typing scenario) which all map to the current sequence of numbers.
4. Hence you will notice one or more options from which you can select based on the number sequence you typed, thus saving you many keystrokes
You can present the different options sorted by word frequency etc. if there are multiple matching leaf nodes for the given prefix.
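As a rough sketch (in Python, with a toy dictionary and the standard keypad mapping; all names here are illustrative), the predictive-typing lookup described in the steps above might look like:

```python
# Sketch of T9-style predictive typing with a trie.
# Key 2 maps to "abc", 3 to "def", and so on.
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = word          # mark a complete dictionary word
    return root

def candidates(trie, digits):
    """All dictionary words whose letters match the digit sequence."""
    nodes = [trie]
    for d in digits:
        nodes = [node[ch] for node in nodes
                 for ch in KEYPAD[d] if ch in node]
    # collect words that are complete exactly at this depth
    return sorted(w for node in nodes for k, w in node.items() if k == '$')

trie = build_trie(['cat', 'bat', 'act', 'car'])
print(candidates(trie, '228'))   # ['act', 'bat', 'cat']
```

Each digit step fans out across 3-4 possible letters, but only paths that exist in the trie survive, which is exactly the pruning that makes this cheap.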
When the prefix scenario does not matter - as in OP's article, from what I remember the trie gives no benefit over a simple sorted dictionary.
>So a possible solution is to define a representation for a letter frequency distribution, and use an associative data type for the mapping.
Out of curiosity, could one such representation just be all of the letters in the word, sorted? I wonder how the time complexities would compare. I assume a trie would still be faster.
Yes. Another simple representation if you assume a maximum number of occurrences per letter:
(defvar *alphabet* "abcdefghijklmnopqrstuvwxyz")

(defconstant capacity-in-bits 3)

(defun key (word)
  (loop with key = 0
        with max-occurrences = (1- (ash 1 capacity-in-bits))
        for letter across *alphabet*
        for i upfrom 0
        for occurrences = (count letter word)
        do (if (> occurrences max-occurrences)
               (error "too many occurrences")
               (setf (ldb (byte capacity-in-bits (* i capacity-in-bits)) key)
                     occurrences))
        finally (return key)))
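For readers more comfortable with Python, a rough equivalent of the bit-packing idea above (same assumption: at most 7 occurrences of any letter) might look like:

```python
import string

BITS_PER_LETTER = 3  # allows counts 0..7 per letter

def key(word):
    """Pack per-letter counts into one integer, 3 bits per letter."""
    k = 0
    max_occ = (1 << BITS_PER_LETTER) - 1
    for i, letter in enumerate(string.ascii_lowercase):
        occ = word.count(letter)
        if occ > max_occ:
            raise ValueError("too many occurrences")
        k |= occ << (i * BITS_PER_LETTER)
    return k

print(key('gas') == key('sag'))  # True: anagrams share the same key
```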
from collections import Counter

freq_to_words = {}
make_freq = lambda word: tuple(sorted(Counter(word).items()))

with open('/usr/share/dict/words') as f:
    for line in f.read().split('\n'):
        if not line:
            continue
        freq = make_freq(line)
        if freq not in freq_to_words:
            freq_to_words[freq] = []
        freq_to_words[freq].append(line)

print(freq_to_words[make_freq('gas')])  # ['gas', 'sag']
For time complexities, this requires O(n) preprocessing but can do O(1) lookups, since it's a hash structure.
Ok, so I did my bachelor thesis on Tries (suffix tries to be exact). So imma run wild
If you get to preprocess, the trie is best. Building the trie takes time and memory that is O(size of dict = k). After that, lookups are O(length of word = n). Meanwhile, binary search is O(n log k) (that factor of n there is kinda contentious, as generally you will find a mismatch early in the word).
So if the dictionary is really large, or you're doing a shitton of lookups, the trie is better than binary search. Sadly, in practice tries don't work very well with caches. However, there is a very interesting intermediate form based on something called the LCP.
LCP stands for longest common prefix. The idea is to avoid rechecking the first part of the word by keeping track of how well we matched at the left and right boundaries of our search interval.
There is another magic trick we can use to get this technique up to the level of a trie. It depends on the Range Minimum Query (RMQ). A range minimum query asks: given a range (i, j) and an array A, what is the index of the minimum of A between A[i] and A[j]? It turns out that, with linear preprocessing, we can answer this in constant time. Now we build an array that, for any two adjacent words, stores their LCP. Any time we have an interval in our search, we don't go to the middle, but to the RMQ index of that interval. This way we split our search space more efficiently. In fact, each interval in this search method corresponds to a branching node in the trie.
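For the RMQ building block on its own, a common sketch is the sparse table: O(n log n) preprocessing rather than the linear bound mentioned, but constant-time queries and much simpler to write:

```python
# Sparse-table range-minimum query: O(n log n) preprocessing,
# O(1) queries. (The linear-preprocessing construction is more
# involved; this sketch trades space for simplicity.)
def build_sparse_table(a):
    n = len(a)
    table = [list(range(n))]  # level 0: each index is its own minimum
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + (1 << (j - 1))]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        j += 1
    return table

def rmq(a, table, i, j):
    """Index of the minimum of a[i..j] (inclusive)."""
    k = (j - i + 1).bit_length() - 1
    left, right = table[k][i], table[k][j - (1 << k) + 1]
    return left if a[left] <= a[right] else right

a = [3, 1, 4, 1, 5, 9, 2, 6]
print(rmq(a, build_sparse_table(a), 2, 6))  # 3, since a[3] == 1
```

The trick is that any range is covered by two (possibly overlapping) power-of-two ranges, and minimum is idempotent, so the overlap does no harm.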
I agree that binary search might not be the best (depending on constraints); my initial complaint was that there are faster ways to solve the stated problem than searching every word every time.
I personally like the binary tree method for systems that don't or can't load a dictionary in memory. I particularly like it because you can implement a search that requires little memory and does not require you to read the entire file. But I imagine that there are better ways to lay out the data that potentially could get you the best of both worlds.
I can see how a trie would be useful for doing more than just checking whether a word is spelled correctly, viz. finding the list of words that start with a prefix. The question asked was mostly about looking up a word. I posted something on this below before I found your reply; there I am concerned only with checking whether a word exists. Are my concerns with regard to the trie valid? Would you agree that a hash table would be faster than a trie (of any kind) for checking whether a word exists? Or have I overlooked something?
Comparing a hash-table to a trie is difficult in practice.
Theoretically, they are equally fast. Both take O(length of word) time. You could give an advantage to the trie because it allows for early failure detection, whilst the hash table requires processing the full word.
In practice though, for short words (e.g. shorter than the block size of the hash) the hash might effectively be constant time. And cache behavior is going to really matter: tries require a pointer dereference for each character (you can mitigate this somewhat by storing non-branching paths along the trie more compactly). Such dereferences tend to really wreck a cache.
So in the end, it's better cache behaviour for the hash table vs early stopping for the trie. Cache misses are so expensive, I'd guess the hash table wins in most situations, but that is really a guess.
I see your point about early detection. While a trie has precise early failure detection, a hash has two ways to fail early.
One is when the hash is simply not found. But at that point you have at least read the entire input word, computed the hash, and checked for it. I have not done any actual analysis, but just for the sake of argument, consider the hypothetical case where words fail about halfway through the testing process. With a hash we would still have read the full word, while with a trie we would have compared about half the word against half the letters along a trie path. So conceptually (at least in my head) both the hash and the trie would have done something with about the same number of letters. The question then is whether computing and checking the hash is more efficient than traversing the trie.
The second way the hash can fail early is when we have a hash match and compare against the words the hash entry holds. If the collision rate is low we might only have to check one word; if it is high, many words. So I would hope the hash is tuned for a low collision rate. But when comparing two words there is also early detection (this is the point I was getting at): the entire word does not need to be compared, and it would likely be compared a machine word at a time (4 or 8 bytes at a time).
If I have time this weekend I think I want to implement both of these to benchmark.
All that being said, the more I think about it, the more I think a trie would be best for a spell checker in general. Simply checking whether something is a word is nice, but nicer still is the ability to suggest alternatives to a misspelled word, and a trie would be much nicer to work with when building that list.
I missed the point (as others have said) that a Trie will take less memory as it stores common prefixes only once.
As for early mismatch detection, the statistics of that elude me at the moment. Certainly, early detection decreases as you get more words. However, as you get more words, you probably get longer words (lest you run out of space) so that makes early detection more valuable.
If I recall correctly, that is kind of like a compressed trie, only instead of a tree one builds a directed acyclic graph, basically taking common suffixes into account as well.
Space usage is indeed the biggest advantage, as you are removing redundant branches from the tree form of a trie. Construction of these seems difficult (from the article, it seems to involve building the tree and then using hashes to find redundant branches).
I think I never gave them much thought because the idea doesn't work for suffix tries, which can be used to search not just for prefixes (like a normal trie) but for substrings (by treating them as prefixes of suffixes). The optimization that gets around the O(n^2) total length of all suffixes seems to make it hard to turn this into a DAG.
I would argue a trie is not that good compared to a hash. Each node is going to be either a list of children (which you have to search) or a fixed-size array pointing to the next letter (which could be problematic for languages that use more than a-zA-Z).
I just don't want to pretend that it's either a trie or checking all 235,887 words every time.
This blog post could have been something much more, had it not stuck with the naive search every word method as the only alternative to a trie.
> I would argue a trie is not that good compared to a hash.
It can be more memory efficient than a hash table if there are enough shared prefixes, and it supports prefix and range queries. As always, different data structures have different tradeoffs.
> Each letter is going to be a list (which you have to search), or have a fixed sized array pointing to the next letter (could be problematic with languages that use more than a-zA-z).
A common maneuver here is to encode strings as byte sequences in UTF-8 (or some such encoding), use 256-element arrays for nodes with enough children that the linear search would be inefficient, and possibly to use hash tables for nodes that are somewhere in the middle. See here for more discussion:
Not everyday I get to share this! I wrote some code in Go that uses a trie to quickly find anagrams. My trick is that I load the trie with a dictionary, but before I insert each word, I sort it alphabetically (ex: sort -> orst). Before I lookup a word I sort it alphabetically, so when I locate it in the trie, I also get all the other words that share the same characters. Because of the nature of the data structure, I can also quickly find all the smaller anagrams which will be handy when I get around to turning this into a full fledged scrabble playing AI: https://github.com/RyanEdwardHall/anagrambler
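The sorted-insert trick is independent of the Go code linked above; a minimal Python sketch of the same idea (toy word list, illustrative names) could be:

```python
# Insert each word under its sorted letters ("sort" -> "orst"), so
# one trie lookup returns the whole anagram class at once.
def insert(trie, word):
    node = trie
    for ch in sorted(word):
        node = node.setdefault(ch, {})
    node.setdefault('$', []).append(word)  # store the original word

def anagrams(trie, word):
    node = trie
    for ch in sorted(word):
        if ch not in node:
            return []
        node = node[ch]
    return node.get('$', [])

trie = {}
for w in ['sort', 'rots', 'tors', 'trie']:
    insert(trie, w)
print(anagrams(trie, 'orst'))  # ['sort', 'rots', 'tors']
```

The "smaller anagrams" mentioned above fall out naturally: any node visited along the sorted path whose subtree contains a `'$'` entry corresponds to an anagram of a subset of the letters.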
It is refreshing to see a data structure "explained" without a drawing of some boxes and arrows. +1 there for the creativity -- I'd be even more pleased to see a data structure explained with words without relying on images at all.
Also:
...to check if a word exists in the text file, it takes
at most, as many operations as the length of the word
itself. Much better than the 235,887 operations it
was going to take before.
But your source for the words (/usr/share/dict/words) is probably already sorted! I don't think that's a fair upper bound to quote.
Still, trie lookup time is independent from your dictionary size. Sadly, in practice they are hurt by the many pointer dereferences of traversing the tree.
If you have a sorted list of strings, you can speed up the search by tracking how many starting characters on the left and right boundary of your interval already match the given word. In practice, this turns out to be almost as good as tries as most steps eliminate quite a few characters to check.
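A sketch of that boundary-tracking binary search (in Python, with illustrative names; the invariant is that every word in the current interval shares at least min(l_lcp, r_lcp) characters with the query):

```python
def lcp(a, b, start=0):
    """Length of the longest common prefix of a and b, starting at start."""
    i = start
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def contains(sorted_words, query):
    lo, hi = 0, len(sorted_words) - 1
    l_lcp = r_lcp = 0  # matched characters at the left/right boundary
    while lo <= hi:
        mid = (lo + hi) // 2
        # characters up to min(l_lcp, r_lcp) are known to match already
        skip = min(l_lcp, r_lcp)
        m = lcp(sorted_words[mid], query, skip)
        if m == len(query) == len(sorted_words[mid]):
            return True
        if sorted_words[mid][m:] < query[m:]:
            lo, l_lcp = mid + 1, m
        else:
            hi, r_lcp = mid - 1, m
    return False

words = sorted(['apple', 'apply', 'ape', 'apt', 'bat'])
print(contains(words, 'apply'), contains(words, 'app'))  # True False
```

Each comparison starts where the boundary matches left off, so over the whole search most characters of the query are examined only once.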
If you like theory, you can add on Range Minimum Query to get the same theoretical performance as tries. In practice though, this isn't necessary.
The coolest use of Tries I've ever seen is the Apriori algorithm. It exploits the trie properties and the structure of the query sequence in a really clever way. It's beautiful.
One might note that properly implemented hash tables also have O(k) lookup times. And in fact, much of the time well-tuned hash tables outperform well-tuned tries.
Tries still have some advantages though:
1) The naive implementation of a trie, using an alphabet-sized vector for the child nodes, is much closer to ideal performance than the naive implementation of a hash table. If I had to implement a dictionary on short notice, I would use a trie.
2) Tries preserve ordering, while still allowing O(1) [in the number of items in the dictionary] insert, removal, and lookup. If you need the key that is lexicographically next or previous to another key, or you need to find the two keys that bound a particular value, it's a great structure to use. Comparison-based trees can also do this, but not with constant-time insert and remove.
For a really space-efficient trie-like structure, see crit-bit trees[1], which are basically a space-efficient radix tree on a binary alphabet (so each internal node has a fanout of exactly 2).
I recently wrote a basic trie in Nim for fun. I started with a preallocated array and then rewrote parts to use hash tables; it's simpler since you don't need to care about the dictionary size, and it can have space savings depending on how the table grows. It wouldn't surprise me if there were a fancy bitmapping approach, though, or if you could be more efficient if you had n-gram statistics. (And as mentioned elsewhere, for the purposes of storing/searching a dictionary you could just have a flat hash table indexed by the words, or a Bloom filter if you only care about certainly not existing...)
I'm not familiar with Python, but would a C extension be the best way to implement a data structure like this, that requires high performance? Or is it possible to write efficient enough code in Python?
When algorithms matter, you do get a lot of performance out of a better algorithm. But I actually recently wrote an admittedly not production ready trie extension in C and it was about five lines of SWIG to use it from python.
The problem with anagrams is you'd have to test all permutations of a word, which is n! hash lookups. Alternately you could store all prefixes in the set and prune the permutations that way. Slightly more interesting than word in set(words).
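A quick sketch of that prefix-pruning idea (Python, toy dictionary; names are illustrative):

```python
# Anagram search by permutation, pruned with a set of all prefixes:
# any partial permutation that isn't a prefix of some dictionary
# word is abandoned, avoiding the full n! blowup.
words = {'cat', 'act', 'dog'}
prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}

def anagrams_of(word):
    found = set()
    def extend(prefix, remaining):
        if prefix and prefix not in prefixes:
            return  # no dictionary word starts this way: prune
        if not remaining:
            if prefix in words:
                found.add(prefix)
            return
        for i, ch in enumerate(remaining):
            extend(prefix + ch, remaining[:i] + remaining[i + 1:])
    extend('', word)
    return sorted(found)

print(anagrams_of('tac'))  # ['act', 'cat']
```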