Announcing Soulmate, a Redis-backed service for fast autocompleting (seatgeek.com)
97 points by ericwaller on Feb 14, 2011 | 23 comments



Are you really using the technique described in the Redis auto-complete page? Doesn't that method use a lot more RAM than a more specialized approach (e.g. a trie) would need?

Also, from what I can tell, every query is O(log N) in the size of the completion set, rather than linear in the length of the query/suggestion (as it would be with a trie). Seems like this might have trouble scaling to large suggestion sets.


We're actually using a slight variation of the second technique described, which involves maintaining a sorted set of the most relevant results for every possible prefix.

A query for a single-prefix term (e.g. "yank") is technically O(log(N) + M), where N is the # of items in the sorted set and M is the # of items returned. But since in practice we only want a handful of results, we can cap each prefix's sorted set, keeping both N and M small.

One tradeoff we made is that this approach doesn't support incremental updates easily: you can always add new items, but it would take a good bit of work to remove items that have expired or to update the scores of items that change.

As for memory usage, we only store each item once (in a Redis hash), and use the unique ids as entries in the sorted sets, which sort of approaches a trie. Instead of:

ya -> yan -> yank -> yanke -> yankee -> yankees -> [data about yankees]

we have:

ya -> 101
yan -> 101
...
101 -> [data about yankees]
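
Roughly, the indexing and lookup look like this (a sketch only, in Python with redis-py rather than the actual gem; the key names and the per-prefix cap are just illustrative):

    import json
    import redis  # assumes the redis-py client

    r = redis.Redis()

    def index_item(item_id, term, score, data, cap=50):
        # store the payload once, keyed by id
        r.hset("sm:data", item_id, json.dumps(data))
        # add the id to a sorted set for every prefix (length >= 2)
        for end in range(2, len(term) + 1):
            key = "sm:index:" + term[:end].lower()
            r.zadd(key, {item_id: score})
            # keep only the top `cap` entries per prefix so N stays small
            r.zremrangebyrank(key, 0, -(cap + 1))

    def complete(prefix, limit=5):
        # O(log N + M): top ids for this prefix, then one HMGET for the payloads
        ids = r.zrevrange("sm:index:" + prefix.lower(), 0, limit - 1)
        return [json.loads(d) for d in r.hmget("sm:data", ids) if d] if ids else []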


Yeah, that's what it sounded like.

Aside from not supporting incremental updates, that approach has a high memory overhead. For every word of length N, you've got to store 1 + 2 + ... + N-1 + N characters, which is O(N^2). The constant is 1/2, and you do get some savings in shared prefixes, so that's definitely a worst-case upper bound -- but a trie grows with O(N).

The O(log(N)) lookup time can also be a limiting factor with large sets. A trie is going to give you worst-case linear lookup time in the length of the longest string in your set. This can be a pretty dramatic difference as string sets get large.
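
To make the memory point concrete, here's a toy comparison (my own back-of-the-envelope script, not a benchmark): total characters stored under the all-prefixes scheme versus the number of distinct prefixes a trie would need nodes for.

    def prefix_chars(words):
        # all-prefixes scheme: a word of length k stores 1 + 2 + ... + k
        # = k*(k+1)/2 characters worth of prefix keys
        return sum(len(w) * (len(w) + 1) // 2 for w in words)

    def trie_nodes(words):
        # a trie needs one node per *distinct* prefix; shared prefixes are stored once
        return len({w[:i] for w in words for i in range(1, len(w) + 1)})

    words = ["yankees", "yankee", "yanks", "rangers", "rays"]
    print(prefix_chars(words))  # 102 characters of prefix keys
    print(trie_nodes(words))    # 17 distinct prefixes / trie nodes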


Isn't that O(Nk), where k is the average word length and N is the number of words? Given an upper bound on k, this is no worse than a trie. Of course, if I've misunderstood that part, then ignore what follows.

In a real-world scenario (English), since words share so much structure, this only grows as O(N). [1] Although the top 1000 recently searched items might not share as much structure, that is a trivial amount of data, even for O(N^2). The storage overhead on a trie is high, especially as many bigrams don't occur in English, so array-based tries necessarily waste space.

You're right that the lookup time is killer, though. A ternary search tree gives you a good trade-off – really it's just a good way to store a trie with less memory overhead.
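
For the curious, a ternary search tree really is only a screenful of code (a toy sketch of my own, not anything from Soulmate):

    class Node:
        __slots__ = ("ch", "lo", "eq", "hi", "is_word")
        def __init__(self, ch):
            self.ch = ch
            self.lo = self.eq = self.hi = None
            self.is_word = False

    def insert(node, word, i=0):
        # standard TST insert; assumes non-empty words
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def _collect(node, acc, out):
        # in-order walk of a subtree, rebuilding strings along the eq links
        if node is None:
            return
        _collect(node.lo, acc, out)
        word = acc + node.ch
        if node.is_word:
            out.append(word)
        _collect(node.eq, word, out)
        _collect(node.hi, acc, out)

    def complete(root, prefix):
        # walk down to the node matching the last character of the (non-empty) prefix
        node, i = root, 0
        while node is not None:
            ch = prefix[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(prefix):
                    out = [prefix] if node.is_word else []
                    _collect(node.eq, prefix, out)
                    return out
                node = node.eq
        return []  # prefix not in the tree

    root = None
    for w in ["yankees", "yankee", "yanks", "rangers"]:
        root = insert(root, w)
    print(complete(root, "yan"))  # ['yankee', 'yankees', 'yanks']

Each node holds one character plus three pointers, so memory grows with the number of distinct prefixes rather than with their total length.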

1. Running the following commands

    # wc -l < /usr/share/dict/words
    234936
    # cat /usr/share/dict/words | sed '{: foo
        p
        s/^\(.*\).$/\1/
        t foo
        p
    }' | sort | uniq | wc -l
    791089
shows only a 3.3x increase in going from words to all prefixes of a large English dictionary. For propernames, 1323 words turn into 3612 prefixes, a factor of 2.7.


"Isn't that O(Nk), where k is the average word length and N is the number of words. Given an upper bound on k, this is no worse than a trie."

Let's make an even stronger assumption that every word in your set has length k. You then need to store (k^2 + k)/2 characters per distinct word with the sorted set implementation (plus any additional data-structure overhead). In a worst-case scenario, you have N of these k-length words, distinct at the first character (this might seem paranoid for something like ASCII, but think about Unicode here... or just change the problem to N strings that first differ at character >= 2), resulting in a worst-case memory bound of O(N * ((k^2 + k) / 2)).

That said, N will be much larger than k for any non-trivial data set, so I suppose it's fair to say that real-world memory use is something more like O(N) with a very high constant that's quadratically related to your average word length. But that's still substantially worse than a well-implemented trie (where the constant is small and unrelated to k), and with large data sets every byte counts.

"Although the top 1000 recently searched items might not share as much structure, that is a trivial amount of data, even for O(N^2)....the storage overhead on a trie is high, especially as many bigrams don't occur in English, so array-based tries necessarily waste space."

If your set size is only 1,000, then yeah, this conversation is academic. I'm talking about big suggestion sets -- for a website, it's not too difficult to get into a situation where your suggestion set is in the hundreds of thousands or millions of strings or more. Remember that web text is nothing like /usr/share/dict. Proper nouns, acronyms, non-words, new words, foreign words, etc. all conspire to make real-world language much larger and more diverse than the system dictionary. The size of real-world vocabularies is one of those counter-intuitive things that I wouldn't have guessed before working on websites.

Finally, while you definitely have to be careful when implementing a trie, there are improved trie algorithms (like burst tries) that greatly improve upon average- and worst-case memory usage. A totally naïve trie implementation has a ton of overhead... but then, a naïve set implementation has a ton of overhead, too. So it's important to not be naïve. ;-)

(related: looks like the redis guys have been oddly resistant to adding tries, despite several requests: https://code.google.com/p/redis/issues/detail?id=110)


(These are all excellent points and it's been a pleasure to discuss them with you.)

Yes, it should have been O(N * k^2). I just didn't think it was as bad as O(N^2). (Of course if you switch to Unicode, your trie gets even more exciting ;)

I guess suggesting search queries is more complex, since you're now talking about sentences. I'm not convinced that web text is nothing like the system dictionary, though – that dictionary has 300 thousand words, which dwarfs the lexicon of even the most advanced user of English. Also, proper nouns and new words still follow the phonetic structure of the base language, so they will compress well, but you're right about acronyms and non-words. Do you have any references on the size of real-world vocabularies? Obviously personal experience with implementing such systems counts for a great deal.

Finally, I agree that a non-naive trie is the best data structure for this (in fact, tries seem to be the best data structure full stop – some even beat hash tables on lookup by index), but if you are building on the primitives you've got, sorted sets don't seem too bad.


As a UX note, I've always quietly loved autocompletes that aren't just a flat list of terms, but actually contain structured information organized in an intuitive fashion. The design of the suggestions on SeatGeek is fantastic.


Thanks, very much agreed. I think Spotlight on OSX is probably the first place I saw it done really well. I remember on XP I used to use search all the time, but now on OSX I just autocomplete to what I want probably 95% of the time.


Would be cool to have an example of this running on websockets, to get rid of the request/response latency that most autocompletes have. Keep the socket open while the text field is focused and you should be able to cut the response time down even further without that overhead.
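
Something like this, maybe (purely a sketch: it assumes a recent version of the Python `websockets` package and a stand-in `complete()` lookup; none of it is part of Soulmate):

    import asyncio
    import json
    import websockets  # assumption: the `websockets` package, a recent release

    def complete(prefix):
        # stand-in lookup; a real one would hit Redis like the sketch upthread
        corpus = ["yankees", "yankee stadium", "yanni"]
        return [s for s in corpus if s.startswith(prefix)]

    async def suggest(ws):
        # one socket stays open while the text field is focused;
        # every message in is a prefix, every message out is a JSON list
        async for prefix in ws:
            await ws.send(json.dumps(complete(prefix)))

    async def main():
        async with websockets.serve(suggest, "localhost", 8765):
            await asyncio.Future()  # serve until cancelled

    if __name__ == "__main__":
        asyncio.run(main())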


Considering Chrome and Firefox are switching off WebSockets until the security issues are resolved, that wouldn't be very viable. I guess you could fall back to Flash sockets or long polling.


The example autocomplete on Seatgeek.com is indeed impressively fast.

I might have to use this in my next service.


Thanks. If you do, definitely let us (SeatGeek) know how it goes.


Quick bit of feedback: I thought it was odd that I couldn't type in my city name in conjunction with a band name to pare things down to just the show in my town. Instead, it shows no results.


Very good point. I think optimally a really solid autocompleter allows you to bypass search as often as possible, and location awareness is definitely an important part of that.


How many phrases of length 30 could you handle with 1GB of RAM?

Or do you have numbers on the mean length of a phrase you handle currently, the number of such phrases and how much memory it takes?


Awesome work guys! I really hope this type of UI becomes more widespread. On a side note, I've always thought the guys over at www.glyde.com execute it quite well.


Oh, oh yes. Thank you kindly. I was literally about to embark on this very feature. Let's take a closer look...


Great stuff!

I'm curious: what do you think about exposing this service via WebSockets? Would that make it even faster?


The UI on SeatGeek is almost identical to what Rdio provides; I wonder if Rdio is using it.


My first few searches took a while, then every search was pretty fast. Is it the load?


How ironic, this is perfect for the dating site I'm working on :)


Glad to hear the name works for someone. We thought it meshed nicely with some of the other hyperbole out there in ruby gem names (god, unicorn, thor, shotgun, etc.).


Nice work.



