Their solution seems over-engineered for the scale of data they're concerned with.
A 10 digit phone number can be encoded in 34 bits. Supposing they used 64-bit-wide fields to store each entry, that leaves 30 bits (more than their 2 bytes) to play with for flags or whatever target data. To store the whole dataset in a trie or tree-like structure on disk in this model, they would need 8 bytes * 400 million phone numbers < 3 terabytes of data. So a single 3TB or 4TB HDD ($130USD/ea) would suffice, and in the trie- or tree-like on-disk format that would give them an access/read time in the single-digit or low double-digit milliseconds, or below 1 millisecond for 3 striped 1TB SSDs ($300USD/ea), for example. The disk controller's cache and the host's file buffers/cache would work in their favor as well, since there are probably plenty of hotspots in the data. mmap could make the job of reading the file/disk even easier (and faster).
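A minimal sketch of that bit packing in C (the 34/30 split and the field names here are just for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* A 10-digit number fits in 34 bits (2^34 > 10^10); the remaining
     * 30 bits of a 64-bit word are free for flags/preference bits. */
    #define PHONE_MASK ((1ULL << 34) - 1)

    static uint64_t pack_entry(uint64_t phone, uint32_t flags) {
        return (phone & PHONE_MASK) | ((uint64_t)flags << 34);
    }

    static uint64_t entry_phone(uint64_t e) { return e & PHONE_MASK; }
    static uint32_t entry_flags(uint64_t e) { return (uint32_t)(e >> 34); }

    int main(void) {
        uint64_t e = pack_entry(9876543210ULL, 0x3);  /* hypothetical number/flags */
        printf("%llu flags=%u\n", (unsigned long long)entry_phone(e), entry_flags(e));
        return 0;
    }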
The data just isn't that big by today's standards.
Since false negatives are not possible for a properly implemented Bloom filter, using one in front of this on-disk approach would make the negative case even faster: the disk would only be hit for positive or false-positive cases, and the latter should be relatively rare.
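A minimal sketch of such a Bloom filter front-end, assuming double hashing and a bit array sized purely for illustration (you'd tune the size and probe count for a target false-positive rate):

    #include <stdint.h>
    #include <stdlib.h>

    #define NBITS (1ULL << 32)   /* 4 Gbit = 512 MB bit array; size is an assumption */
    #define NHASH 7              /* probes per key */

    static uint8_t *bits;

    /* SplitMix64-style mixer used to derive two base hashes per key. */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    static void bloom_add(uint64_t phone) {
        uint64_t h1 = mix64(phone), h2 = mix64(phone ^ 0x9e3779b97f4a7c15ULL) | 1;
        for (int i = 0; i < NHASH; i++) {
            uint64_t b = (h1 + (uint64_t)i * h2) % NBITS;
            bits[b >> 3] |= (uint8_t)(1u << (b & 7));
        }
    }

    /* 0 means definitely absent (skip the disk); 1 means maybe present (hit the disk). */
    static int bloom_maybe(uint64_t phone) {
        uint64_t h1 = mix64(phone), h2 = mix64(phone ^ 0x9e3779b97f4a7c15ULL) | 1;
        for (int i = 0; i < NHASH; i++) {
            uint64_t b = (h1 + (uint64_t)i * h2) % NBITS;
            if (!(bits[b >> 3] & (1u << (b & 7)))) return 0;
        }
        return 1;
    }

    int main(void) {
        bits = calloc(NBITS / 8, 1);
        bloom_add(9876543210ULL);
        return bloom_maybe(9876543210ULL) ? 0 : 1;
    }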
EDIT: It might be even simpler/better/faster than a trie/tree to store the data indexed by some hash function (even just prefix bucketing) and then sequentially scan that bucket's collision list for the target number. That would take advantage of the much better sequential read throughput of most commodity HDDs/SSDs, so that only one random read is necessary.
I must be sleep deprived. 1024^3 = gigabytes, not terabytes. Wow, then they've definitely over-engineered it, and disk isn't even necessary in the event of a 10-fold increase in the database size. Tries, search trees, or properly sized hash tables on phone number prefixes should be plenty fast and the whole thing can fit in RAM on a modern laptop...
Our solution maintains the data in a simpler scheme because keeping the data in a maintainable format is an important aspect when the data is ever changing - every couple of days millions of people update their preferences and we need to be able to apply these changes. When choosing a single data structure for all the numbers, we need to optimize either for space or for faster inserts/updates. Our solution is a trade-off between these options.
A hash table is more than compact enough. It should work fine even at high load factors (80%/90%) given a decent hash scheme (e.g. Robin Hood hashing), which means you need only 4GB of memory. It's random access and on average should have one cache miss per lookup. Updating is rather trivial.
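A minimal Robin Hood sketch along those lines, with each slot holding a packed 64-bit entry (the table size, hash mix, and 34-bit layout are assumptions for illustration):

    #include <stdint.h>
    #include <stdlib.h>

    #define PHONE_MASK ((1ULL << 34) - 1)
    #define TABLE_SIZE (1ULL << 29)   /* ~537M slots: ~75% load for 400M keys */
    #define EMPTY 0ULL                /* valid numbers are nonzero, so 0 can mean empty */

    static uint64_t *table;

    static uint64_t hash_phone(uint64_t phone) {
        phone *= 0x9e3779b97f4a7c15ULL;        /* cheap multiplicative mix */
        return phone ^ (phone >> 32);
    }

    /* How far a resident key sits from its home slot. */
    static uint64_t probe_dist(uint64_t slot, uint64_t key) {
        uint64_t home = hash_phone(key) & (TABLE_SIZE - 1);
        return (slot + TABLE_SIZE - home) & (TABLE_SIZE - 1);
    }

    /* Robin Hood insert: an incoming entry displaces a resident whose probe
     * distance is shorter, which keeps probe sequences short at high load.
     * (No update-in-place for existing keys in this sketch.) */
    static void rh_insert(uint64_t phone, uint16_t flags) {
        uint64_t entry = (phone & PHONE_MASK) | ((uint64_t)flags << 34);
        uint64_t slot = hash_phone(phone) & (TABLE_SIZE - 1);
        for (uint64_t dist = 0;; dist++, slot = (slot + 1) & (TABLE_SIZE - 1)) {
            if (table[slot] == EMPTY) { table[slot] = entry; return; }
            uint64_t rdist = probe_dist(slot, table[slot] & PHONE_MASK);
            if (rdist < dist) {                 /* take from the "rich" entry */
                uint64_t tmp = table[slot];
                table[slot] = entry;
                entry = tmp;
                dist = rdist;
            }
        }
    }

    /* Lookup can stop as soon as a resident is closer to home than we are. */
    static int rh_lookup(uint64_t phone, uint16_t *flags_out) {
        uint64_t slot = hash_phone(phone) & (TABLE_SIZE - 1);
        for (uint64_t dist = 0;; dist++, slot = (slot + 1) & (TABLE_SIZE - 1)) {
            if (table[slot] == EMPTY) return 0;
            uint64_t key = table[slot] & PHONE_MASK;
            if (key == phone) { *flags_out = (uint16_t)(table[slot] >> 34); return 1; }
            if (probe_dist(slot, key) < dist) return 0;
        }
    }

    int main(void) {
        table = calloc(TABLE_SIZE, sizeof(uint64_t));   /* the ~4GB table */
        rh_insert(9876543210ULL, 0x2);                  /* hypothetical entry */
        uint16_t f;
        return rh_lookup(9876543210ULL, &f) ? 0 : 1;
    }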
Absolute laziest approach: entries for all 10 billion possible numbers laid out sequentially. At 2 bytes for each entry, that's only 20GB. That's less than I can fit in some fairly cheap consumer box.
Would that work? Just an array in C. Or are there interesting performance problems I'd be likely to run into?
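A minimal sketch of that array-in-C idea, file-backed via mmap so the 20GB doesn't all need to be resident (the file name and flag values are hypothetical):

    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NUMBERS 10000000000ULL   /* every possible 10-digit number */

    int main(void) {
        /* 20 GB of 2-byte entries, indexed directly by the phone number.
         * The file starts sparse; pages are faulted in on first touch. */
        int fd = open("dnc.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)(NUMBERS * 2)) != 0) return 1;
        uint16_t *entries = mmap(NULL, NUMBERS * 2, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
        if (entries == MAP_FAILED) return 1;

        uint64_t phone = 9876543210ULL;   /* hypothetical number */
        entries[phone] = 0x0001;          /* set its preference bits */
        printf("flags for %llu: %u\n", (unsigned long long)phone, entries[phone]);

        munmap(entries, NUMBERS * 2);
        close(fd);
        return 0;
    }

The main "interesting performance problem" with this layout is that cold lookups are page faults, so access patterns and the page cache matter more than the code itself.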
Similar issue in doing LNP lookups in the US. There's about 500M entries, each mapping to a different LRN. Naively, that's 16 bytes per entry (8 for the number, 8 for the LRN), so 8GB.
First optimization is changing the LRNs to a lookup, as there's only ~30K uniques. So that cuts the data to 5GB. That's OK but it'd be better if it was smaller and quicker to load.
I implemented a delta-encoding system with multi-level indexing, ISAM style. Data's compressed on a "page" basis, like 4KB or so of data at a time. That way all data is compressed while in memory and lookups can be done on the compressed data directly. A header on each page provides the range it covers, so you can quickly skip to the next page if the index wasn't precise enough.
Delta encoding is neat here, because on average there'll only be a gap of 18 numbers from one entry to the next (~9B possible numbers / 500M). I used base128 encoding, but simple16 or another scheme (there are SIMD-optimized formats) would have been even better.
Using this method, the entire dataset only requires about 600-800MB of RAM.
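For reference, a minimal sketch of the base128 (varint) delta step, leaving out the page headers, indexing, and LRN table (function names are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    /* Append one delta as a base128 varint: 7 payload bits per byte,
     * high bit set on every byte except the last. */
    static size_t varint_put(uint8_t *out, uint64_t delta) {
        size_t n = 0;
        while (delta >= 0x80) {
            out[n++] = (uint8_t)(delta | 0x80);
            delta >>= 7;
        }
        out[n++] = (uint8_t)delta;
        return n;
    }

    static size_t varint_get(const uint8_t *in, uint64_t *delta) {
        uint64_t v = 0;
        size_t n = 0;
        int shift = 0;
        do {
            v |= (uint64_t)(in[n] & 0x7f) << shift;
            shift += 7;
        } while (in[n++] & 0x80);
        *delta = v;
        return n;
    }

    /* Encode a sorted run of numbers as gaps from the previous number;
     * with an average gap of ~18, most deltas fit in a single byte. */
    static size_t encode_page(const uint64_t *sorted, size_t count, uint8_t *out) {
        size_t used = 0;
        uint64_t prev = 0;
        for (size_t i = 0; i < count; i++) {
            used += varint_put(out + used, sorted[i] - prev);
            prev = sorted[i];
        }
        return used;
    }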
Assuming a saturated numberspace, 400M numbers is just 4% of all 10 billion possible #s. I doubt that the exceptions (911-xxx-xxxx) reduce the number enough to make a whitelist a better choice.
The thing that strikes me is that while the data being searched for has been optimized down to 2 bytes of bitfields, the same hasn't been done with the phone numbers - area codes in India appear to be variable-length (2, 3, ? digits) with that length being part of the 10 digits, and presumably mobile numbers are handled similarly either with their own area codes or overlapping. Breaking the data by area code would allow the elimination of significant chunks of the total possible address space, and further separating (as they do) into dense and sparse would further reduce the requirements (since tries, skip-lists, etc. have their own overhead).
It comes down to this: while the Indian phone system could in theory allow 10 billion phone numbers, there are not in fact 10 billion legally-allocatable phone numbers, and the rules in place likely reduce the legally-allocatable count to somewhere between 1-3 billion numbers. Dividing based on those rules may both reduce the problem to something requiring less engineering time and simplify possible marketing-related factors (e.g. selling access/systems targeted only to specific area codes).
And after a bit more thinking while doing something mindless, it seems that other interesting areas would be the distribution of digits within each of the first 4-5 positions and perhaps looking at the bit patterns of either run-length or Huffman encoding of the first 4-5 digits.
And separately, depending on the hit rate and particularly in the sparse sections, there might be situations where it would make sense to have a preliminary lookup that simply indicated whether there were any possible matches within a prefix range - before searching, get an overview of whether there's anything to search.
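That preliminary lookup could be as small as one presence bit per prefix; a minimal sketch, assuming a 6-digit prefix (the prefix length is an arbitrary choice for illustration):

    #include <stdint.h>

    /* One bit per 6-digit prefix: 10^6 prefixes -> ~125 KB, checked before
     * touching the real index. A zero bit means "nothing to search here". */
    #define NPREFIXES 1000000u
    static uint8_t prefix_bits[NPREFIXES / 8];

    static void mark_prefix(uint64_t phone) {
        uint32_t p = (uint32_t)(phone / 10000);   /* drop the last 4 digits */
        prefix_bits[p >> 3] |= (uint8_t)(1u << (p & 7));
    }

    static int prefix_may_have_matches(uint64_t phone) {
        uint32_t p = (uint32_t)(phone / 10000);
        return (prefix_bits[p >> 3] >> (p & 7)) & 1;
    }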
Overall this strikes me as something that could demonstrate the importance of having developers aware of the environment in which something will be used. If there are going to be 2 of something, throw hardware at it. If there are going to be 2,000 of "something" instead, an extra $1000 each in "throw hardware at it" could become a real issue.
We are using a constant database for a similar use-case (MNP database), albeit much smaller, around 15 million phone numbers. It requires almost no RAM and you update the database as a whole with no downtime.
I'm sure that works great for 15m. (EDIT: To be clear, I love CDB, and I'm not being sarcastic - for 15m it's probably perfectly fine; it's just space inefficient for large datasets where the combined key+value lengths are very low)
But even if we ignore the hash table pointer tables which can be made arbitrarily small at the cost of probing a higher number of entries on average before finding the right key (but in reality you'd want them to be fairly large), the minimum space used per entry is [1]:
8 bytes for the length of the key and length of the value + 5 bytes for the key + 2 bytes for the value, so 15 bytes per number, or 6GB (EDIT: fixed numbers to account for BCD encoding of number). Which means you'd need to switch to one of the (non-standard) 64-bit CDB-inspired formats, as CDB itself can only handle 4GB data files.
Maybe, but probably not: as a completely naive solution to this problem, a sorted array of 400M elements only takes 2.4 GB of RAM, or about $50 worth of RAM.
Basically you'd waste more time (and money) than it could possibly be worth.
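For concreteness, the naive sorted-array lookup is only a few lines (a minimal sketch; packing the number and its flags into 64-bit entries is an assumption, and tighter 6-byte records would get you to the 2.4 GB figure):

    #include <stdint.h>
    #include <stdlib.h>

    #define PHONE_MASK ((1ULL << 34) - 1)

    /* Entries are sorted by the 34-bit number in the low bits; the flags
     * live in the upper bits and are ignored by the comparison. */
    static int cmp_entry(const void *a, const void *b) {
        uint64_t ka = *(const uint64_t *)a & PHONE_MASK;
        uint64_t kb = *(const uint64_t *)b & PHONE_MASK;
        return (ka > kb) - (ka < kb);
    }

    /* Binary search for a phone number; returns NULL if it isn't listed. */
    static const uint64_t *find(const uint64_t *entries, size_t n, uint64_t phone) {
        uint64_t key = phone;   /* upper bits zero, so only the number compares */
        return bsearch(&key, entries, n, sizeof entries[0], cmp_entry);
    }

Build it once with qsort(entries, n, sizeof entries[0], cmp_entry) and each lookup is roughly 28 comparisons for 400M entries.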
I interned at a similar company; they had this sorted out way back then, using Bloom filters. This is the GitHub link if anyone is interested: https://github.com/mobmewireless/fast-filter
Earlier submission tanked on HN and the link didn't work, resubmitting with lookup timings added. Hopefully, readers will find it interesting this time.
Because the interesting part of the effort is doing it fast in anticipation of translating the optimization tricks to cases where you can't just go buy more RAM anymore. A lot of problems can be "solved" with current hardware because accesses or comparisons or whatnot become ridiculously cheap but that's not what computer science is about.
This is one of the biggest problems I run into in 'software engineering': people building way too complex systems because they are 'interesting' rather than buying $50 of RAM and being done with it, and then poo-poo'ing systems that cost $50 and work.
Querying a DNC list is not a problem in which you will ever be unable to buy more RAM; it's trivially parallel, and if for some reason DNC lists ever outpace Moore's law, just buy another system.
To be fair to the authors, at least they didn't do something ridiculous like build a 100 node cassandra cluster.
Funny; one of the most common complaints about the software industry (common on HN) is that people use inefficient languages or algorithms and then waste too much hardware.
And realistically, there's nothing you can do with phone numbers that even needs the full speed of RAM. A fast SSD can do enough random reads to load the entire database in under 20 minutes, and nobody actually needs to know the status of all 400M numbers in the same 20 minutes because they can't all be dialed that quickly.
A single fatcache can do close to 100K set/sec for 100 bytes item sizes.
A single fatcache can do close to 4.5K get/sec for 100 byte item sizes.
All the 8 fatcache instances in aggregate do 32K get/sec to a single 600 GB SSD.
Yeah, looking at the graph in the article is really weird. Wait, you can query 10k phone numbers per second with a single crappy linode machine, why does it need to be faster than that?
This is what tries [1] are for, as well as skip-lists [2]. Skip lists for a suitable sized prefix would likely be more space efficient than a trie, but some trie types such as radix-trees [3] with a sufficiently high radix value, could also be quite efficient.
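For example, a plain decimal-digit trie over the number's digits is only a handful of lines (a minimal sketch, and memory-hungry; a radix tree [3] collapses single-child chains and is far more compact):

    #include <stdint.h>
    #include <stdlib.h>

    /* Decimal-digit trie: one child per digit at each node. */
    struct trie_node {
        struct trie_node *child[10];
        uint16_t flags;       /* the 2 bytes of preference bits at the leaf */
        uint8_t  terminal;
    };

    static void trie_insert(struct trie_node *root, const char *digits, uint16_t flags) {
        struct trie_node *n = root;
        for (; *digits; digits++) {
            int d = *digits - '0';
            if (!n->child[d]) n->child[d] = calloc(1, sizeof(*n));
            n = n->child[d];
        }
        n->terminal = 1;
        n->flags = flags;
    }

    static int trie_lookup(const struct trie_node *root, const char *digits, uint16_t *flags) {
        const struct trie_node *n = root;
        for (; *digits; digits++) {
            n = n->child[*digits - '0'];
            if (!n) return 0;
        }
        if (!n->terminal) return 0;
        *flags = n->flags;
        return 1;
    }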
The described approach doesn't seem bad, though it's a bit amusing to see this described as if it's not a very well trodden area of computer science.
They can, but they certainly don't have to, especially not for a dataset like this, which is small enough to re-generate offline with data suitably clustered. And there's certainly no reason why it'd be worse than the described approach, which isn't really all that conceptually different. What the article described isn't necessarily bad, just unnecessarily ad hoc when there are plenty of suitable algorithms with well understood tradeoffs and implementations.
Is it worth mentioning? The straightforward approach of storing the numbers in a hash map in memory solves the problem. This may require, say, 40 GB of RAM w/o ANY optimizations.