Their solution seems over-engineered for the scale of data they're concerned with.
A 10 digit phone number can be encoded in 34 bits. Supposing they used 64-bit-wide fields to store each entry, that leaves 30 bits (more than their 2 bytes) to play with for flags or whatever target data. To store the whole dataset in a trie or tree-like structure on disk in this model, they would need 8 bytes * 400 million phone numbers < 3 terabytes of data. So a single 3TB or 4TB HDD ($130USD/ea) would suffice, and in the trie- or tree-like on-disk format that would give them an access/read time in the single-digit or low double-digit milliseconds, or below 1 millisecond for 3 striped 1TB SSDs ($300USD/ea), for example. The disk controller's cache and the host's file buffers/cache would work in their favor as well, since there are probably plenty of hotspots in the data. mmap could make the job of reading the file/disk even easier (and faster).
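A minimal sketch of that bit packing in C (the 34/30 split and the field names here are just for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* A 10-digit number fits in 34 bits (2^34 > 10^10); the remaining
     * 30 bits of a 64-bit word are free for flags/preference bits. */
    #define PHONE_MASK ((1ULL << 34) - 1)

    static uint64_t pack_entry(uint64_t phone, uint32_t flags) {
        return (phone & PHONE_MASK) | ((uint64_t)flags << 34);
    }

    static uint64_t entry_phone(uint64_t e) { return e & PHONE_MASK; }
    static uint32_t entry_flags(uint64_t e) { return (uint32_t)(e >> 34); }

    int main(void) {
        uint64_t e = pack_entry(9876543210ULL, 0x3);  /* hypothetical number/flags */
        printf("%llu flags=%u\n", (unsigned long long)entry_phone(e), entry_flags(e));
        return 0;
    }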
The data just isn't that big by today's standards.
Since false negatives are not possible for a properly implemented Bloom filter, using one in front of this on-disk approach would make the negative case even faster: the disk would only be hit for positive or false-positive cases, and the latter should be relatively rare.
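A minimal sketch of such a Bloom filter front-end, assuming double hashing and a bit array sized purely for illustration (you'd tune the size and probe count for a target false-positive rate):

    #include <stdint.h>
    #include <stdlib.h>

    #define NBITS (1ULL << 32)   /* 4 Gbit = 512 MB bit array; size is an assumption */
    #define NHASH 7              /* probes per key */

    static uint8_t *bits;

    /* SplitMix64-style mixer used to derive two base hashes per key. */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    static void bloom_add(uint64_t phone) {
        uint64_t h1 = mix64(phone), h2 = mix64(phone ^ 0x9e3779b97f4a7c15ULL) | 1;
        for (int i = 0; i < NHASH; i++) {
            uint64_t b = (h1 + (uint64_t)i * h2) % NBITS;
            bits[b >> 3] |= (uint8_t)(1u << (b & 7));
        }
    }

    /* 0 means definitely absent (skip the disk); 1 means maybe present (hit the disk). */
    static int bloom_maybe(uint64_t phone) {
        uint64_t h1 = mix64(phone), h2 = mix64(phone ^ 0x9e3779b97f4a7c15ULL) | 1;
        for (int i = 0; i < NHASH; i++) {
            uint64_t b = (h1 + (uint64_t)i * h2) % NBITS;
            if (!(bits[b >> 3] & (1u << (b & 7)))) return 0;
        }
        return 1;
    }

    int main(void) {
        bits = calloc(NBITS / 8, 1);
        bloom_add(9876543210ULL);
        return bloom_maybe(9876543210ULL) ? 0 : 1;
    }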
EDIT: It might be even simpler/better/faster than a trie/tree to store the data indexed by some hash function (even just prefix bucketing) and then sequentially scan that bucket's collision list for the target number. That would take advantage of the much better sequential read throughput of most commodity HDDs/SSDs, so that only one random read is necessary.
I must be sleep deprived. 1024^3 = gigabytes, not terabytes. Wow, then they've definitely over-engineered it, and disk isn't even necessary in the event of a 10-fold increase in the database size. Tries, search trees, or properly sized hash tables on phone number prefixes should be plenty fast and the whole thing can fit in RAM on a modern laptop...
Our solution maintains the data in a simpler scheme because keeping the data in a maintainable format is an important aspect when the data is ever changing - every couple of days millions of people update their preferences and we need to be able to apply these changes. When choosing a single data structure for all the numbers, we need to optimize either for space or for faster inserts/updates. Our solution is a trade-off between these options.
A hash table is more than compact enough. It should work fine even at high load factors (80%/90%) given a decent hash scheme (e.g. Robin Hood hashing), which means you need only 4GB of memory. It's random access and on average should have one cache miss per lookup. Updating is rather trivial.
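A minimal Robin Hood sketch along those lines, with each slot holding a packed 64-bit entry (the table size, hash mix, and 34-bit layout are assumptions for illustration):

    #include <stdint.h>
    #include <stdlib.h>

    #define PHONE_MASK ((1ULL << 34) - 1)
    #define TABLE_SIZE (1ULL << 29)   /* ~537M slots: ~75% load for 400M keys */
    #define EMPTY 0ULL                /* valid numbers are nonzero, so 0 can mean empty */

    static uint64_t *table;

    static uint64_t hash_phone(uint64_t phone) {
        phone *= 0x9e3779b97f4a7c15ULL;        /* cheap multiplicative mix */
        return phone ^ (phone >> 32);
    }

    /* How far a resident key sits from its home slot. */
    static uint64_t probe_dist(uint64_t slot, uint64_t key) {
        uint64_t home = hash_phone(key) & (TABLE_SIZE - 1);
        return (slot + TABLE_SIZE - home) & (TABLE_SIZE - 1);
    }

    /* Robin Hood insert: an incoming entry displaces a resident whose probe
     * distance is shorter, which keeps probe sequences short at high load.
     * (No update-in-place for existing keys in this sketch.) */
    static void rh_insert(uint64_t phone, uint16_t flags) {
        uint64_t entry = (phone & PHONE_MASK) | ((uint64_t)flags << 34);
        uint64_t slot = hash_phone(phone) & (TABLE_SIZE - 1);
        for (uint64_t dist = 0;; dist++, slot = (slot + 1) & (TABLE_SIZE - 1)) {
            if (table[slot] == EMPTY) { table[slot] = entry; return; }
            uint64_t rdist = probe_dist(slot, table[slot] & PHONE_MASK);
            if (rdist < dist) {                 /* take from the "rich" entry */
                uint64_t tmp = table[slot];
                table[slot] = entry;
                entry = tmp;
                dist = rdist;
            }
        }
    }

    /* Lookup can stop as soon as a resident is closer to home than we are. */
    static int rh_lookup(uint64_t phone, uint16_t *flags_out) {
        uint64_t slot = hash_phone(phone) & (TABLE_SIZE - 1);
        for (uint64_t dist = 0;; dist++, slot = (slot + 1) & (TABLE_SIZE - 1)) {
            if (table[slot] == EMPTY) return 0;
            uint64_t key = table[slot] & PHONE_MASK;
            if (key == phone) { *flags_out = (uint16_t)(table[slot] >> 34); return 1; }
            if (probe_dist(slot, key) < dist) return 0;
        }
    }

    int main(void) {
        table = calloc(TABLE_SIZE, sizeof(uint64_t));   /* the ~4GB table */
        rh_insert(9876543210ULL, 0x2);                  /* hypothetical entry */
        uint16_t f;
        return rh_lookup(9876543210ULL, &f) ? 0 : 1;
    }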
Absolute laziest approach: entries for all 10 billion possible numbers laid out sequentially. At 2 bytes for each entry, that's only 20GB. That's less than I can fit in some fairly cheap consumer box.
Would that work? Just an array in C. Or are there interesting performance problems I'd be likely to run into?
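A minimal sketch of that array-in-C idea, file-backed via mmap so the 20GB doesn't all need to be resident (the file name and flag values are hypothetical):

    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NUMBERS 10000000000ULL   /* every possible 10-digit number */

    int main(void) {
        /* 20 GB of 2-byte entries, indexed directly by the phone number.
         * The file starts sparse; pages are faulted in on first touch. */
        int fd = open("dnc.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)(NUMBERS * 2)) != 0) return 1;
        uint16_t *entries = mmap(NULL, NUMBERS * 2, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
        if (entries == MAP_FAILED) return 1;

        uint64_t phone = 9876543210ULL;   /* hypothetical number */
        entries[phone] = 0x0001;          /* set its preference bits */
        printf("flags for %llu: %u\n", (unsigned long long)phone, entries[phone]);

        munmap(entries, NUMBERS * 2);
        close(fd);
        return 0;
    }

The main "interesting performance problem" with this layout is that cold lookups are page faults, so access patterns and the page cache matter more than the code itself.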
Similar issue in doing LNP lookups in the US. There's about 500M entries, each mapping to a different LRN. Naively, that's 16 bytes per entry (8 for the number, 8 for the LRN), so 8GB.
First optimization is changing the LRNs to a lookup, as there's only ~30K uniques. So that cuts the data to 5GB. That's OK but it'd be better if it was smaller and quicker to load.
I implemented a delta-encoding system with multi-level indexing, ISAM style. Data's compressed on a "page" basis, like 4KB or so of data at a time. That way all data is compressed while in memory and lookups can be done on the compressed data directly. A header on each page provides the range it covers, so you can quickly skip to the next page if the index wasn't precise enough.
Delta encoding is neat here, because on average there'll only be a gap of 18 numbers from one entry to the next (~9B possible numbers / 500M). I used base128 encoding, but simple16 or another scheme (there are SIMD-optimized formats) would have been even better.
Using this method, the entire dataset only requires about 600-800MB of RAM.
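For reference, a minimal sketch of the base128 (varint) delta step, leaving out the page headers, indexing, and LRN table (function names are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    /* Append one delta as a base128 varint: 7 payload bits per byte,
     * high bit set on every byte except the last. */
    static size_t varint_put(uint8_t *out, uint64_t delta) {
        size_t n = 0;
        while (delta >= 0x80) {
            out[n++] = (uint8_t)(delta | 0x80);
            delta >>= 7;
        }
        out[n++] = (uint8_t)delta;
        return n;
    }

    static size_t varint_get(const uint8_t *in, uint64_t *delta) {
        uint64_t v = 0;
        size_t n = 0;
        int shift = 0;
        do {
            v |= (uint64_t)(in[n] & 0x7f) << shift;
            shift += 7;
        } while (in[n++] & 0x80);
        *delta = v;
        return n;
    }

    /* Encode a sorted run of numbers as gaps from the previous number;
     * with an average gap of ~18, most deltas fit in a single byte. */
    static size_t encode_page(const uint64_t *sorted, size_t count, uint8_t *out) {
        size_t used = 0;
        uint64_t prev = 0;
        for (size_t i = 0; i < count; i++) {
            used += varint_put(out + used, sorted[i] - prev);
            prev = sorted[i];
        }
        return used;
    }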
Assuming a saturated numberspace, 400M numbers is just 4% of all 10 billion possible #s. I doubt that the exceptions (911-xxx-xxxx) reduce the number enough to make a whitelist a better choice.
The thing that strikes me is that while the data being searched for has been optimized down to 2 bytes of bitfields, the same hasn't been done with the phone numbers - area codes in India appear to be variable-length (2, 3, ? digits) with that length being part of the 10 digits, and presumably mobile numbers are handled similarly either with their own area codes or overlapping. Breaking the data by area code would allow the elimination of significant chunks of the total possible address space, and further separating (as they do) into dense and sparse would further reduce the requirements (since tries, skip-lists, etc. have their own overhead).
It comes down to this: while the Indian phone system could in theory allow 10 billion phone numbers, there are not in fact 10 billion legally-allocatable phone numbers, and the rules in place likely reduce the legally-allocatable count to somewhere between 1-3 billion numbers. Dividing based on those rules may both reduce the problem to something requiring less engineering time and simplify possible marketing-related factors (e.g. selling access/systems targeted only to specific area codes).
And after a bit more thinking while doing something mindless, it seems that other interesting areas would be the distribution of digits within each of the first 4-5 positions and perhaps looking at the bit patterns of either run-length or Huffman encoding of the first 4-5 digits.
And separately, depending on the hit rate and particularly in the sparse sections, there might be situations where it would make sense to have a preliminary lookup that simply indicated whether there were any possible matches within a prefix range - before searching, get an overview of whether there's anything to search.
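That preliminary lookup could be as small as one presence bit per prefix; a minimal sketch, assuming a 6-digit prefix (the prefix length is an arbitrary choice for illustration):

    #include <stdint.h>

    /* One bit per 6-digit prefix: 10^6 prefixes -> ~125 KB, checked before
     * touching the real index. A zero bit means "nothing to search here". */
    #define NPREFIXES 1000000u
    static uint8_t prefix_bits[NPREFIXES / 8];

    static void mark_prefix(uint64_t phone) {
        uint32_t p = (uint32_t)(phone / 10000);   /* drop the last 4 digits */
        prefix_bits[p >> 3] |= (uint8_t)(1u << (p & 7));
    }

    static int prefix_may_have_matches(uint64_t phone) {
        uint32_t p = (uint32_t)(phone / 10000);
        return (prefix_bits[p >> 3] >> (p & 7)) & 1;
    }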
Overall this strikes me as something that could demonstrate the importance of having developers aware of the environment in which something will be used. If there are going to be 2 of something, throw hardware at it. If there are going to be 2,000 of "something" instead, an extra $1000 each in "throw hardware at it" could become a real issue.
We are using a constant database for a similar use-case (MNP database), albeit much smaller, around 15 million phone numbers. It requires almost no RAM and you update the database as a whole with no downtime.
I'm sure that works great for 15m. (EDIT: To be clear, I love CDB, and I'm not being sarcastic - for 15m it's probably perfectly fine; it's just space inefficient for large datasets where the combined key+value lengths are very low)
But even if we ignore the hash table pointer tables which can be made arbitrarily small at the cost of probing a higher number of entries on average before finding the right key (but in reality you'd want them to be fairly large), the minimum space used per entry is [1]:
8 bytes for the length of the key and length of the value + 5 bytes for the key + 2 bytes for the value, so 15 bytes per number, or 6GB (EDIT: fixed numbers to account for BCD encoding of number). Which means you'd need to switch to one of the (non-standard) 64-bit CDB-inspired formats, as CDB itself can only handle 4GB data files.
Maybe, but probably not: as a completely naive solution to this problem, a sorted array of 400M elements only takes 2.4 GB of RAM, or about $50 worth of RAM.
Basically you'd waste more time (and money) than it could possibly be worth.
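For concreteness, the naive sorted-array lookup is only a few lines (a minimal sketch; packing the number and its flags into 64-bit entries is an assumption, and tighter 6-byte records would get you to the 2.4 GB figure):

    #include <stdint.h>
    #include <stdlib.h>

    #define PHONE_MASK ((1ULL << 34) - 1)

    /* Entries are sorted by the 34-bit number in the low bits; the flags
     * live in the upper bits and are ignored by the comparison. */
    static int cmp_entry(const void *a, const void *b) {
        uint64_t ka = *(const uint64_t *)a & PHONE_MASK;
        uint64_t kb = *(const uint64_t *)b & PHONE_MASK;
        return (ka > kb) - (ka < kb);
    }

    /* Binary search for a phone number; returns NULL if it isn't listed. */
    static const uint64_t *find(const uint64_t *entries, size_t n, uint64_t phone) {
        uint64_t key = phone;   /* upper bits zero, so only the number compares */
        return bsearch(&key, entries, n, sizeof entries[0], cmp_entry);
    }

Build it once with qsort(entries, n, sizeof entries[0], cmp_entry) and each lookup is roughly 28 comparisons for 400M entries.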
I interned at a similar company; they had this sorted out way back then, using Bloom filters. This is the GitHub link if anyone is interested: https://github.com/mobmewireless/fast-filter
Earlier submission tanked on HN and the link didn't work, resubmitting with lookup timings added. Hopefully, readers will find it interesting this time.
Because the interesting part of the effort is doing it fast in anticipation of translating the optimization tricks to cases where you can't just go buy more RAM anymore. A lot of problems can be "solved" with current hardware because accesses or comparisons or whatnot become ridiculously cheap but that's not what computer science is about.
This is one of the biggest problems I run into in 'software engineering': people building way too complex systems because they are 'interesting' rather than buying $50 of RAM and being done with it, and then poo-poo'ing systems that cost $50 and work.
Querying a DNC list is not a problem in which you will ever be unable to buy more RAM; it's trivially parallel, and if for some reason DNC lists ever outpace Moore's law, just buy another system.
To be fair to the authors, at least they didn't do something ridiculous like build a 100 node cassandra cluster.
Funny; one of the most common complaints about the software industry (common on HN) is that people use inefficient languages or algorithms and then waste too much hardware.
And realistically, there's nothing you can do with phone numbers that even needs the full speed of RAM. A fast SSD can do enough random reads to load the entire database in under 20 minutes, and nobody actually needs to know the status of all 400M numbers in the same 20 minutes because they can't all be dialed that quickly.
A single fatcache can do close to 100K set/sec for 100 bytes item sizes.
A single fatcache can do close to 4.5K get/sec for 100 byte item sizes.
All the 8 fatcache instances in aggregate do 32K get/sec to a single 600 GB SSD.
Yeah, looking at the graph in the article is really weird. Wait, you can query 10k phone numbers per second with a single crappy linode machine, why does it need to be faster than that?
This is what tries [1] are for, as well as skip-lists [2]. Skip lists for a suitable sized prefix would likely be more space efficient than a trie, but some trie types such as radix-trees [3] with a sufficiently high radix value, could also be quite efficient.
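For example, a plain decimal-digit trie over the number's digits is only a handful of lines (a minimal sketch, and memory-hungry; a radix tree [3] collapses single-child chains and is far more compact):

    #include <stdint.h>
    #include <stdlib.h>

    /* Decimal-digit trie: one child per digit at each node. */
    struct trie_node {
        struct trie_node *child[10];
        uint16_t flags;       /* the 2 bytes of preference bits at the leaf */
        uint8_t  terminal;
    };

    static void trie_insert(struct trie_node *root, const char *digits, uint16_t flags) {
        struct trie_node *n = root;
        for (; *digits; digits++) {
            int d = *digits - '0';
            if (!n->child[d]) n->child[d] = calloc(1, sizeof(*n));
            n = n->child[d];
        }
        n->terminal = 1;
        n->flags = flags;
    }

    static int trie_lookup(const struct trie_node *root, const char *digits, uint16_t *flags) {
        const struct trie_node *n = root;
        for (; *digits; digits++) {
            n = n->child[*digits - '0'];
            if (!n) return 0;
        }
        if (!n->terminal) return 0;
        *flags = n->flags;
        return 1;
    }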
The described approach doesn't seem bad, though it's a bit amusing to see this described as if it's not a very well trodden area of computer science.
They can, but they certainly don't have to, especially not for a dataset like this, which is small enough to re-generate offline with data suitably clustered. And there's certainly no reason why it'd be worse than the described approach, which isn't really all that conceptually different. What the article described isn't necessarily bad, just unnecessarily ad hoc when there are plenty of suitable algorithms with well understood tradeoffs and implementations.
Is it worth mentioning? The straightforward approach of storing the numbers in a hash map in memory solves the problem. This may require, say, 40 GB of RAM w/o ANY optimizations.