Hacker News new | past | comments | ask | show | jobs | submit login

> 8 bytes * 400 million phone numbers < 3 terabytes of data

I get 3.2GB, which fits in RAM.

So yes, I guess, but why hit the disk at all? I'd probably just sort the data and do a binary search.




I must be sleep deprived. 1024^3 = gigabytes, not terabytes. Wow, then they've definitely over-engineered it, and disk isn't even necessary in the event of a 10-fold increase in the database size. Tries, search trees, or properly sized hash tables on phone number prefixes should be plenty fast and the whole thing can fit in RAM on a modern laptop...


Our solution maintains the data in a simpler scheme because keeping the data in a maintainable format is important aspect when the data is ever changing - every couple of days millions of people update their preferences and we need to be able to apply these changes. When choosing a single data structure for all the numbers, we need to optimize for space or for faster inserts/updates. Our solution is a trade-off between these options.


A hash table is more than compact enough. It should work fine even at high load factors (80%/90%) given a decent hash scheme (eg. Robin Hood hashing) which means you need only 4GB memory. It's random access and on average should have one cache miss per lookup. Updating is rather trivial.


Sounds promising. Little bit concerned on the computational cost of hashing function. Gotta explore this option!


I wouldn't be worried about hashing, either, given you effectively want to hash a 34 bit integer to a ⌈log 4e8⌉ = 29 bit integer.

An easy way is to take a 32 to 32 bit hash (eg. http://stackoverflow.com/a/12996028/1763356), calculate

    hash(top 32 bits) ^ hash(bottom 32 bits)
and take the bottom 29 bits of that. I wouldn't be surprised if you can do this in less time than an integer division.


Good catch. I've recently resorted to using wolfram alpha for this sort of math, since it does an excellent job of keeping units straight:

http://www.wolframalpha.com/input/?i=64+bits+*+400+million




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: