Hacker News new | past | comments | ask | show | jobs | submit login

I tried it out and the maximum deviation of the actual position in the file from the estimated one is a bit under five megabyte for the version 8 file. That is not good enough to just fetch a single page by quite a bit. So I guess building an index is the only real option to get the number of reads down, but even then you will have at least one read. With an SSD that will take on the order of a few hundred microseconds, so it can be done under a millisecond without relying on a cache. Binary search will require about 20 to 30 reads, so without a cache it would be about that much slower, however the first couple of steps will quickly get cached, reducing the number of reads that actually hit the hardware quite a bit.



> Binary search will require about 20 to 30 reads,

This does not pass a sanity check. If one switches to linear scan at 4KiB = 2^12 bytes, then 30 (balanced, but approximately correct for unbalanced) binary search steps would search a 4TiB file. I don't know how big the hash file is but it isn't this big. Not even 4GiB big.

The maximum number of steps is lg(file size)-12, maybe plus 2 or so to account for the unbalanced search. (NB the unbalanced search approximates the balanced only because the hashes are approximately uniform; the relation doesn't hold in general.)


The version 8 file contains 847,223,402 hashes which gives 29.7 bits, that is where I got the 30 reads from. And I assumed the most naïve approach, always just seeking and reading a single hash, but admittedly this would almost certainly not read the final page over and over again. And then I just subtracted 10 taking into account that you would read the final page only once, with 10 being a rough approximation of 12. How many bytes you actually read per read could be affected by all kind of things - sector size, cluster size, page size, read ahead, ... - so 10 bit was good enough for me. But you are right, the number of reads would almost certainly be in the vicinity of 20, not 30.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: