
...yes? I mean, how else would you do it?



I think you don't understand what you're saying.

If you were to judge the contents of a library by the size of the largest items on the shelves (exactly as you have done with torrents), you would come away with the mistaken impression that it consisted primarily of dictionaries and boxed sets of language learning CDs. In fact, these items represent a very small portion of the catalog.


I agree that it's a crude measure, but I don't think the situation is quite as bad as you make out; your intuition about paper libraries is misleading you.

What's being counted as a single item here is not a single bound volume of a chemistry journal, nor the entire archive of Bioconjugate Chemistry, but rather the entire chemistry-journals wing of the library: 539 gibibytes, including 226 different journals. By comparison, the latest five items on http://webcache.googleusercontent.com/search?q=cache:http://... are 3.7MiB, 11.7MiB, 350MiB, 730MiB, and 260MiB; the chemistry-journals library is some 2000 times the size of the median of these and roughly 150 000 times the size of the smallest, which happens to be a two-volume book called "Great Moments in Mathematics".
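For anyone who wants to check the ratios, here is a minimal sketch of the arithmetic in Python. It assumes binary units (1 GiB = 1024 MiB) and uses only the five sizes quoted above; the function name is just for illustration.

```python
from statistics import median

MIB_PER_GIB = 1024  # assumption: binary units throughout


def size_ratios(big_gib, sample_mib):
    """Return (ratio to the sample median, ratio to the sample minimum)."""
    big_mib = big_gib * MIB_PER_GIB
    return big_mib / median(sample_mib), big_mib / min(sample_mib)


# The 539 GiB chemistry collection against the five recent items (in MiB).
to_median, to_smallest = size_ratios(539, [3.7, 11.7, 350, 730, 260])
print(f"{to_median:.0f}x the median, {to_smallest:.0f}x the smallest")
```

Five items is of course a tiny sample, so the median here is only a rough reference point.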

It turns out that when you have a power-law distribution spanning five orders of magnitude, like the one that characterizes file sizes, rather than the much narrower distribution that characterizes book sizes, you actually can get a useful approximation of the makeup of the total by looking at the makeup of only the largest items. It's surely not an unbiased estimator, but it's still a useful one.
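To make the heavy-tail point concrete, here is a rough sketch with synthetic data, not the actual torrent sizes. The Pareto shape parameter (1.1), the uniform range, and the sample size are all assumptions picked for illustration; the only thing it shows is how much of the total the largest 1% of items holds under each kind of distribution.

```python
import random


def top_share(sizes, fraction=0.01):
    """Fraction of the total size held by the largest `fraction` of items."""
    ordered = sorted(sizes, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Heavy-tailed "file sizes": Pareto with an assumed shape of 1.1.
heavy = [rng.paretovariate(1.1) for _ in range(n)]
# Narrow "book sizes": assumed uniform between 0.5 and 1.5 units.
narrow = [rng.uniform(0.5, 1.5) for _ in range(n)]

heavy_share = top_share(heavy)
narrow_share = top_share(narrow)
print(f"top 1% share, heavy-tailed: {heavy_share:.1%}")
print(f"top 1% share, narrow:       {narrow_share:.1%}")
```

Under these assumptions the largest items carry a large part of the byte total in the heavy-tailed case and almost none of it in the narrow case. That only speaks to totals by size; it says nothing about the makeup by item count.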

Feel free to invest the work to do a better approximation.


We do not agree. I am not saying it's a crude measure; I'm saying you're measuring the wrong thing. File size is the wrong thing to measure. It doesn't matter what estimate of file sizes you can come up with, because file size is the wrong thing to measure.

Unless one is loading a moving van or trying to estimate the number of shelves required to store it, one characterizes the contents of a library by the items in the catalog and their subject matter, not by the volume they consume. You don't go in and ask for "a cubic foot of books" any more than you torrent "a megabyte of music".



