Hacker News new | past | comments | ask | show | jobs | submit | jolyon's comments login

Wait a second. If the IDs are all allocated in a contiguous block, and the author samples the whole space at random—then the estimate will converge to the correct value. If, on the other hand, the IDs are allocated at random, the estimate will also converge to the correct value.

If you believe that there is some structure in YouTube video IDs, that would have no effect on this experiment. It would just reduce the fraction of the total address space that YouTube can use. This is a well-known property of "impure" names, and it means there is a good chance that the IDs have no structure. In other words, the video IDs would be "pure" names.


Is 32,000 a good enough number to estimate the entirety of the Youtube’s video space? It felt to little for what they are trying to accomplish (especially when they started doing year by year analysis)


32000 is just the "cheat factor" by which they increase the method's efficiency.

I'm not sure how much the "cheating" would affect the precision of the result. But assuming it has no effect, it's easy to estimate this precision:

They found X = 24964 videos in a search space of size S = 2^64. For the number of existing videos they report the estimate N = 13,325,821,970. From this we can find their estimate for the probability that a particular ID links to a video: p = N / S ≈ 7.22e-10. So the equivalent number of IDs that they have checked (the number of checks without cheating that would give the same information) is n = X / p ≈ 3.46e13.

Since X is a Binomial, its variance is Var(X)=n⋅P(1-P) (where P is the real proportion corresponding to the estimate p above). And N = X⋅S/n so its variance is Var(X)⋅S^2/n^2. The standard deviation of N is thus σ = S⋅sqrt(P⋅(1-P)/n). Now we don't know P but we can use our estimate p instead to find an estimate of σ!

We find that the standard deviation of their estimator for the number of YouTube videos is approximately S⋅sqrt(p⋅(1-p)/n) ≈ 8.43e7. That's just 0.633% of N so their estimate is quite precise.


When you're estimating a ratio between two outcomes, the rule of thumb is that you want at least 10-100 samples of each outcome, depending on how much precision you want.

They got 10,000 samples of hits, and a huge number of samples of misses. Their result should be very accurate. (32,000 was a different number)



Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: