Estimating the chances of something that hasn’t happened yet (johndcook.com)
575 points by devy on Nov 15, 2018 | 150 comments



This reminds me of a different "rule of 3": If you want to compare two things (e.g., "is my new code faster than my old code"), a very simple approach is to measure each three times. If all three measurements of X are smaller than all three measurements of Y, you have X < Y with 95% confidence.

This works because the probability of the ordering XXXYYY happening by random chance is 1/(6 choose 3) = 1/20 = 5%. It's quite a weak approach -- you can get more sensitivity if you know something about the measurements (e.g., that errors are normally distributed) -- but for a quick-and-dirty verification of "this should be a big win" I find that it's very convenient.
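
A minimal sketch of how you might run that check in practice, using Python's timeit; old_impl and new_impl are hypothetical stand-ins for the two versions being compared:

    import timeit

    def old_impl():
        sum(i * i for i in range(10_000))   # hypothetical "old" code

    def new_impl():
        sum(i * i for i in range(10_000))   # hypothetical "new" code

    old_times = [timeit.timeit(old_impl, number=100) for _ in range(3)]
    new_times = [timeit.timeit(new_impl, number=100) for _ in range(3)]

    # All three new measurements below all three old ones -> ~95% confidence.
    if max(new_times) < min(old_times):
        print("new looks faster (rule-of-3 win)")
    else:
        print("no clear win")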


I agree that based on combinations (n! / k!(n-k)!) it seems to be 1 in 20, but when you think of it as running X and Y and checking which one is faster, 3 times, you get 1 in 8. Quite a big difference. Where does it come from? Am I doing something wrong? I mean, the combination approach counts tied times as a win, but given enough timing precision we could skip this scenario.

edit: Ah, got it, in my case 3rd result for X can be higher than the first result for Y. You said all times must be smaller.


Let's say there are 6 trials -- name them x1, x2, x3, y1, y2, y3.

In the case of checking if x is faster than y 3 times, you're doing x1 < y1 AND x2 < y2 AND x3 < y3.

In the case of checking if all three measurements of x are smaller than all three measurements of y, you're checking x1 < y1 AND x1 < y2 AND x1 < y3 AND x2 < y1 AND x2 < y2 AND x2 < y3 AND x3 < y1 AND x3 < y2 AND x3 < y3.

In other words, the latter case is checking whether the slowest x of three trials is faster than the fastest y of three trials.
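
A quick simulation sketch of the two checks, drawing both sets of timings from the same distribution (the null case), to see the 1/8 vs 1/20 difference:

    import random

    def pairwise(xs, ys):
        # x1 < y1 AND x2 < y2 AND x3 < y3 -> 1/8 under the null
        return all(x < y for x, y in zip(xs, ys))

    def all_below(xs, ys):
        # slowest x faster than fastest y -> 1/20 under the null
        return max(xs) < min(ys)

    trials = 100_000
    draws = [([random.random() for _ in range(3)],
              [random.random() for _ in range(3)]) for _ in range(trials)]
    print(sum(pairwise(xs, ys) for xs, ys in draws) / trials)   # ~0.125
    print(sum(all_below(xs, ys) for xs, ys in draws) / trials)  # ~0.05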


You're right. I actually edited before you replied. Should have deleted it or preferably think more before typing. But now it's forever.


I'm glad you kept it. The explanation was enlightening to me.


Man, I wish I understood frequentist statistics well enough to know if your reasoning makes sense. Bayesianly, if O = "the XXXYYY ordering" and F = "algo B is faster than algo A", then

    P(F|O) = P(O|F) * P(F) / P(O)
then... what? There isn't even a clear P(O|F) likelihood without making assumptions about the process' noise. If the measurement is very noisy compared to the gain, then XXXYYY is just dumb luck, and doesn't tell you anything. If there is no noise at all, then just getting XY is enough to make a decision.


The OP is making some short cuts. 1/(6 choose 3) means that they start with the assumption that every possible ordering is equally likely. If that is the case, then the odds that XXXYYY pops out is 5%. What happens if we change our assumption? Let's assume that XXXYYY is more likely. This would imply that algo B is faster than algo A. If we assume that any of the other (or all of the other) combinations are more likely, then this reduces the odds that XXXYYY would pop out.

This means that either we had it right (algo B is faster than algo A), or the result we saw was at least as unlikely as we predicted. That's what it means to have a "confidence interval".

I'll leave the Bayesian version to someone else because I don't really trust myself to do it.


> they start with the assumption that every possible ordering is equally likely

If the times in each run are independent, this assumption is (for some distributions) the weakest form of "X is not faster than Y."

For a distribution where this is not the case: assume

- X always runs in 999 seconds, and

- Y runs in 1000 seconds in 99% of runs and in 0 seconds the other 1% of the time.

Then XXXYYY is a very likely ordering (~97% chance), though Y runs faster than X "on average". (Not in median or mode though.)

For a more concrete example: say you have two sorting algorithms, one that is a little slower most of the time, but worst case O(n log n), and another that is usually a bit faster but can be O(n^2) on pathological input.


> then XXXYYY is just dumb luck

That's the idea. The null hypothesis is that this is dumb luck. Here's a piece of evidence that is not very likely if it really is dumb luck. Ergo, the odds this is just dumb luck are low and there may be a real effect here.

As the OP said, it's not meant to be used in a scientific paper, it's to let you see if you might be on the right track.


Let me explain it in a Bayesian way. The Bayesian approach would be to consider a whole bunch of hypotheses about how much better Y does than X (and vice-versa), and a prior distribution over how likely you think each is a-priori. Then for each sample S, you update the probability of each hypothesis H by multiplying its probability by P(S|H), then re-normalize.

Well in this case, we're only going to consider two hypotheses. Call them H0 and H1. H0 is the "null hypothesis", and says that X and Y are exactly as fast as each other. H1 is the hypothesis that H0 is wrong and Y is totally faster than X. You'll notice that H0 is oddly specific, and H1 is ill-defined. Don't worry about it.

To start off, pick your prior distribution over H0 and H1. Pick whatever you want, because we're going to ignore it shortly.

Now some evidence comes in. Time to update! We got the ordering XXXYYY. First, let's update H0. P(XXXYYY|H0) = 1 / 6choose3 = 5%. Wow, that's not a very good update for H0. It's probably just false. For expediency, let's just toss it out.

H1 is the remaining hypothesis. H1 wins! Y is faster than X.


“P(XXXYYY|H0) = 1 / 6choose3 = 5%. Wow, that's not a very good update for H0. It's probably just false. For expediency, let's just toss it out.”

It only makes sense to toss H0 out if P(XXXYYY|H1) >> 5% (such that the evidence for H1 relative to H0 increases after the observation).

You are implicitly assuming that’s the case because “it makes sense”. But as the parent post mentioned, the likelihood is not defined and in particular if the noise in the observation process is large enough, P(XXXYYY|H1) may be very close to 0.05 as well.


Yes: for that reason and others the whole approach doesn't make much sense from a Bayesian perspective. I was trying to point that out by running with it.


> There isn't even a clear P(O|F) likelihood without making assumptions about the process' noise.

Correct. I believe standard procedure would be to assume that time for algo A/B are normally distributed, put non-informative priors on the parameters, then integrate over the space where F is true. I think the non informative prior for a normal distribution is P(mu, sigma^2) \propto 1 / (sigma^2) - it's harder than this, because you know that the runtime of a program is > 0, and maybe you think that sigma is likely to be quite a bit less than mu. If you choose your prior wisely, I suspect you will actually get something close to 1 / 20, the added uncertainty will come from when |mu_A - mu_B| is roughly less than max(sigma_A, sigma_B), but even there, most of it will cancel out.

A normality assumption may be reasonable - if you're assuming that the run-time of the code is dominated by the addition of a bunch of independent operations on roughly the same timescale, that's what you will get - but it will fall down if there is a small number of steps which dominate (e.g. http requests or garbage collection or something). If you don't want to assume some kind of parametric prior like that, you need to go into non-parametric Bayesian stats, and you'll end up with a lot more uncertainty.


Going down this rabbit hole eventually leads you to nonparametric statistical tests, e.g. Mann-Whitney-U and so on.


Right. And those are also more powerful (and don't need any knowledge of the distribution of measurement errors).

I mentioned the "three old and three new" test because it's simple, not because it's powerful.


More powerful in the statistical sense? I thought non-parametric tests were usually less powerful than those where a certain distribution is assumed.


You would be surprised what the number of samples does to these tests. Take the t-test -- the most powerful test for checking whether the means of two equal-variance Gaussians differ. The asymptotic efficiency of Mann-Whitney (a distribution-free test) relative to the t-test is around 0.96. Of course, in practice you will not have infinite samples, so it comes down to when these asymptotics kick in. Unfortunately, that depends on the distribution.


That depends on whether the assumed error distribution is accurate. Obviously, if the distribution is known, you will do better by incorporating it. But assuming Gaussian errors when the data doesn't match can lead to bad analyses.


My personal favourite quick-and-dirty trick: A quick way to estimate any distribution's median is to draw 5 random samples. There's a >90% chance that the median is between the biggest and the smallest value.


The final bit is complete bunkum for an unknown distribution. Whether the distribution is monotonic, and if not, how symmetric, is the major determinant of the occurrence of such a result. Giving a flat number is completely bogus.


I don't understand what you mean. It doesn't actually matter what shape the distribution takes. That figure is derived entirely from what the "median" means, i.e. a random sample has 50% chance of being greater than the median, and 50% chance of being lower than it.

A bit of maths would show that the probability that all 5 samples lie entirely above or entirely below the median (i.e. the median is NOT between the greatest and the smallest value) is 1/16. That gives you a 15/16 (= 93.75%) chance that the median is contained within the bounds.
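
A simulation sketch checking that 15/16 figure on a deliberately skewed distribution (exponential, whose true median is log 2):

    import math, random

    true_median = math.log(2)   # median of Exponential(rate=1)
    trials = 100_000
    hits = sum(
        min(s) < true_median < max(s)
        for s in ([random.expovariate(1.0) for _ in range(5)] for _ in range(trials))
    )
    print(hits / trials)  # comes out near 15/16 = 0.9375 regardless of the shape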


I wish the people constantly complaining that n=20,000 is "far too small a sample size to call this science" (for every empirical study) would take note. Effect size matters!


An absence of statistical training is dangerous. A little statistical training remains dangerous, but gets annoying as well


Is this a good method to use while trying out a lot of ideas? Usually when I am optimizing code, most of the ideas don't work out and performance remains roughly the same (or so I think - I don't really know and want a better workflow here).

But if you do this test repeatedly, even if the code had identical performance, you'll get a false positive 5% of the time. And depending on the spread of the timings you might not get a clear XXXYYY win even when it is indeed a minor improvement.


sqlite famously squeezed out a ~40% performance improvement (I think from v3 to v4?) by just combining tons of micro-optimizations of this kind where it wasn't obvious if each change even made an improvement. They measured the performance with cachegrind in order to identify very small improvements that get lost in the normal measurement noise.


Also check fishcooking: http://tests.stockfishchess.org/tests which is how Stockfish became the indisputably strongest chess engine: Everyone can commit patches which are then tested in thousands of games, and if they make Stockfish stronger, they are accepted.


That’s a really awesome idea. Also, maybe a few of the performance boosts in the final combination are mutualistic, but if measured independently would not make a difference.


I would love to read more about this. Do you have a link?



Statistics is a tricky beast. If 3 runs gives you a 95% chance that your change is an improvement, and you use this process on ten different improvements, there's a 40% chance that at least one of them is bogus.


Would love to see a refutation from a downvoter.


I believe you have it right. The odds that all 10 changes are improvements: (.95)^10. The odds that at least one change is not an improvement: 1 - (.95)^10 ≈ 0.40.

One could probably make an argument that not all trials are independent, but not sure what else down-voters saw.
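
Spelled out as a sketch, under the same independence assumption:

    p_single_ok = 0.95              # confidence in each individual conclusion
    p_all_ok = p_single_ok ** 10    # all ten conclusions correct
    print(1 - p_all_ok)             # ≈ 0.40: chance at least one is bogus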


It is not a good method to use in that case.


> If the all three measurements of X are smaller than all three measurements of Y, you have X < Y with 95% confidence.

On a multitasking system, you should use the fastest benchmark run (assuming you're running the same code on the same data in all cases, and if you're not then you're not really benchmarking). This will be the one that is least influenced by any other processing going on.


That's the case when you think the measurement noise is only positive. But are there kinds of noise that reduce your benchmark time?

On one hand, I believe a piece of code has a true benchmark time, but we only measure noisy versions that are skewed positive due to multitasking noise. On the other hand, it seems important to make benchmark measurements in the context of the whole system. If A beats B in the best case (unloaded system), but doesn't win in the average case (taking the context into account), then that's important information.


The way I think about it is that benchmarks include deterministic operations (that happen every time) and nondeterministic operations (that don't). And all of these numbers are positive, since there are few things in the world that take negative time.

By taking the minimum, we focus attention on the deterministic operations. This isn't likely to be realistic for production, but production won't look much like your benchmark hardware anyway.

When making code changes, deterministic operations are the part you have most influence over and have the most impact. Removing an unnecessary, deterministic operation will speed up every run, not just some of them.


In the specific case of code, it is worth noting that runs are typically not independent, because of caching.

It is very common for the first run to be slower. It means that by "random chance" slow-slow-slow-fast-fast-fast is more likely than fast-fast-fast-slow-slow-slow. I personally tend to discard first runs as outliers when profiling.
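
A sketch of that habit as a tiny harness: do one warm-up run (discarded) before measuring, with `work` as a hypothetical stand-in for the code under test:

    import timeit

    def work():
        sum(i * i for i in range(10_000))

    work()  # warm-up run: populate caches, trigger imports, then discard
    times = [timeit.timeit(work, number=100) for _ in range(3)]
    print(times)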


This is incorrect: there's no reason to expect that X and Y will each appear 3 times in 6 trials if their probabilities are equal. If all 3 measurements of X are smaller than all 3 measurements of Y, then you have X < Y with confidence 1 - 1/8 or 87.5% confidence. You'd need at least 5 measurements to be 95% confident.


You're not considering the right probability space. We have 3 measurements of X and 3 of Y. The question is the distribution on orderings of these six measurements. If X and Y come from the same distribution then all orderings are equally likely.


Ah, I failed to consider that. The original post is correct.


I'm also having trouble with this. On the face of it, the quick and dirty "XXXYYY" test outlined above looks good, but are these two following statements consistent? I.e., are the run times of X (new code) and Y (old code) really from the same distribution?

"is my new code faster than my old code"

"If X and Y come from the same distribution then all orderings are equally likely"


I'm thinking of it as a statistical hypothesis test. The null hypothesis is that they come from the same distribution. Under that hypothesis, there's only a 0.05 chance of seeing three X tests all below three Y tests. So if we see this, we can probably reject the null.

If we think X and Y distributions are both something like normal with similar variance, then we should also be able to say the chance of XXXYYY given Y is better than X is at most 0.05.

But if the distributions for X and Y can be really different, then I think you're right -- this test could be misleading! For example, say Y always takes 2 seconds, and X takes 1 second 99% of the time, but 1% of the time it takes an hour. If we run three tests of each, we'll probably only see good runs from X and conclude it's better, when it's not.
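
A simulation sketch of that pathological case:

    import random

    def sample_x():
        # 1 second 99% of the time, an hour the other 1%
        return 3600.0 if random.random() < 0.01 else 1.0

    trials = 100_000
    wins = sum(
        max(sample_x() for _ in range(3)) < 2.0   # all three X runs beat Y's fixed 2 s
        for _ in range(trials)
    )
    print(wins / trials)  # ≈ 0.99**3 ≈ 0.97, even though X's mean (~37 s) is far worse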


Well, you’re the one measuring each three times. So this should always be the case?


If the author is reading this: When someone provides you a pro-bono translation, by all means credit them, but do not let them host it on their own site.

Frequently, they are siphoning your PageRank. They will eventually replace the translation with monetized content of their choice. A good translation takes work! If you got a translation for free, why should you believe it's good, or that it has no ulterior motive?

A firm called "WebHostingGeeks" used to do this all the time. They would offer free translations of blog posts, into languages the author probably didn't speak (and they didn't either, they were just using machine translation). They'd ask authors to link to the translation on their site, and over time they would add their SEO links to the translation.

I first noticed this when WebHostingGeeks offered me a Romanian translation of ConceptNet documentation, my roommate spoke Romanian, and he said "maybe I'm not used to reading technical documentation in Romanian but I think this is nonsense".


I am Italian and can confirm that the translation is perfectly fine. At the very beginning of the translation, in italics, the translator states that he follows Endeavour because on that blog he finds interesting articles about statistics and programming. After that he gives full credit to the author and links back to the original post. To me there's nothing suspicious or malicious about it.


I'm glad that this is a good example of a translation, and not a trick.

I still think that if a blog post author says "this is a translation of my post", the author should host it as a sibling page on their own server. That allows them to more reliably vouch for its content, and removes an avenue that can be used for trickery.


But this one doesn't look bad; the Italian translator even goes beyond the article by providing visualizations of the concept.


You are exceedingly suspicious for no reason. People just like to translate articles that they enjoyed reading; the vast majority of the time that's the only motive.


The reader is not the culprit. The website translating the content for free could have its own good or bad intentions.


The obsession with SEO is what leads to things like this in the first place.

Maybe the sooner people stop paying attention to it, the sooner search engines will learn to find better metrics.


You mean better metrics that can be optimized?


Yeah this is weird. Also, if OP doesn't speak Italian themselves, how can they attest to the quality of the translation?


I have a couple L2’s that I mostly speak and read, virtually no writing - my grammar is poor and my vocabulary is limited. Because of this I would feel uncomfortable/embarrassed to translate any tech stuff I wrote to it. However, I could read someone else’s translation and know if it’s way off the mark. Also, I have plenty of friends that speak the language natively, so I could ask them to review. If they say it’s good, I’d vouch for the translation because I trust them (even if I couldn’t read it at all).


Oh ok cool :)


I'm Italian; for what it's worth, the translation seems fine, even if the writing quality is not as good as the source and maybe the joke about "Bayesians being allowed to make such statements" is a bit lost.


What the author glosses over somewhat is the method of sampling. If you read the first 20 pages, find no typos, and use this rule to arrive at 15%, that could be way off. He's assuming the risk of typos is evenly distributed, when there are a lot of reasons it may not be. For example, the first half of the book could've been more heavily proof-read than the latter half. It's not out of the question that editors get lazier the farther they get into the book.

If you were to randomly read 20 pages in a book and find no typos, 15% probability makes more sense.

It's understandable to not mention this in a short blog post about the rule of three, but never forget that when you're interpreting statistics...how you build your sample matters.


In the case of a book, the first pages are likely to be significantly better than the rest. An editor knows that once you have invested in reading the first part of the book you are unlikely to put the book down. This means the first pages have to suck you in, and not do anything to make you want to quit reading. The most important part is the first sentence, as this is often the only part that drives your buy/leave decision in the store.


>The most important part is the first sentence as this is often the only part that drives your buy/leave in the store decision.

I usually sample books from the middle. I wonder what the heatmap of bookstore reading really looks like?


My informal sampling suggests most people start at first page. They worry that reading the middle or end would reveal a spoiler. It would be interesting to get good data though.


Doesn't the book example need to factor in the book length? If you read 20 pages, and it's 20 pages long, you should be 100% confident, but if you read 20 pages and it's 5000 pages long, your confidence should be near 0.


Not in this case, because the probability being estimated is "the chance a single page could contain a typo" and not "the chance the rest of the book could contain a typo."

To do the latter, you'd need to know the length of the book, yes.


In effect it does. If you have read 20 pages and found no typos, you estimate the chance of an unread page having a typo at 3/20. But if there are 0 unread pages, then there are 0 cases to apply the probability to. Make it a 100-page book, and you have 80 remaining cases, each with an estimated 3/20 probability. Of course, if you read 30 more and still don't find a typo, it's now 50 remaining cases, each with an estimated 3/50 probability.


Logically not, since your sampling method is imperfect and will not detect all typos and, in fact, some typos might make perfect sense given the context and be inherently undetectable forever.


Could we apply the rule of three for the larger context here? Something along the lines of: "if I haven't seen a book with unequal typos distribution yet, what are the odds this current one will be an instance where editors got lazier in the second half?"


You could, though that wouldn't be a very powerful test. You could probably construct a more powerful test if you're going through the trouble of reading that many books.


> It's understandable to not mention this

I feel like the very first word, "Estimating" made what you're talking about pretty clear. It's extremely common for estimates to make simplifying assumptions that don't necessarily hold if you care about accuracy or specific situations. He also called the rule of three "quick and dirty".

> If you were to randomly read 20 pages in a book and find no typos, 15% probability makes more sense.

OTOH, if typos are actually uniformly distributed, then reading the first 20 is better than a random 20 unless you take care to pick random pages without repeats.


Interesting that you would worry about repeats. Most people would interpret "randomly read 20 pages" as "choose 20 from n" not "take 20 rand(n)".


You're right, and mathematicians and computer scientists aren't most people. We are being pedantic / precise here. If someone wants to caution that people need to be really careful about estimates because there are some assumptions involved, then it's fair to hold them to their own standard, isn't it? If typos are uniform, then the suggestion to read random pages doesn't help. And whether or not people interpret it one way most of the time, if you interpret "random" like a programmer, you've introduced a small amount of bias, right?


As a programmer it still never would have occurred to me to interpret it that way, but maybe I'm not typical. If your random library doesn't have a choose function, it's easy enough to just shuffle a list and take the first 20.


In the same vein: if the weather is good today, there is a large probability that it's good tomorrow, but if it's been good for 14 days there is an increasing probability of bad weather.


If the process is truly random however, then that no longer holds and you risk falling prey to the gambler's fallacy. If you flip a coin and get heads 14 times, you're still equally likely to get heads or tails the 15th time, despite the probability of getting heads 15 times in a row being very low.


Why would this be true? What if we, by luck, had 14 days of good weather at the end of monsoon season? Your observation seems to be a function of the probabilities changing as the calendar date advances and not a historical run of good weather in the past.


> Why would this be true?

It's not impossible, just location dependent. In some places, there actually is a clear cycle, where the weather alternates between good and bad on a weekly cycle. The Bay Area of California, where HN is headquartered, seems to be one of those places:

The Fog Cycle

Week by week from spring into August, the forces that produce the fog increase in intensity. The Pacific High moves farther north, closer to the latitude of San Francisco, sending out stronger winds; offshore the up-welling of cold bottom waters increases, condensing the winds’ moisture into thicker masses of fog; in the Central Valley, the northward-moving sun sends temperatures to the 100 degree mark and beyond. The hot air rises, sucking cool masses in great drafts through the only break in the Valley’s surrounding mountains, San Francisco Bay. With the ocean air comes the fog, evaporating gradually in the hot, dry air of the Valley, but sometimes penetrating at night as far as Sacramento and Stockton.

The fog seems to come and go in cycles. Until recent years, the conventional explanation for the fog’s behavior was a simple one: as the cool, fog-bearing ocean air is pulled over the coastal hills and across the Bay toward the hot Central Valley (that is, from a high-pressure to a low-pressure area), the nearest parts of the Valley begin to cool off after a few days, much as a draft from an open door lowers the temperature in a warm room.

The incoming cool, heavy sea air replaces the warm, rising land air, and temperatures in Sacramento and Stockton may drop from well above 100 degrees to the “cool” 90s. When the Valley cools sufficiently, the fog-producing machinery breaks down. Without the intense Valley heat to draw the sea air in through the Bay Region, the wind diminishes and no longer carries the fog inland. San Francisco, the Golden Gate, and the coastline are fog free.

Then the process starts all over again. Without the incoming wind and fog, the sun gradually reheats the Valley. The rising warm air again begins to attract the foggy marine air inland. The result is a fog cycle of about a week in length, producing roughly three or four days of fog over the Bay and three or four days of sun.

https://baynature.org/article/cutting-through-the-fog/


Today's weather affects tomorrow's. They are not independent variables.


Weather poses an additional problem in that it is not fully observable. It is only partially sampled and contains unknown unknowns still.


Or you're more likely to be in an area with a pleasant climate


I've seen a very similar problem to this referred to as "black swan events"[1]. The whole point is you can't actually compute it. You see the period, it's there. What he's doing here isn't science, it's a guess.

As others have pointed out, it's rather rare that events happen with perfectly evenly distributed probability (probably the opposite). For instance, the chance of being in an accident is much higher if an accident just occurred right next to you. It's way more likely to get sick if others are already sick. In fact, although I don't have any statistics to back me up, I'd guess that most events happen in clusters (including spelling errors, or perfect pitch when you test the children in a music class).

This is essentially a guess, and it's better to say you don't know than guess wildly.

[1] https://en.wikipedia.org/wiki/Black_swan_theory


>The whole point is you can't actually compute it.

You can't compute the probability. But you can compute an upper bound on the probability with a reasonable confidence, which is what the author is doing here.

This is standard statistics and might be useful in some cases. It does assume that the events are independent, but this is a pretty standard assumption that you have to check whether it applies in your case, or at least approximately applies in your case.

>What he's doing here isn't science, it's a guess.

Which is a pretty big thing in this subfield of mathematics called statistics.


> it's better to say you don't know than guess wildly.

> although I don't have any statistics to back me up, I'd guess that most events happen in clusters

More seriously, yes, before you apply this rule you should think about whether "each event has the same probability of error" is appropriate for your situation.


I'm nitpicking, but there is not even a chance that this could be "science" under any coherent definitions of science and the future. For example, there is Karl Popper's falsification standard: the future is never falsifiable as long as it remains in the future.

Another approach would be this: the future is impossible to research, as you can only research entities that exist or phenomena that are happening. Future entities and future phenomena do not exist yet (by definition), so you cannot research them. -> no science of the future.

What we are talking about here is scientific prediction. "Scientific" is just another word for good, and it only means something when compared to another method of prediction that can be shown to be worse.

So we very much agree. I just wanted to write this out because the Future studies crowd has bothered me for some time now.


So basically the '3' comes entirely from the choice of a 95% confidence interval. If you want a 99% confidence interval it's instead the 'rule of 4.6', which doesn't roll off the tongue as well.


You could do the 'rule of 5' though and have a confidence of 99.3%, which is pretty close to 99.
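
Both numbers fall out of the same relation: for confidence level c and n failure-free trials, the upper bound is -ln(1 - c)/n. A quick sketch:

    import math

    for c in (0.95, 0.99, 0.993):
        print(c, round(-math.log(1 - c), 2))
    # 0.95  -> 3.0   (the rule of three)
    # 0.99  -> 4.61  (the "rule of 4.6")
    # 0.993 -> 4.96  (roughly the "rule of 5")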


If you're truly estimating, though, wouldn't you use a 50% confidence interval — which gives you the rule of 1 (chances of the thing being true are 1/n — if you've seen 20 pages without a typo, chances are 1/20 that there's a typo on a page, with 50% confidence)?


So, for a 50% confidence interval, you could look at the first word -- if it is a typo, boom, done. Otherwise, flip a coin.

This stinks. Estimating is about approximating an answer with low information -- it's about efficiency of using data, not _only_ doing better than guessing.


Sorry, I meant once you've examined n trials, what is your 50% confidence interval about the odds for a single trial? I think it would be 1/n.


Yes. That's a very reasonable choice.


Intuitively, this seems like an obverse of the "optimal stopping problem" (https://en.wikipedia.org/wiki/Optimal_stopping), or of the subset covered by the Odds Algorithm (https://en.wikipedia.org/wiki/Odds_algorithm), where the optimal stopping point for finding the highest quality item in a sample is essentially N * 1/e, with 1/e ≈ 0.368...

Handwavily, this resembles the rule of three, where we could probably say instead, "the rule of 1/e," because both of these appear to be artifacts of the same relationship and same type of problem.


So, according to this rule, if I wait for 10 minutes for a bus to come and none does, and then I wait for another 10 minutes for an alien invasion and none happens, the two have the same upper bound on their probability?

Or are we going to start talking about priors, on buses and alien invasions, in which case the rule of three is not really useful? If I want to know how likely a specific book is to have typos, can't I just go look for statistics on typos in books, and won't that give me a better estimate than a "rule" that will give the same results no matter what it is that it's trying to model?


This is the scenario where you know nothing else other than you waited 10 minutes for this event and it didn't happen.

Sure, if you have more data, then you can get much tighter bounds for your estimate.


> This is the scenario where you know nothing else other than you waited 10 minutes

To make the math work, I think you need to make several other assumptions. Don't you also have to know that the ten minutes you sampled are representative? That the events (if they did occur) are independent? It seems odd that you'd be in a situation where you can rely on specific assumptions like these while also believing that buses and aliens are equally likely to appear.


You always have more data -background knowledge- unless you've only existed in those last 10 minutes.

And if something has really never happened before, like my alien invasion example, what have we learned by applying the rule of three?

Honestly- perform the experiment yourself. Wait for X time, then calculate 3/X. Do you now have an upper bound on the probability that an alien invasion will happen?


>Wait for X time, then calculate 3/X. Do you now have an upper bound on the probability that an alien invasion will happen?

Assuming we would observe an alien invasion and write about it, we have about 5000 years since humans started writing. So 3/5000 per year, i.e. an upper bound on the probability of an alien invasion in a given year (assuming p hasn't changed since we started writing) of 0.06%.

It's an upper bound in any case, meaning that it is this or less. That includes p = 0. You're just 95% confident that it won't be higher than that.

But maybe the aliens brought us our writing system.


Yes, I do have a reasonable (95% confidence, as per article) upper bound on the probability - as I haven't noticed an alien invasion during my lifetime, it seems reasonable to conclude that noticeable alien invasions are very rare events and happen less frequently than once every decade. Possibly much less frequently, possibly many orders of magnitude less frequently, possibly never, but that's how upper bounds work.


An upper bound of "maybe, who knows" is not useful or informative enough to make a whole "rule" about it.


> the two have the same upper bound on their probability?

Using only that information, yes. Replace the names of the two events with "event A" and "event B" and it doesn't sound so weird.


The rule of three requires quite a lot of assumptions about the nature of the phenomena.

Or as Taleb says it:

> Consider a turkey that is fed every day. Every single feeding will firm up the bird's belief that it is the general rule of life to be fed every day by friendly members of the human race "looking out for its best interests," as a politician would say.

> On the afternoon of the Wednesday before Thanksgiving, something unexpected will happen to the turkey. It will incur a revision of belief.


It certainly does not require unwarranted assumptions, the proposed approach is consistent with the well-known turkey scenario.

From the experience of being fed, a turkey can infer that being slaughtered is a rare event, and that the likelihood of it happening exactly tomorrow (without having access to a calendar) is not necessarily 0, but is below a certain rate - and it's definitely not likely to happen three times in the next week. Surviving for 180 days is reasonable justification to assume that, on average, turkeys get slaughtered less frequently than every 60 days, i.e. that the likelihood of Thanksgiving suddenly arriving tomorrow isn't 50% but rather something below 2%.


Completely wrong. Continued survival on its own gives no information at all about survival rates and their distribution. This is because the variability of the data input is extremely low, so the mutual information between the days is vanishingly small.

This is why a good experimental design will observe a measurement with some expected variability.

An even better trick is the sleeping beauty problem. To solve it you need external information.


I wouldn't use a probability density of typos per page, but instead the probability that a word is spelled wrong.

Then it's like drawing marbles from a vase containing an unknown proportion of blue and red marbles.

I would use the formula (M+1)/(N+2) where N is the number of words, and M is the number of mistakes. Note that for a large corpus (M+1)/(N+2) approaches -> M/N, so we recover the frequentist probability.

Also note that the author (John D Cook) correctly expresses the intuition that 0 typos in 20 pages cannot be construed as certainty of no mistakes. Similarly, seeing a mistake in every word of a subset does not guarantee that all words in the book will have a typo. Let's look at the modified formula (M+1)/(N+2) in these cases: if we observe no typos (M=0) in 1000 words, it estimates the typo probability as (0+1)/(1000+2) = 1/1002 != 0, so we can't rule out mistakes. Similarly, if all words were typos (M=N) for the first 1000 words, we get 1001/1002 != 1, so we can't be sure every future word is a typo.

Check out Norman D Megill's paper "Estimating Bernoulli trial probability from a small sample", where the formula (M+1)/(N+2) is derived (it appears on page 3):

https://arxiv.org/abs/1105.1486
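
A sketch of that estimator applied to the two edge cases above:

    def laplace_estimate(mistakes, words):
        # Laplace's rule of succession: (M+1)/(N+2)
        return (mistakes + 1) / (words + 2)

    print(laplace_estimate(0, 1000))     # ≈ 0.001, not 0: can't rule out mistakes
    print(laplace_estimate(1000, 1000))  # ≈ 0.999, not 1: can't be sure of future typos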


> If the sight of math makes you squeamish, you might want to stop reading now.

Sigh…another article normalizing the concept that math is something it’s OK to be uncomfortable about…


I'd be interested to know how to estimate things that are very rare or don't have a normal distribution.

For instance, let's say I have a bold plan to protect us from meteor strikes that will cost $100B. How would a person make a decision about whether that's a good trade-off or not? And how would a mathematician help them make that decision?

How would it change for more complex cases, like a shield to prevent nuclear ICBM warfare which has never happened, but we all are worried about?


Simple answer, this is a rough upper bound. The more information about your problem you have, the better you will be able to model the probability distribution and the better priors you have on it happening.


In the example given, the author says that the odds of a given page having a typo are less than 3/20. Sure, but what if we don't want a range, but an exact number? That sounds like a more interesting challenge to me.

Formal statement:

- You have observed N events, with 0 occurrences of X

- Someone wants to make a bet with you about the likelihood of X happening

- Once you've quoted a number, your counter-party then has the option of making an even bet about whether the actual likelihood is greater than or less than your prediction

Eg: If you predict 3/N using a 95% confidence interval, then 95% of the time, the actual likelihood will be less than 3/N. Your counterparty will then win the bet 95% of the time, simply by predicting it to be lower.

Your ideal strategy would be to quote a likelihood which is over/under 50% of the time, not 95%.

Ie, you want to pick E such that 50% of the time, it matches the observation you've made (no occurrences), and 50% of the time it does not.

    E^N = 0.5
    N log E = log 0.5
    log E = log 0.5 / N
    E = 0.5^(1/N)

For the example given, that comes out to 0.966. Ie, there's a 96.6% chance of no typos in a given page. Across 20 pages, this comes out to 0.966^20 => 50% chance of no typos. If your goal is to quote the single best estimate which can hold up well in a betting market, I believe this would be the ideal strategy
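
For comparison, a sketch of this break-even estimate next to the rule-of-three upper bound for the article's N = 20 clean pages:

    N = 20
    rule_of_three_bound = 3 / N                # 0.15: 95% upper bound on per-page typo rate
    break_even_estimate = 1 - 0.5 ** (1 / N)   # ≈ 0.034: the betting-market quote above
    print(rule_of_three_bound, break_even_estimate)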



This presumes conditional independence - violations of which are common.

Instead, you get to estimate the dependency between each observation, as in advanced variants of the Bayes chain rule. Ultimately, some part of the estimator will contain an assumption, giving only bounded optimality.


I don't think you can do this. You have a reasonably good upper bound of the probability, but you don't have any justification for putting any lower bound other than zero on the probability.

In particular, if you're considering the probability of a catastrophic event, it's probably better to find some other rationale for estimating the probability than just saying 'it has never happened before'.


Exactly, the maximum likelihood estimator given your data is 0. You can just have a looser or stricter upper bound, depending on your requirements. However without a better model of your distribution and/or better priors (experience), you won't get a tighter estimate.


The unmentioned assumption of normally distributed errors is pretty evil.

This is exactly why this approach is worthless for rare events, which by definition have very skewed distributions. In that case, the probability is overestimated a lot.

On the other hand, if there is a rare but systematic error, the error probability will likely be grossly underestimated.

Thought experiment: suppose you're writing a long string of digits that consists of a 1 followed by a large number of zeroes (say 99 for simplicity) followed by (say) 10000 uniformly distributed digits.

Your writing system has an issue that changes half of the 5 digits into 6s. What probability of error will this dumb method estimate after the 100th digit?

A correct Bayesian approach updates the prior based on input variability, keeping the error estimates high when the input has low variability, etc. (This can be done with variance or another method.) An even better method tries to estimate the shape of the input distribution.

In other words, your result would be the difference between the likelihood ratio of the input and output prior distributions (estimated to date - since there are no errors, the ratio would be 1) and the likelihood ratio of the posterior distributions.


This reminds me of a 1999 article in the New Yorker by Tim Ferris called "How to Predict Anything" https://www.newyorker.com/magazine/1999/07/12/how-to-predict...

> Princeton physicist J. Richard Gott III has an all-purpose method for estimating how long things will last. In particular, he has estimated that, with 95% confidence, humans are going to be around at least fifty-one hundred years, but less than 7.8 million years. Gott calls his procedure the Copernican method, a reference to Copernicus' observation that there is nothing special about the place of the earth in the universe. Not being special plays a key role in Gott's method.
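
A sketch of the arithmetic behind those figures, assuming Gott's 95% delta-t argument (remaining lifetime between 1/39 and 39 times the current age) and a rough 200,000-year age for our species:

    def gott_interval(age, confidence=0.95):
        # Copernican / delta-t argument: you observe at a "random" point in the lifetime.
        z = (1 + confidence) / (1 - confidence)   # 39 for 95% confidence
        return age / z, age * z

    low, high = gott_interval(200_000)
    print(low, high)   # ≈ 5,100 years to 7.8 million years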


This is cool.

The given example of typos on page kind of highlights the fact that more sophisticated math involving a prior might give better results in some cases.

The beta(1, N+1) posterior comes from a uniform prior, i.e. the a priori assumption that a typo rate of 1% and a typo rate of 99% are equally likely.

Most people would assume that books don't have typos on most pages and a 99% typo rate is unlikely.

However as your sample gets bigger the prior matters less and less so this rule is still useful.

Just know that it is reasonable to bias the results a bit according to your prior when N is small.


If you're wondering where the formula (1-p)^n comes from, it's a quantity often used in gambling (if I throw a die 7 times, what are the chances of getting at least one 3?). The probability of an event with probability p happening at least once in n trials is 1-(1-p)^n, and he's using the inverse of that.

https://en.m.wikipedia.org/wiki/Binomial_distribution
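
The dice version as a quick sketch:

    p = 1 / 6                # chance of a 3 on a single throw
    n = 7
    print(1 - (1 - p) ** n)  # ≈ 0.72: chance of at least one 3 in 7 throws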


This is a useful rule of thumb.

Let's try stretching it: humans haven't destroyed the world in their 300,000 years of existence, so the probability of them destroying the planet in any given future year is less than 0.00001 (1 in 100,000). I feel like that might be an underestimate.

Reminds me of the Doomsday Argument

https://en.wikipedia.org/wiki/Doomsday_argument


I'm not very good at maths, so I didn't understand the whole post. However, does the size of the whole population affect the "3/n" thing? For example, if I've read 200 pages of a 201 page book and not discovered a typo, the chances are 3/200 if I've understood the post correctly. If the book has 20000 pages, is the probability still 3/200?


The rule of three gives an upper bound estimate that each page has a 3/200 or 1.5% chance of having a typo. For the 20,000 page book, the probability of typo per page is still 1.5%, but you have to roll the dice 19,800 more times, each time you turn a new page. The formula for the upper bound on probability of getting through all the remaining pages without a typo is

    1 - (1-3/200)^19800
Which of course is basically 1. The lower bound is 0, the upper bound is almost 1, so the confidence interval is [0, 1). Basically, no information at all. So it's much too early to conclude that you're not going to find any typos in the entire book.

The best way to state your conclusions is to say "after sampling 200 pages at random, we are 95% confident that the true typo-per-page rate is in the interval 0% - 1.5% or equivalently the total number of typos in the 20,000 page book is between 0 and 300."
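
The numbers above as a sketch:

    pages_read, pages_left, book_pages = 200, 19_800, 20_000
    p_bound = 3 / pages_read                 # 0.015 per-page upper bound
    print(1 - (1 - p_bound) ** pages_left)   # ≈ 1.0: says almost nothing about the rest
    print(p_bound * book_pages)              # 300: the upper end of the 0-300 figure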


Minor correction to your otherwise excellent final statement:

"... or equivalently the total number of [pages with] typos in the 20,000 page book is between 0 and 300."

We're not measuring the number of typos on a page, but only whether the page contains a typo or not. So we can't speak about the number of typos.


The math in that article is based on the assumption that typos are scarce, because that's the situation where it makes sense that you've seen 0 events. If that probability is close to zero, the probability of two typos per page is insignificantly small and can be discarded, and the expected number of typos in the book is approximately equal to the expected number of pages with typos. So in the context of these assumptions it's not necessary to consider this difference.

Just as the article itself states regarding the approximation "Since log(1-p) is approximately –p for small values of p" which relies on discarding p-squared terms as insignificantly small.


There's no reason to be imprecise and potentially sloppy over a two-word difference. Especially when the comment is being targeted to explain something to someone who did not have a solid grasp of the material, which is obviously necessary to make a leap such as yours.

And I won't even bother with the argument that presence of one typo on a page might actually increase the probability of another typo on the same page or nearby pages. The idea that this principle is based on uniform distribution is discussed enough in other threads.


After 200 pages you could estimate the probability that the next PAGE has a typo. You could use the number of pages left in the book to estimate how many pages may have typos.


The article is confusing, and the rule doesn't actually have anything to do with the odds of never finding an error. This rule lets you calculate a likely ceiling on the probability of there being an error on the next page, assuming nothing tricky is going on.

So after reading 200 pages with no errors and having no other information, the odds of finding an error on the next page are no more than 3/200. If you have 19,800 more pages left, you'd expect to find no more than about 300 pages with errors.


I believe the reason for bringing in the idea of there being no typos in the entire book is to show how you aren't able to reason well without some change in your approach.

Of course, the comments here suggest that it just makes the whole concept confusing to an audience who should either already know about confidence intervals or be well prepared to learn about them.


N is the sample size that you’ve sampled already in your observation.

If you've sampled all the pages, then we are talking about certainty, to which this wouldn't apply.


The assertion was that the probability is less than 3/N.


In my example, there’s either 1 page or 19800 pages left. But the post suggests the probability of a typo is the same in both cases.


It's the probability of finding a typo on a page, not the remainder of the book.


Ah now I see the difference, thank you.


> It says that if you’ve tested N cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/N.

So in your example, according to the rule, the probability of errors is less than 3/200. Which it is, either 0/201 or 1/201.


Although interesting, the article doesn’t relate to predicting things that haven’t happened yet, just things that aren’t known yet.

When predicting things that haven’t happened yet, a publicized “certain” prediction will inevitably influence the actual probability in an unpredictable way.


This post is of course talking about the difference between MLE and MAP estimation. Consider the converse case: you flip a coin once and observe it is heads. Do you then conclude that the probability of heads is 100%? No, because even though you have data supporting that claim, you also have a strong prior belief in what your probability of heads should roughly be. This is encoded in the beta distribution, as mentioned in the article


This is very similar to the sunrise problem. Laplace said the probability of the sun rising tomorrow is (k+1)/(k+2), where k is the number of days we know the sun has risen consecutively, if we always saw it rise, and if we don't have any other information.

https://en.m.wikipedia.org/wiki/Sunrise_problem


I don't know the first thing about schools of thought in statistics (frequentist? Bayesian?) but something feels fishy about extrapolating a probability based on sample size (or number of trials) alone. 20 pages, no typos, <15% chance of flawless spelling? What if the manuscript is 2×10⁴¹ pages long?

Would someone care to explain why the math says it shouldn't matter (for reasonably small values of p)?


><15% chance of flawless spelling?

per page


Clearly this is a bad estimating rule for some processes. The one that popped into my mind is the probability of earthquakes (guess where I live...). This would have the probability of an earthquake declining with each quake-free year that passes, where in fact the USGS would say the opposite.


There's something that feels different about that. You're talking about the rate of some event occurring (like a Poisson process), and measuring how many occurrences are in an interval (a year). What are the 6 samples you are collecting? 6 years of numbers of earthquakes in a year?

Edit: sorry I thought you were replying to another comment.


I feel like I'm having a math stroke.

The posterior probability of p being less than 3/N for Beta(1, N+1) should be integral(Beta(1, N+1), 0, 3/N), right?

That trends toward zero, so I must be wrong, but I can't for the life of me remember why.

EDIT: Ah! I was accidentally using Beta instead of the PDF for Beta.


This rule of 3 may be a good example for reading books and finding typos. It is nowhere near as good for estimating the chances of "something that hasn't happened". Using this rule and claiming the result is an estimate seems pretty arbitrary.


There's a trivial corollary, which I remember from my first physics class. Never base anything on less than three measurements. Or maybe it wasn't even physics, but rather carpentry. The old "Measure twice, cut once." rule is iffy.


Three data points is not the same thing as three measurements of the same object.


There's no question that three measurements of some property of an object are three data points. I vaguely recall that my first physics lab experiment was measuring something with a ruler.


" Or maybe it wasn't even physics, but rather carpentry. "

That's the quote of the day.

My father was a carpenter, and every day I think software has more in common with carpentry than computer science.


This feels related, and is interesting:

https://en.wikipedia.org/wiki/Sunrise_problem

(estimating the probability that the sun will rise tomorrow)


related question: probability that the sun will rise again tomorrow

https://en.m.wikipedia.org/wiki/Sunrise_problem

A practical application of this is included in reddit's "best" sorting of comments, which leaves room for doubt at low sample sizes of votes.

https://redditblog.com/2009/10/15/reddits-new-comment-sortin...
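
For the curious, a sketch of the Wilson score lower bound that kind of ranking is based on: rank by a pessimistic estimate of the upvote fraction, so a 2/2 comment doesn't outrank a 95/100 one.

    import math

    def wilson_lower_bound(ups, downs, z=1.96):   # z = 1.96 for ~95% confidence
        n = ups + downs
        if n == 0:
            return 0.0
        p = ups / n
        return ((p + z*z/(2*n)
                 - z*math.sqrt((p*(1 - p) + z*z/(4*n)) / n))
                / (1 + z*z/n))

    print(wilson_lower_bound(2, 0))    # ≈ 0.34: 2 up, 0 down -- still very uncertain
    print(wilson_lower_bound(95, 5))   # ≈ 0.89: lots of votes -- ranks higher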


Mildly interesting trivia: there's a Chinese idiom stating "compare 3 shops for a good deal" (貨比三家不吃虧), I guess that makes sense from a statistical standpoint. :)


I was just on a flight, hanging with a guy whose PhD was on this topic. Josh, is that you?? What are the odds.


Interesting: the frequentist derivation is using the logarithm, while the Bayesian one, the exponent.


But note that log and exp are inverses of each other, and it's applied to opposite sides of the equation. In particular, this:

    1 - exp(-3) ≈ 0.95
Can be rewritten as:

    -3 ≈ log(1 - 0.95)


(1-p)^n is one of my favorite things. Happy to see it get some attention.


Murphy's law: "Anything that can go wrong will go wrong."


This is from 2010.


Oh, numeric astrology and probability tantras.

The future is not predictable by definition. It is just an abstract concept, a projection of the mind. Any modeling, however close to reality it might seem to be, is disconnected from it, like a movie or a cartoon. Following complex probabilistic inferences based on sophisticated models is like acting in life guided by movies or tantric literature (unless you are Goldman Sachs, of course).

For fully observable, discrete, fully deterministic models, such as dice or a deck of cards, probability can only say how likely a certain outcome might be, but not (and never) what exactly the next outcome will be.

Estimation of anything about non-fully-observable, partially-deterministic environments is fucking numeric astrology with a cosplay of being a math genius.

No matter what kind of math you pile up - equations from thermodynamics, gaussian distributions or what not - it is still a story disconnected from reality, like the ones in tantras.


I think you get more bang for your buck if you try to understand the mechanics that generate successes and failures. Assuming a flat prior is crazy in almost every real-world case. In other words, I think effort is probably better spent understanding the problem rather than understanding how to make the most of ignorance.



