Every time you do a significance calculation and decide whether to stop the test or continue you increase the likelihood of a type I error (i.e., false positive).
So, your 99.8% confidence isn't 99.8%.
There are a few ways to compensate for this. The easiest is to fix your sample size and not reach any conclusions until you have tested that many users.
You can determine your sample size with a power calculation: for example, figure out the minimum number of people needed so that you have a 95% chance of detecting a 10% effect size if it exists.
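For concreteness, here's a minimal sketch of that kind of power calculation using the standard two-proportion normal approximation; the 5% baseline conversion rate and 10% relative lift are made-up numbers, not anything from this discussion.

```python
from scipy.stats import norm

def sample_size_per_group(p_base, lift, alpha=0.05, power=0.95):
    """Approximate users per group needed to detect a relative lift in
    conversion rate with a two-sided two-proportion z-test."""
    p_new = p_base * (1 + lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # e.g. 1.645 for 95% power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int((z_alpha + z_beta) ** 2 * variance / (p_base - p_new) ** 2) + 1

# Hypothetical numbers: 5% baseline conversion, hoping to detect a 10% relative lift.
print(sample_size_per_group(0.05, 0.10))  # on the order of 50,000 users per group
```

The punchline is usually the same: for small baseline conversion rates and modest lifts, the sample you need is far larger than intuition suggests.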
This is a problem in clinical trials where ethical questions arise. If the test appears harmful, should we stop? If it appears beneficial, isn't it unethical to deprive the control group of the treatment?
Anyhow, if you want to observe the results continuously, you need to use a technique like alpha spending, sequential experimental design, or Bayesian experimental design.
TL;DR: If you're periodically looking at your A/B testing results and deciding whether to continue or stop, you're doing it wrong and your significance level is lower than you think it is.
And this, ladies and gentlemen, is why we use Bayesian likelihood ratios. You can use whatever stopping algorithm you like on likelihood ratios and they will still be valid.
"Statistical significance", not so much. A simple test program I wrote showed that with a limit of 500 coinflips, flipping a fair coin will produce, on at least one out of the steps, a dataset which rejects the null hypothesis with p < .05, fully 30% of the time.
The likelihood ratio is just true no matter how you use it.
The meaning of an experimental result should not depend on the experimenter's private state of mind (such as the stopping rule they followed) if that state of mind does not affect the experimental apparatus. The likelihood ratio is objective, "statistical significance" is not. See link.
(Note: the second paragraph in the grandparent was an edit; the original reply in the parent may have been to the first paragraph only.)
Well, keeping in mind that experimenters should just report likelihood ratios, I'd have estimated a pretty low prior probability that one of two identical pages produced more signups than the other.
But I suppose that after seeing a likelihood ratio of 1000/1 favoring the hypothesis "each of these pages has a separate fixed conversion rate" over "both of these pages have the same fixed conversion rate", using Laplace's Rule of Succession (i.e. a flat prior between 0 and 1 for the conversion rate) in all cases, I'd think that something was more likely than not going on.
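For anyone who wants to see the mechanics, that comparison comes down to Beta-Binomial marginal likelihoods under the flat prior. Here is a small sketch with hypothetical signup counts; the 1000/1 figure above is the parent's number, not this example's output.

```python
from math import exp
from scipy.special import betaln

def log_marginal(successes, failures):
    """Log marginal likelihood of the observed sequence under a flat
    Beta(1, 1) prior on the conversion rate (Laplace's rule of succession):
    integral of p^s * (1-p)^f dp = B(s + 1, f + 1)."""
    return betaln(successes + 1, failures + 1)

# Hypothetical counts: signups and non-signups for pages A and B.
s_a, f_a = 120, 880
s_b, f_b = 90, 910

log_separate = log_marginal(s_a, f_a) + log_marginal(s_b, f_b)  # two independent rates
log_same = log_marginal(s_a + s_b, f_a + f_b)                   # one shared rate

# Binomial coefficients are identical in both models, so they cancel in the ratio.
print(exp(log_separate - log_same))  # likelihood ratio: "separate rates" vs "same rate"
```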
For academic journals, I think that what you propose is totally reasonable. I'll try to give a small taste for those not familiar:
For a new HIV test, I want to know the likelihood ratios, not other parameters which tend to be driven by the background HIV prevalence in the researcher's specific catchment area. My catchment area's HIV prevalence will differ, so other aspects of the test that are dominated by prevalence will not be the same in my population. However, the likelihood ratio will be.
The point that I didn't finish making before getting kicked out of the coffee shop was that, at some point, people will want to compute a posterior, so they'll need to figure out a prior. Your suggestion above is helpful.
Edit - Also, you're right in that I saw just the 1st paragraph when I initially replied.
This is one of the more insightful comments I've read lately. I thought just upvoting wasn't enough for this one, so I wanted to say thanks and I'm going to refine my next split tests along your guidelines.
It's also worth pointing out that you can safely peek and make an early decision if you insist on a higher confidence interval for early stops.
The exact numbers depend on how often you check the confidence interval, but tightening the allowed error rate by a factor of 5-10 for early measurements will usually give reasonable results.
E.g., let's say you ultimately want a 98% confidence interval (a 2% chance of error). A 10x tightening gives a 0.2% chance of error, i.e. 99.8% confidence. Depending on how often you are sampling, an early result at 99.8% confidence can be about as trustworthy as a later result at 98% confidence.
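This is essentially the Haybittle-Peto idea from clinical trials: use a very strict boundary at interim looks and roughly the nominal one at the final analysis. A quick A/A simulation shows how little it inflates the overall error rate; the conversion rate, sample size, and number of looks below are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_sims, n_per_arm, looks = 5_000, 2_000, 10
p = 0.10                                   # same true rate in both arms (A/A test)

a = rng.random((n_sims, n_per_arm)) < p
b = rng.random((n_sims, n_per_arm)) < p
checkpoints = np.linspace(n_per_arm // looks, n_per_arm, looks, dtype=int)

z_interim = norm.ppf(1 - 0.002 / 2)        # 99.8% confidence at early looks
z_final = norm.ppf(1 - 0.02 / 2)           # 98% confidence at the final look

false_pos = np.zeros(n_sims, dtype=bool)
for i, n in enumerate(checkpoints):
    pa, pb = a[:, :n].mean(axis=1), b[:, :n].mean(axis=1)
    pooled = (pa + pb) / 2
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    threshold = z_final if i == looks - 1 else z_interim
    false_pos |= np.abs(pa - pb) / se > threshold

print(false_pos.mean())  # should land only a little above the nominal 2%,
                         # unlike naive peeking at the 98% level every time
```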
When you take multiple peeks at the data as it begins to accumulate, you're essentially performing multiple testing.
Imagine that you wanted to see if A != B, with a 95% confidence. To oversimplify, this means that you're willing to accept that 1 in 20 times, you'll incorrectly reject the null and you will consider A different from B even though they are truly the same.
If you run 20 independent tests at once, each at a 95% confidence, then by chance you'd expect 1 to reject the null even if they're actually all null.
Now, if you repeatedly peek at the data as it accumulates for one test, you're doing something similar (not 100% the same because the tests aren't totally independent, but similar). To again oversimplify, you'd expect, by chance, to see "significant" results 1 in 20 of the times that you look at the data, even if there is no significant result. This is why you need to wait until you've collected all of the necessary data first, or implement a procedure to protect you from incorrectly seeing "significance" when it's not there. You might, for example, require a more stringent confidence level earlier on, and less stringent ones later.
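The arithmetic behind that intuition is quick to check, along with the simplest (Bonferroni) fix:

```python
# k independent tests, each run at alpha = 0.05, with no true effects anywhere.
alpha, k = 0.05, 20
print(alpha * k)              # expected number of spurious rejections: 1.0
print(1 - (1 - alpha) ** k)   # ~0.64: chance of at least one spurious "significant" result
print(alpha / k)              # 0.0025: Bonferroni per-test threshold restoring ~5% overall
```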
Yep, that's what I realised from your and jfarmer's comments. It does make sense that you're sampling multiple times and peeking to see if you've reached that, thank you.
Not very many people have studied probability, whereas a lot (relatively) have heard the good effects of A/B testing.
I think A/B testing is a very good method, but you need a lot of data. I'd say, when you're just starting up, don't try A/B. You won't get the data you need.
It's very easy to be seduced by statistics, even when the stats are wrong.
It's very easy to use the wrong statistical method or to do things with your data that you shouldn't, which causes you to misinterpret the output of statistical methods.
Which is more likely: (a) that this was the 1-in-500 case (i.e., 1/(1-0.998)) where the correct method incorrectly rejected the null; (b) that there is actually a cryptic difference between A and B in the A/B test; or (c) that the wrong method was used, or the right method was misused, or something along those lines?
I would say that those are arranged in increasing order of likelihood.
There wasn't much to go wrong... The A/B testing system renders one template in two ways depending on whether the user is in the test or control group, and the template was the same in both cases. If we had made a mistake, the results wouldn't normalise when we got more data, they would have just stayed there.
I'm saying that the mistake was overinterpreting the statistics without having a good handle on what they actually meant (part of option 'c' above). This comes off as ad hominem, but it's the most common error mode so I'd use this as a motivation to learn more stats and understand why the test was probably not telling you what you thought it was.
The worst part was that we actually did make decisions based on the confidence metric when it was as low as 95%, sometimes, and then realised that we shouldn't have (we should have waited for more data).
This is what most people who talk about A/B testing don't mention: you need more data than you think in order to get good results.
This post would be useful if it told us (1) the actual numbers of visitors vs successes along the timepoints that you mention, and (2) the formula that you used to calculate confidence. Without knowing those 2 facts, it's hard to conclude anything. Also, is that confidence interval corrected for the multiple peeks that you took at the data? Etc. I'd like to know much more about the particulars of the methodology here, because it sounds like statistical methods might have been misunderstood or misused (perhaps by the A/B software or anywhere else along the chain).
I can't share the actual data, but the method of calculating the confidence was the standard Z-score confidence formula, as used in http://abtester.com/calculator/, Google Website Optimiser, etc. The numbers did check out; it was just an unlikely fluke, which is why I don't trust the confidence interval as much as I used to...
Before running this test, did you estimate the sample size needed to detect the estimated difference in effects from A vs B? Since the expected difference here is 0, the necessary sample size would be infinite; instead, you could at least posit a very small effect, which would still necessitate a very large sample. I would be very cautious, then, about trying to interpret the "confidence interval" before you had accumulated that large sample size. If you're going to peek before getting to that point, then you should almost certainly have implemented an alpha spending function (or similar) in order to maintain the desired overall error rate.
This should not reduce your belief in confidence intervals; this is a great, motivating opportunity that should prompt you to learn how to use them correctly.
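For anyone curious what an alpha spending function actually looks like, here's a sketch of the Lan-DeMets O'Brien-Fleming-style function; the schedule of four equally spaced looks is just an example.

```python
import numpy as np
from scipy.stats import norm

def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative type I error allowed to be 'spent' by information
    fraction t (0 < t <= 1), using the Lan-DeMets O'Brien-Fleming-like
    spending function."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))

# Example: four equally spaced looks at 25%, 50%, 75%, and 100% of the target sample.
for t in [0.25, 0.5, 0.75, 1.0]:
    print(f"t = {t:.2f}  cumulative alpha spent = {obrien_fleming_spend(t):.4f}")
```

Each interim look only gets to spend the increment over what was spent at the previous look, so early peeks face far stricter thresholds and the full 5% only becomes available at the final analysis.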
No, we hadn't estimated the necessary sample size. To be honest, nobody expected the confidence metric to go so high for a test with no changes, which is why we were surprised. We've studied various A/B testing resources and we haven't seen this "peeking error" mentioned anywhere, so we'd appreciate any resources you may have (and would possibly summarise them in a subsequent post).
Sure thing. I just posted a bit more detail about this elsewhere in this thread. Also, googling "alpha spending function" brings back some useful results. (I can't paste into the HN text box with the latest Chrome dev, otherwise I'd paste some links here for you.)
"The important lesson here, and the one you should take away from our experience, is this: Whenever you think you have enough data for the A/B test, get more! Sometimes, you will fall into that 0.1%, and your decision will be wrong, and might impact your metrics adversely, and you might never find out."
This is actually terrible advice because continuing a test in which one set is significantly better than another has a cost. You are showing an inferior set to a segment of your users and that costs you money (or signups or whatever metric it is you're improving which, at the end of the day, presumably equates to money).
As an example, suppose you do a test and discover something that doubles your signup rate (and therefore monetization rate) and you've got a confidence level of 99.9%. It's true, there's a 1 in 1,000 chance your result is flawed and you'll end up with the wrong decision. But there's a 999 in 1,000 chance that you're showing a significantly inferior signup page to half of your customers, costing you about 25% of potential revenue. It doesn't even take someone who knows what EV stands for to realize that the EV of ending the test is hugely positive here.
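A toy expected-value check of those numbers, assuming (hypothetically) that if the result is a fluke the "winner" really just performs at baseline:

```python
# All figures hypothetical, matching the parent's example.
p_false_positive = 1 / 1000       # chance the 99.9% result is actually a fluke
baseline, improved = 1.0, 2.0     # relative signup rates of the old vs new page

keep_testing = 0.5 * baseline + 0.5 * improved   # half of traffic still sees each page
ship_winner = (1 - p_false_positive) * improved + p_false_positive * baseline

print(keep_testing)                 # 1.5x baseline
print(ship_winner)                  # ~1.999x baseline
print(1 - keep_testing / improved)  # 0.25: the "25% of potential revenue" left on the table
```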
I guess that depends on long-term vs short-term gain. Sure, you'll gain a few days' worth of signups, but you have a small chance to make a decision that will impact every visitor from that point on negatively (which you haven't detected).
Wouldn't that be half of 999 in 1,000, assuming that 50% are shown v1 and 50% v2? That doesn't change your advice: once you've reached a certain confidence, it's time to make the change.
And the specific reason here (among the ones listed in that article) is almost certainly a lack of data. It's always helpful to estimate your necessary sample size prior to beginning your test, so you don't get too excited when results look odd before you get anywhere near the sample necessary to see an effect of a given size.
Sure, but the way you estimate your sample size is working backwards from the confidence level you want to get. In this case, we were waiting for the confidence metric to tell us when the sample size was appropriate, but it didn't work out as planned.
Hmm, yes, I see now why you're right. I'd like to read a bit more on the theory, though, so we don't make these mistakes again. I'll look around for some resources, thanks again!
But if I flip a coin 10 times and get the Queen's face (I'm in the UK) 8 of those times, that doesn't mean I'll keep getting heads.
It's been quite a lot of years since I was involved in the betting world now, but at the time I was constantly amazed at the number of people who thought the Martingale system was a winner.