A Bayesian Approach to A/B Testing (custora.com)
99 points by pospischil on May 4, 2012 | 20 comments



There's a related notion of an "adaptive" statistical design, where the allocation of users to each test group varies based on the prior performance of the groups. For example, if after the first 100 users you notice that group A seems to be doing slightly better than group B, you favor it by allocating more users to that group. You can compute this allocation in such a way as to maximize the number of successes. In particular, it will eventually converge to always picking the better approach, assuming there is a real difference. This also means you don't really need to "stop" the experiment to make everyone use the better version (although you may want to for other reasons).

Here is one paper: http://web.eecs.umich.edu/~qstout/pap/SciProg00.pdf
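
To make the idea concrete, here is a minimal sketch of one such adaptive scheme (Thompson sampling over Beta posteriors); the "true" conversion rates and the number of users are invented purely for illustration, not taken from the paper:

    import random

    # Hypothetical true conversion rates, unknown to the algorithm.
    TRUE_RATES = {"A": 0.05, "B": 0.06}

    # Uniform Beta(1, 1) priors: track successes and failures per group.
    successes = {"A": 0, "B": 0}
    failures = {"A": 0, "B": 0}

    for _ in range(10000):
        # Draw a plausible rate from each group's posterior and show the
        # variant whose draw is highest; allocation adapts automatically.
        draws = {g: random.betavariate(successes[g] + 1, failures[g] + 1)
                 for g in TRUE_RATES}
        chosen = max(draws, key=draws.get)

        # Simulate whether that user converts.
        if random.random() < TRUE_RATES[chosen]:
            successes[chosen] += 1
        else:
            failures[chosen] += 1

    # Over time almost all users end up in the better-performing group.
    print(successes, failures)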


Noel Welsh at untyped has been working on a cool implementation of this adaptive design.

Check it out http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-m...

And the HN discussion: http://news.ycombinator.com/item?id=3867380


If you aim to make inferences about which ideas work best, you should pick a sample size prior to the experiment and run the experiment until the sample size is reached.

That's not a very Bayesian thing to say. It doesn't matter what sample size you decided on at the beginning. A Bayesian method should yield reasonable results at every step of the experiment and allow you to keep testing until you feel comfortable with the posterior probability distributions.

If 10 customers have converted so far, and 30 haven't, then you would expect the conversion rate to be somewhere between 10% and 40%, as evidenced by this graph of the Beta(10, 30) distribution:

http://www.wolframalpha.com/input/?i=plot+BetaDistribution+1...

You then do the same with method B, and stop testing once the overlap between the two probability distributions looks small enough.
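
A rough sketch of that comparison, assuming a uniform Beta(1, 1) prior (so the posterior is Beta(successes + 1, failures + 1)) and made-up counts for method B:

    import random

    def prob_b_beats_a(conv_a, miss_a, conv_b, miss_b, draws=100000):
        """Monte Carlo estimate of P(rate_B > rate_A) under Beta posteriors."""
        wins = 0
        for _ in range(draws):
            rate_a = random.betavariate(conv_a + 1, miss_a + 1)
            rate_b = random.betavariate(conv_b + 1, miss_b + 1)
            if rate_b > rate_a:
                wins += 1
        return wins / draws

    # 10 of 40 converted on A (as above); say 18 of 40 converted on B.
    print(prob_b_beats_a(10, 30, 18, 22))
    # Keep testing until this probability is close enough to 0 or 1 for you.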

Anscombe's rule is interesting, but it seems rather critically dependent on the number of future customers, which is hard to estimate. The advantage of the visual approach outlined above is that it's more intuitive, and people can use their best judgment to decide whether to keep on testing or not.

Disclaimer: I am not an A/B tester.


This way of framing the problem is known as the bandit problem. You can find lots of papers about it (Bayesian and frequentist). As others have mentioned in this thread, we have a startup providing bandit algorithms as SaaS: http://mynaweb.com/


I've looked at your website, and from what I gathered, I would make the same criticism as I made of Anscombe's rule: it's not easy at all to decide what the rewards should be, or how to put a price on exploration vs. exploitation. The more I think about it, the more I feel that an engineer looking at Beta distributions could weigh the trade-offs and make a better decision than a black-box algorithm with inadequate assumptions.

Granted, this doesn't really scale to testing many combinations of features, and I think I can see what you're shooting for. Best of luck with Myna.


This is a powerful approach when you can quantify your regret. For many startups, however, it's important to understand the tradeoffs involved in moving one metric upward or downward. To take Zynga as an example, they care about virality at least as much as engagement (or perhaps more so). Adding or removing a friendspam dialog is likely to trade some virality for user experience. What percentages make or break the decision? Sometimes this is a qualitative call.

In environments where you need to look at the impact of your experiments across multiple variables, and make a subjective call about the tradeoffs, it's really important to have statistical confidence in the movement of each variable you're evaluating. This is a key strength of the traditional A/B testing approach.


May I ask how this is Bayesian in any way? I understand that using the term Bayesian is good for directing clicks to a site, but this seems like good old-fashioned frequentist math. None of the hallmarks of a Bayesian approach to the problem are here: having a distribution over hypotheses, having an explicit prior, computing the posterior distributions.

I have some experience with the medical trial literature, and specifically with bandit algorithms and using cumulative regret versus other statistical measures like PAC frameworks. Regret is most certainly not a Bayesian idea. Instead you are explicitly modeling the cost of each action (serving the A or B variant to a user) rather than assuming all costs are equal.

Yes, this is a better approach because it explicitly models the costs associated with the exploration/exploitation dilemma. But, it is not Bayesian.


In the clinical trial literature Anscombe's approach is considered Bayesian, and Armitage is frequentist. From Armitage's 1963 response to Anscombe's paper:

'Anscombe takes the Bayesian view that inferences should be made in terms of the likelihood function... An immediate consequence is that stopping-rules are irrelevant to the inference problem.'

Page 6 of the Anscombe paper that I cited may be helpful in your understanding of the approach.


This gets into the nitty-gritty of running trials (A/B testing, split testing). If things like this get baked into libraries, they have a chance of pushing the state of the art forward.

Very worthy of an HN post.

EDIT: Actually, check out their entire blog. It's worth your time.


What libraries are you currently using where you would like to have things like this?


This description of content optimization using bandit algorithms sounds like an even better approach: http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-m...

That company has already built a web app and service, Myna, to optimize content using that approach: http://www.mynaweb.com/. A simulated experiment showed their approach to be better than A/B testing: http://www.mynaweb.com/blog/2011/09/13/myna-vs-ab.html. Myna's website doesn't say whether it is currently free, though, or what its pricing will be when it goes out of beta.


As of yesterday evening, Myna is in public beta. That is, you can sign up straight from the website. Myna is completely free for now. When we start charging, the cost will be in line with other companies in the same space. If you're earning from your site, the cost of Myna should be a rounding error.


It's worth noting that the zoomed-in graph (the 4th image), while it correctly shows that using significance as a stopping rule could cause problems, also clearly shows that the classical test is far more powerful for n < 2000, i.e. it declares a result significant with more sensitivity.

So while Anscombe's rule looks good for massive amounts of users, smaller tests with predefined stopping rules can be more useful if you only have a few thousand observations.


"k is the expected number of future users who will be exposed to a result"

Does this mean that this approach does not make much sense if your estimate of k is totally wrong?

How do you estimate k?


Anscombe talks about this and proposes two solutions:

One is to estimate it based on the number of daily visitors your site gets, and then estimate how long you will run the winning alternative in the campaign.

He also proposes: 'perhaps k should be assessed, not as a constant, but as an increasing function of |y|/n, since the more striking the treatment difference indicated the more likely it is that the experiment will be noticed... One way of introducing such a dependence of k on |y|/n is to assess k+2n as a constant.'

This actually simplifies the math somewhat, and you can see the full details in Anscombe's paper cited in the blog.
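
For the first of these suggestions the arithmetic is trivial; the traffic level and campaign length below are of course made-up placeholders:

    daily_visitors = 2000        # hypothetical traffic level
    days_winner_runs = 180       # how long you expect to keep the winner up
    k = daily_visitors * days_winner_runs
    print(k)  # roughly 360,000 future users exposed to the winning variant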


I don't get it. What if this is my home page? What if I intend to run the campaign "forever?"

If k is an estimate of how much traffic I will ever see, it seems like I'm going to be calculating the Phi-inverse of approximately 0.

Where can I see Anscombe's paper online? (It was published in 1963; it's not linked in the blog post, just cited.)


Just sent you a copy of the paper. If you plan to use the result 'forever', then theoretically you would be willing to sacrifice a huge (infinite) amount of suboptimal performance now, so that you get the correct answer when you eventually pick the winning idea. It would be very important to have the correct winning idea, because it is going to run for eternity.

In practice, we don't actually ever run the winning idea forever. We do website re-designs periodically, we test new ideas, business needs change. So we can pick a reasonable value for k based on these constraints.

Alternatively, you can get better performance by _not_ picking a stopping criterion and instead dynamically choosing which homepage to show. As soon as one idea appears to be doing better, you start showing it to more users. By choosing an appropriate adaptive sampling strategy, you can reduce regret below what a constant sampling strategy would give. However, for many people the adaptive strategy may be more trouble to implement than it is worth.

The most important takeaway is to _not_ use repeated significance tests to determine experiment termination time. Either use the Anscombe bound with an appropriate k, or fix the sample size before starting the experiment.
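
For the fixed-sample-size route, a standard two-proportion power calculation looks roughly like this; the baseline rate, detectable lift, alpha, and power are all numbers you have to choose yourself:

    from statistics import NormalDist

    def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
        """Approximate n per group for a two-sided two-proportion z-test."""
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_beta = NormalDist().inv_cdf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return ((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2

    # e.g. a 5% baseline conversion rate, hoping to detect a lift to 6%
    print(round(sample_size_per_group(0.05, 0.06)))  # on the order of 8,000 per group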


How many tech companies use the Bayesian approach rather than the traditional (confidence-testing) approach?


There is nothing wrong with studying this approach and trying it out to see whether the interpretations are more helpful. However, insufficient data and insufficient technique are common when studying extremely complex systems; this Bayesian approach makes assumptions that may not be correct.


> However, insufficient data and insufficient technique are common when studying extremely complex systems;

A Bayesian approach is especially well-suited to small sample sizes, unlike a frequentist approach, which will give a nonsensical result for a sample size of 0, 1, or 2.
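
A tiny sketch of that point, using a uniform Beta(1, 1) prior (the choice of prior is my own assumption here):

    def posterior_mean(conversions, visitors):
        """Posterior mean of the conversion rate under a uniform Beta(1, 1) prior."""
        return (conversions + 1) / (visitors + 2)

    # Well defined even with no data: posterior_mean(0, 0) == 0.5.
    # The frequentist point estimate conversions / visitors divides by zero
    # at n = 0 and sits at an extreme 0% or 100% for n = 1.
    for conversions, visitors in [(0, 0), (1, 1), (1, 2)]:
        print(visitors, posterior_mean(conversions, visitors))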

As for improper technique, I can't help you there. 'Garbage in, garbage out', as they always say.



