Ask HN: Resources about math behind A/B testing
207 points by alexmolas 87 days ago | 55 comments
I've been learning more about A/B testing over the last few months. I've read almost all the work by Evan Miller, and I've enjoyed it a lot. However, I'd like a more structured approach to the topic, since sometimes I feel I'm missing some basics. I have good math knowledge and pretty decent stats foundations. What are your favourite books/papers on this topic?



I don't think the mathematics is what gets most people into trouble. You can get by with relatively primitive maths, and the advanced stuff is really just a small-order-of-magnitude cost optimisation.

What gets people are incorrect procedures. To get a sense of all the ways in which an experiment can go wrong, I'd recommend reading more traditional texts on experimental design, survey research, etc.

- Donald Wheeler's Understanding Variation should be mandatory reading for almost everyone working professionally.

- Deming's Some Theory of Sampling is really good and covers more ground than the title lets on.

- Deming's Sample Design in Business Research I remember being formative for me also, although it was a while since I read it.

- Efron and Tibshirani's Introduction to the Bootstrap gives an intuitive sense of some experimental errors from a different perspective.

I know there's one book covering survey design I really liked but I forget which one it was. Sorry!


I’m also looking for a good resource on survey design. If you remember the book, please let us know! :)


I know the Deming books are written to a large extent from the perspective of surveys, but they are mainly technical.

I have also read Robinson's Designing Quality Survey Questions which I remember as good, but perhaps not as deep as I had hoped. I don't think that's the one I'm thinking of, unfortunately.

It's highly possible I'm confabulating a book from a variety of sources also...


I've seen procedural errors over and over. Such errors often come down to our temptation to see numbers as objective truth that doesn't require deeper thought. In very narrowly defined scopes, that might be true? But in complicated matters, that rarely holds and it's up to us to keep ourselves honest.

For example, if an experiment runs for a while and there is no statistically significant difference between cohorts, what do we do? It's a tie, but so often the question gets asked "which cohort is 'directionally' better?" The idea is that we don't know how much better it is, but whichever is ahead must still be better. That reasoning doesn't work unless there is something special about a difference of exactly zero in your particular case. Many of us are not comfortable with the idea of a statistical tie (e.g., the 2000 US Presidential election: the outcome was within the margin of error for counting votes, there was no procedure for handling that, and irrationality ensued). So the cohort that's ahead must be better, even if the difference isn't statistically significant, right? We don't know whether it is or not, but declaring it so satisfies our need for simplicity and order.

Ties should be acknowledged, and tie breakers should be used and be something of value. Which cohort is easier to build future improvements upon? Which is easier to maintain? Easier to understand? Cheaper to operate? Things like that make good tie breakers. And it's worth a check that there wasn't a bug that made the cohorts have identical behavior.

Another example of a procedural mistake is shipping a cohort the moment a significant difference is seen. Take the case of changing a button location. It's possible that the new location is much better. But the first week of the experiment might show that the status quo is better. Why? Users had long ago memorized the old location and expect it to be there. Now they have to find the new location. So the new location might perform worse initially, but be better in the steady state. If you aren't thinking things like that through (e.g., "Does one cohort have a short term advantage for some reason? Is that advantage a good thing?") and move too quickly, you'll be led astray.

To bring this last one back to the math a little more closely: a core issue here is that we have to use imperfect proxy metrics. We want to measure which button location is "better". "Better" isn't quantifiable. It has side effects, so we try to measure those. Does the user click more quickly, or more often, or buy more stuff, or ...? That doesn't mean better, but we hope it is caused by better. But maybe only in the long run as outlined above. Many experiments in for-profit corporate settings would ideally be measured in infinite time horizon profit. But we can't measure or act on that, so we have to pick proxies, and proxies have issues that require careful thought and careful handling.


> Take the case of changing a button location. It's possible that the new location is much better. But the first week of the experiment might show that the status quo is better.

In the 2000s I used this effect to make changes to AdSense text colors. People ignored the ads, but if I changed the colors on some cadence, more people clicked on them. Measurable difference in income.


Hi

Have you looked into these two?

- Trustworthy Online Controlled Experiments by Kohavi, Tang, and Xu

- Statistical Methods in Online A/B Testing by Georgi Georgiev

Recommended by stats stackexchange (https://stats.stackexchange.com/questions/546617/how-can-i-l...)

There are a bunch of other books/courses/videos on O'Reilly.

Another potential way to approach this learning goal is to look at Evan's tools (https://www.evanmiller.org/ab-testing/) and go into each one and then look at the JS code for running the tools online.

See if you can go through and comment/write out your thoughts on why it's written that way. Of course, you'll have to know some JS for that, but it might be helpful to go through a file like (https://www.evanmiller.org/ab-testing/sample-size.js) and figure out what math is being done.
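
For reference, here's a rough sketch of the kind of math such a calculator does: the standard normal-approximation sample-size formula for comparing two conversion rates. This is illustrative, not Evan's actual code.

    # Per-group sample size for a two-sided two-proportion z-test
    # (normal approximation), the sort of calculation a sample-size
    # calculator performs.
    from math import ceil, sqrt
    from scipy.stats import norm

    def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
        """n per arm to detect a change in conversion rate from p1 to p2."""
        z_a = norm.ppf(1 - alpha / 2)          # two-sided critical value
        z_b = norm.ppf(power)                  # quantile for the desired power
        p_bar = (p1 + p2) / 2
        numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                     + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
        return ceil((numerator / (p2 - p1)) ** 2)

    # e.g. baseline 20% conversion, smallest effect we care about is +2pp:
    print(sample_size_per_group(0.20, 0.22))   # about 6,500 users per arm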


PS - if you are looking for more of the academic side (cutting edge, much harder statistics), you can start to look at recent work people are doing with A/B tests like this paper -> https://arxiv.org/abs/2002.05670


Even more!

Have you seen this video - https://www.nber.org/lecture/2024-methods-lecture-susan-athe...

Might be interesting to you.


I’ll second Trustworthy Online Controlled Experiments. Fantastic read and Ron Kohavi is worth a follow on LinkedIn as he’s quite active there and usually sharing some interesting insights (or politely pointing out poor practices).


Speaking of Georgi Georgiev, I can't recommend his A/B testing tools at https://www.analytics-toolkit.com enough.

Being able to tell when an experiment has entered the Zone of Futility has been super valuable.


Early in the A-B craze (optimal shade of blue nonsense), I was talking to someone high up with an online hotel reservation company who was telling me how great A-B testing had been for them. I asked him how they chose stopping point/sample size. He told me experiments continued until they observed a statistically significant difference between the two conditions.

The arithmetic is simple and cheap. Understanding basic intro stats principles, priceless.


> He told me experiments continued until they observed a statistically significant difference between the two conditions.

Apparently, if you do the observing the right way, that can be a sound approach. From https://en.wikipedia.org/wiki/E-values:

“We say that testing based on e-values remains safe (Type-I valid) under optional continuation.”


This is correct. There's been a lot of interest in e-values and non-parametric confidence sequences in recent literature. It's usually referred to as anytime-valid inference [1]. Evan Miller explored a similar idea in [2]. For some practical examples, see my Python library [3] implementing multinomial and time-inhomogeneous Bernoulli / Poisson process tests based on [4]. See [5] for linear models / t-tests. A toy sketch of the core idea follows the references.

[1] https://arxiv.org/abs/2210.0194

[2] https://www.evanmiller.org/sequential-ab-testing.html

[3] https://github.com/assuncaolfi/savvi/

[4] https://openreview.net/forum?id=a4zg0jiuVi

[5] https://arxiv.org/abs/2210.08589
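
To give a flavour of the core idea (a toy sketch, far simpler than anything in the library above): the running likelihood ratio against a point null is an e-process, so by Ville's inequality you can peek after every observation and stop whenever it crosses 1/alpha without inflating the Type-I error.

    # Toy e-process: test H0: p = 0.5 on a Bernoulli stream against a fixed
    # alternative q. Under H0 the running likelihood ratio is a nonnegative
    # martingale with mean 1, so P(sup_n E_n >= 1/alpha) <= alpha -- you may
    # stop the first time E_n crosses 1/alpha.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha, q = 0.05, 0.6                    # level and fixed alternative
    e_value, n = 1.0, 0

    for x in rng.binomial(1, 0.62, size=10_000):   # simulated true rate 0.62
        n += 1
        e_value *= (q ** x * (1 - q) ** (1 - x)) / 0.5   # per-observation LR
        if e_value >= 1 / alpha:
            print(f"reject H0 after {n} observations, e-value {e_value:.1f}")
            break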


Did you link the thing that you intended to for [1]? I can't find anything about "anytime-valid inference" there.


Thanks for noting! This is the right link for [1]: https://arxiv.org/abs/2210.01948


Sounds like you already know this, but that's not great and will give a lot of false positives. In science this is called p-hacking. The rigorous way to use hypothesis testing is to calculate the sample size for the expected effect size and run only one test, once that sample size is reached. But this requires knowing the effect size.

If you are doing a lot of significance tests you need to adjust the p-level by dividing by the number of implicit comparisons (a Bonferroni correction), so e.g. only accept p < 0.001 if running one test per day.

Alternatively, just do Thompson sampling until one variant dominates.
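
To make the false-positive point concrete, here's a quick simulation sketch (traffic numbers are hypothetical): run A/A tests with no real difference, peek daily, and stop at the first p < 0.05. The "hit rate" comes out far above the nominal 5%.

    # Simulated A/A experiments with a daily peek at a two-proportion z-test.
    # Declaring a winner the first time p < 0.05 shows how optional stopping
    # inflates the nominal 5% false-positive rate.
    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    rng = np.random.default_rng(42)
    days, users_per_day, rate = 30, 1_000, 0.10
    false_positives, n_sims = 0, 500

    for _ in range(n_sims):
        a = rng.binomial(users_per_day, rate, size=days).cumsum()
        b = rng.binomial(users_per_day, rate, size=days).cumsum()
        n = users_per_day * np.arange(1, days + 1)
        for day in range(days):                       # peek at the end of each day
            _, p = proportions_ztest([a[day], b[day]], [n[day], n[day]])
            if p < 0.05:
                false_positives += 1
                break

    print(f"false-positive rate: {false_positives / n_sims:.0%}")  # well above 5%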


To expand: the p-value tells you significance (more precisely, the likelihood of the effect if there were no underlying difference). But if you observe it over and over again and pay attention to one value, you've subverted the measure.

Thompson/multi-armed bandit optimizes for outcome over the duration of the test, by progressively altering the treatment %. The test runs longer, but yields better outcomes while doing it.

It's objectively a better way to optimize, unless there is time-based overhead to the existence of the A/B test itself. (E.g. maintaining two code paths.)
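
For anyone curious what that looks like in practice, here's a minimal Thompson-sampling sketch for two variants with Bernoulli conversions (made-up rates, Beta(1,1) priors):

    # Minimal Thompson sampling over two variants. Each arm keeps a Beta
    # posterior over its conversion rate; each user goes to the arm whose
    # sampled rate is highest, so traffic drifts toward the better variant.
    import numpy as np

    rng = np.random.default_rng(1)
    true_rates = [0.10, 0.12]              # hypothetical conversion rates
    successes = np.ones(2)                 # Beta(1, 1) priors
    failures = np.ones(2)

    for _ in range(50_000):
        sampled = rng.beta(successes, failures)     # one posterior draw per arm
        arm = int(np.argmax(sampled))               # assign this user
        converted = rng.random() < true_rates[arm]
        successes[arm] += converted
        failures[arm] += 1 - converted

    pulls = successes + failures - 2
    print("traffic share:", pulls / pulls.sum())
    print("posterior means:", successes / (successes + failures))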


I just wanted to affirm what you are doing here.

A key point here is that p-values optimize for detection of effects if you do everything right, which, as you point out, is not common.

> Thompson/multi-armed bandit optimizes for outcome over the duration of the test.

Exactly.


The p value is the risk of getting an effect specifically due to sampling error, under the assumption of perfectly random sampling with no real effect. It says very little.

In particular, if you aren't doing perfectly random sampling it is meaningless. If you are concerned about other types of error than sampling error it is meaningless.

A significant p-value is nowhere near proof of effect. All it does is suggestively wiggle its eyebrows in the direction of further research.


> likelihood of the effect if there were no underlying difference

By "effect" I mean "observed effect"; i.e. how likely are those results, assuming the null hypothesis.


Many years ago I was working for a large gaming company and I was the one who developed an efficient and cheap way to split any cluster of users into A/B groups. The company was extremely happy with how well that worked. However, I did some investigation on my own a year later to see how the business development people were using it and... yeah, pretty much what you said. They were literally brute-forcing different configurations until they (more or less) got the desired results.


Microsoft has a seed finder specifically aimed at avoiding a priori bias in experiment groups, but IMO the main effect is pushing whales (which are possibly bots) into different groups until the bias evens out.

I find it hard to imagine obtaining much bias from a random hash seed in a large group of small-scale users, but I haven't looked at the problem closely.


We definitely saw bias, and it made experiments hard to launch until the system started pre-identifying unbiased population samples ahead of time, so the experiment could just pull pre-vetted users.


This is a form of "interim analysis" [1].

[1] https://en.wikipedia.org/wiki/Interim_analysis


And yet this is the default. As commonly implemented, a/b testing is an excellent way to look busy, and people will actively resist changing processes to make them more reliable.

I think this is not unrelated to the fact that if you wait long enough you can get a positive signal from a neutral intervention, so you can literally shuffle chairs on the Titanic and claim success. The incentives are against accuracy because nobody wants to be told that the feature they've just had the team building for 3 months had no effect whatsoever.


This is surely more optimal if you do the statistics right? I mean I'm sure they didn't but the intuition that you can stop once there's sufficient evidence is correct.


Bear in mind many people aren’t doing the statistics right.

I’m not an expert but my understanding is that it’s doable if you’re calculating the correct MDE based on the observed sample size, though not ideal (because sometimes the observed sample is too small and there’s no way round that).

I suspect the problem comes when people don’t adjust the MDE properly for the smaller sample. Tools help but you’ve gotta know about them and use them ;)

Personally I’d prefer to avoid this and be a bit more strict due to something a PM once said: “If you torture the data long enough, it’ll show you what you want to see.”


Perhaps he was using a sequential test.


Which company was this? was it by chance SnapTravel?


Experimentation for Engineers: From A/B Testing to Bayesian Optimization, by David Sweet

This book is really great and I highly recommend it. It goes broader than A/B, but covers everything quite well from a first-principles perspective.

https://www.manning.com/books/experimentation-for-engineers



I am a novice in this domain but loved the interactive format of TigYog and found it really helpful.

Also, this reminds me I need to finish the course!


My blog has tons of articles about A/B testing, with math and Python code to illustrate. Good starting point:

https://bytepawn.com/five-ways-to-reduce-variance-in-ab-test...


Just as some basic context, there are two related approaches to A/B testing. The first comes from statistics, and is going to look like standard hypothesis testing of differences of means or medians. The second is from Machine Learning and is going to discuss multi-armed bandit problems. They are both good and have different tradeoffs. I just wanted you to know that there are two different approaches that are both valid.
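
For the first flavour, the workhorse is something like a two-proportion z-test; a sketch with made-up counts:

    # The "statistics" flavour in its simplest form: a two-proportion z-test
    # comparing conversion rates between control and treatment (made-up counts).
    from statsmodels.stats.proportion import (confint_proportions_2indep,
                                              proportions_ztest)

    conversions = [530, 584]       # control, treatment
    users = [10_000, 10_000]

    z, p_value = proportions_ztest(conversions, users)
    low, high = confint_proportions_2indep(conversions[1], users[1],
                                           conversions[0], users[0])
    print(f"z = {z:.2f}, p = {p_value:.3f}")
    print(f"95% CI for the lift (treatment - control): [{low:.4f}, {high:.4f}]")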


I once wanted a structured approach before I had access to large amounts of traffic. Once I had traffic available, the learning naturally happened (background in engineering with advanced math). If you are lucky enough to start learning through hands on experience, I’d check out: https://goodui.org/

I was lucky to get trained well by 100m+ users over the years. If you have a problem you are trying to solve, I’m happy to go over my approach to designing optimization winners repeatedly.

Alex, I will shoot you an email shortly. Also, sebg's comment is good if you are looking for more of the academic route to learning.


A/B Testing

An interactive look at Thompson sampling

https://everyday-data-science.tigyog.app/a-b-testing


I'd also like to mention the classic book "Reinforcement Learning" by Sutton & Barto, which goes into some relevant mathematical aspects of choosing the "best" among a set of options. The full PDF is available for free on their website [1]. Chapter 2, on "Multi-Armed Bandits", is where to start.

[1] http://incompleteideas.net/book/the-book-2nd.html
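
As a taste of what's in chapter 2, the simplest algorithm there is epsilon-greedy action selection; a sketch (my own, with made-up arm payoffs):

    # Epsilon-greedy action selection on a toy Bernoulli bandit, in the spirit
    # of Sutton & Barto ch. 2: exploit the best-looking arm most of the time,
    # explore a random arm with probability epsilon.
    import numpy as np

    rng = np.random.default_rng(2)
    true_rates = [0.10, 0.11, 0.13]        # hypothetical arm payoffs
    epsilon = 0.1
    counts = np.zeros(3)
    estimates = np.zeros(3)                # sample-average value estimates

    for _ in range(100_000):
        if rng.random() < epsilon:
            arm = int(rng.integers(3))               # explore
        else:
            arm = int(np.argmax(estimates))          # exploit
        reward = float(rng.random() < true_rates[arm])
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

    print("pulls per arm:", counts)
    print("estimated rates:", estimates.round(3))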


If you'd rather go through some of this live, we have a section on Stats for Growth Engineers in the Growth Engineering Course on Reforge (course.alexeymk.com). We talk through stat sig, power analysis, common experimentation footguns and alternate methodologies such as Bayesian, Sequential, and Bandits (which are typically Bayesian). Running next in October.

Other than that, Evan's stuff is great, and the Ron Kohavi book gets a +1, though it is definitely dense.


Do you have another link? That one's not working for me.


For learning about the basics of statistics, my go-to resource is "Discovering Statistics using [R/SPSS]" (Andy Field). "Improving Your Statistical Inferences" (Daniel Lakens) needs some basics, but covers a lot of interesting topics, including sequential testing and equivalence tests (sometimes you want to know if a new thing is equivalent to the old).


When I used to do A/B testing, all results per traffic funnel were averaged over time into cumulative results. The tests would run as long as they needed to attain statistical confidence between the funnels, where confidence was the ratio of differentiation between results over time after discounting for noise and variance.

Only at test completion were financial projections attributed to test results. Don’t sugar coat it. Let people know up front just how damaging their wonderful business ideas are.

The biggest learning from this is that the financial projections from the tests were always far too optimistic compared to future development in production. The tests were always correct. The cause of the discrepancies was shitty development. If a new initiative to production is defective or slow it will not perform as well as the tests projected. Web development is full of shitty developers who cannot program for the web, and our tests were generally ideal in their execution.


In my experience the most helpful and generalizable resources have been resources on “experimental design” in biology, and textbooks on linear regression in the social sciences. (Why these fields is actually an interesting question but I don’t feel like getting into it.)

A/B tests are just a narrow special case of these.


A/B testing misses the point of statistical design of experiments, which is that your variables can interact. One-factor-at-a-time experiments are pretty much guaranteed to stick you in a local maximum.
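
To make the interaction point concrete, a full factorial layout fits both factors and their interaction in one model. A sketch with simulated data (the factor names and effect sizes are made up):

    # 2x2 factorial sketch: two factors tested jointly with an interaction
    # term, instead of one-factor-at-a-time A/B tests. Data is simulated;
    # in this toy setup each change helps alone but they clash together.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 20_000
    button = rng.integers(0, 2, n)         # 0 = old button, 1 = new button
    headline = rng.integers(0, 2, n)       # 0 = old headline, 1 = new headline
    rate = 0.10 + 0.01 * button + 0.01 * headline - 0.03 * button * headline
    df = pd.DataFrame({
        "converted": (rng.random(n) < rate).astype(int),
        "button": button,
        "headline": headline,
    })

    model = smf.logit("converted ~ button * headline", data=df).fit()
    print(model.summary())   # button:headline is the interaction term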


There's that, and also all of the common pitfalls highlighted in this famous paper: https://journals.plos.org/plosmedicine/article?id=10.1371/jo...

I do believe "doing A/B testing" is probably better than "not doing A/B testing", more often than not, but I think non-statisticians are usually way too comfortable with their knowledge (or lack thereof). And I have very little faith in the vast majority of A/B experiments run by people who don't know much about stats.


I really liked Richard McElreath’s Statistical rethinking https://youtube.com/playlist?list=PLDcUM9US4XdPz-KxHM4XHt7uU...


In my experience, there’s just not much depth to the math behind A/B testing. It all comes down to whether A or B affects parameter X without negatively affecting Y. This is all basic analysis stuff.

The harsh reality is A/B testing is only an optimization technique. It’s not going to fix fundamental problems with your product or app. In nearly everything I’ve done, it’s been a far better investment to focus on delivering more features and more value. It’s much easier to build a new feature that moves the needle by 1% than it is to polish a turd for 0.5% improvement.

That being said, there are massive exceptions to this. When you’re at scale, fractions of percents can mean multiple millions of dollars of improvements.


I worked for Ron Kohavi - he has a couple books. "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO", and "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing". I haven't read the second, but the first is easy to find and peruse.


No specific literature to recommend, but understanding sample size and margin of error/confidence interval calculations will really help you understand A/B testing. Beyond A/B, this will help with multivariate testing as well, which has mostly replaced A/B in orgs that are serious about testing.


When seeking to both explore better treatments and exploit good ones, the mathematical formalism often used is a “bandit”.

https://en.m.wikipedia.org/wiki/Multi-armed_bandit


One of my fav resources for binomial experiment evaluation + a lot of explanation: https://thumbtack.github.io/abba/demo/abba.html


It's maybe not what most people would recommend, but I'd suggest you read up on regret minimization and best-arm identification for multi-armed bandit problems. That way it'll probably be useful and fun! =)


Growthbook wrote a short paper on how they evaluate test results continuously.

https://docs.growthbook.io/GrowthBookStatsEngine.pdf


Bradley Efron, The Jackknife, the Bootstrap, and Other Resampling Plans, ISBN 0-89871-179-7, SIAM, Philadelphia, 1982.


Anyone have fun examples of A/B tests you’ve run where the results were surprising or hugely lopsided?


So the thing I always ctrl-F for, to see if a paper or course really knows what it's talking about, is the “multi-armed bandit” problem. Just ctrl-F "bandit": if an A/B tutorial is long enough, it will usually mention them.

This is not a foolproof method; I'd call it only ±5 dB of evidence, so it would shift a 50% likelihood that they know what they're talking about to roughly 75% if present or 25% if absent, but obviously look at the rest of it and see if that's borne out. And to be clear: even mentioning it, if only to dismiss it, counts!

So e.g. I remember reading a whitepaper about “A/B Tests are Leading You Astray” and thinking “hey that's a fun idea, yeah, effect size is too often accidentally conditioned on whether the result was judged statistically significant, which would be a source of bias” ...and sure enough a sentence came up, just innocently, like, “you might even have a bandit algorithm! But you had to use your judgment to discern that that was appropriate in context.” And it’s like “OK, you know about bandits but you are explicitly interested in human discernment and human decision making, great.” So, +5 dB to you.

And on the flip-side if it makes reference to A/B testing but it's decently long and never mentions bandits then there's only maybe a 25% chance they know what they are talking about. It can still happen, you might see e.g. χ² instead of the t-test [because usually you don't have just “converted” vs “did not convert”... can your analytics grab “thought about it for more than 10s but did not convert” etc.?] or something else that piques interest. Or it's a very short article where it just didn't come up, but that's fine because we are, when reading, performing a secret cost-benefit analysis and short articles have very low cost.
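
(On the χ² point: once the outcome has more than two categories per variant, you hand the whole contingency table to a χ² test. A sketch with made-up counts:)

    # Chi-squared test on a variant x outcome contingency table, for when the
    # outcome isn't just converted / didn't convert. Counts are made up.
    from scipy.stats import chi2_contingency

    #           converted  considered >10s  bounced
    table = [[     520,          1900,        7580],   # variant A
             [     585,          2100,        7315]]   # variant B

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4f}")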

For a non-technical thing you can give to your coworkers, consider https://medium.com/jonathans-musings/ab-testing-101-5576de64...

Researching this comment led me to this video, which looks interesting (I'll need to watch it later), about how you have to pin down the time needed to properly make the choices in A/B testing: https://youtu.be/Fs8mTrkNpfM?si=ghsOgDEpp43yRmd8

Some more academic looking discussions of bandit algorithms that I can't vouch for personally, but would be my first stops:

- https://courses.cs.washington.edu/courses/cse599i/21wi/resou...

- https://tor-lattimore.com/downloads/book/book.pdf

- http://proceedings.mlr.press/v35/kaufmann14.pdf



