Ask HN: Resources about math behind A/B testing
207 points by alexmolas 87 days ago | 55 comments
I've been learning more about A/B testing over the last few months. I've read almost all the work by Evan Miller, and I've enjoyed it a lot. However, I'd like a more structured approach to the topic, since sometimes I feel I'm missing some basics. I have good math knowledge and pretty decent stats foundations. What are your favourite books/papers on this topic?



I don't think the mathematics is what gets most people into trouble. You can get by with relatively primitive maths, and the advanced stuff is really just a small-order-of-magnitude cost optimisation.

What gets people are incorrect procedures. To get a sense of all the ways in which an experiment can go wrong, I'd recommend reading more traditional texts on experimental design, survey research, etc.

- Donald Wheeler's Understanding Variation should be mandatory reading for almost everyone working professionally.

- Deming's Some Theory of Sampling is really good and covers more ground than the title lets on.

- Deming's Sample Design in Business Research I remember being formative for me also, although it was a while since I read it.

- Efron and Tibshirani's Introduction to the Bootstrap gives an intuitive sense of some experimental errors from a different perspective.

I know there's one book covering survey design I really liked but I forget which one it was. Sorry!


I’m also looking for a good resource on survey design. If you remember the book, please let us know! :)


I know the Deming books are written to a large extent from the perspective of surveys, but they are mainly technical.

I have also read Robinson's Designing Quality Survey Questions which I remember as good, but perhaps not as deep as I had hoped. I don't think that's the one I'm thinking of, unfortunately.

It's highly possible I'm confabulating a book from a variety of sources also...


I've seen procedural errors over and over. Such errors often come down to our temptation to see numbers as objective truth that doesn't require deeper thought. In very narrowly defined scopes, that might be true? But in complicated matters, that rarely holds and it's up to us to keep ourselves honest.

For example, if an experiment runs for a while and there is no statistically significant difference between cohorts, what do we do? It's a tie, but so often the question gets asked "which cohort is 'directionally' better?" The idea is that we don't know how much better it is, but whichever is ahead must still be better. That reasoning doesn't work unless there is something special about a difference of exactly zero in your particular case. Many of us are not comfortable with the idea of a statistical tie (e.g., the 2000 US Presidential election: the outcome was within the margin of error for counting votes, there was no procedure for handling that, and irrationality ensued). So the cohort that's ahead must be better, even if the difference isn't statistically significant, right? We don't know whether it is or not, but declaring it so satisfies our need for simplicity and order.

Ties should be acknowledged, and tie breakers should be used and be something of value. Which cohort is easier to build future improvements upon? Which is easier to maintain? Easier to understand? Cheaper to operate? Things like that make good tie breakers. And it's worth a check that there wasn't a bug that made the cohorts have identical behavior.

Another example of a procedural mistake is shipping a cohort the moment a significant difference is seen. Take the case of changing a button location. It's possible that the new location is much better. But the first week of the experiment might show that the status quo is better. Why? Users had long ago memorized the old location and expect it to be there. Now they have to find the new location. So the new location might perform worse initially, but be better in the steady state. If you aren't thinking things like that through (e.g., "Does one cohort have a short term advantage for some reason? Is that advantage a good thing?") and move too quickly, you'll be led astray.

To bring this last one back to the math a little more closely: a core issue here is that we have to use imperfect proxy metrics. We want to measure which button location is "better". "Better" isn't quantifiable. It has side effects, so we try to measure those. Does the user click more quickly, or more often, or buy more stuff, or ...? That doesn't mean better, but we hope it is caused by better. But maybe only in the long run as outlined above. Many experiments in for-profit corporate settings would ideally be measured in infinite time horizon profit. But we can't measure or act on that, so we have to pick proxies, and proxies have issues that require careful thought and careful handling.


> Take the case of changing a button location. It's possible that the new location is much better. But the first week of the experiment might show that the status quo is better.

In the 2000s I used this effect to make changes to AdSense text colors. People ignored the ads, but if I changed the colors on some cadence, more people clicked on them. Measurable difference in income.


Hi

Have you looked into these two?

- Trustworthy Online Controlled Experiments by Kohavi, Tang, and Xu

- Statistical Methods in Online A/B Testing by Georgi Georgiev

Recommended by stats stackexchange (https://stats.stackexchange.com/questions/546617/how-can-i-l...)

There are a bunch of other books/courses/videos on O'Reilly.

Another potential way to approach this learning goal is to look at Evan's tools (https://www.evanmiller.org/ab-testing/) and go into each one and then look at the JS code for running the tools online.

See if you can go through and comment/write out your thoughts on why it's written that way. Of course, you'll have to know some JS for that, but it might be helpful to go through a file like (https://www.evanmiller.org/ab-testing/sample-size.js) and figure out what math is being done.
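
For reference, here's a rough sketch of the kind of math such a calculator does: the standard normal-approximation sample-size formula for comparing two conversion rates. This is illustrative, not Evan's actual code.

    # Per-group sample size for a two-sided two-proportion z-test
    # (normal approximation), the sort of calculation a sample-size
    # calculator performs.
    from math import ceil, sqrt
    from scipy.stats import norm

    def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
        """n per arm to detect a change in conversion rate from p1 to p2."""
        z_a = norm.ppf(1 - alpha / 2)          # two-sided critical value
        z_b = norm.ppf(power)                  # quantile for the desired power
        p_bar = (p1 + p2) / 2
        numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                     + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
        return ceil((numerator / (p2 - p1)) ** 2)

    # e.g. baseline 20% conversion, smallest effect we care about is +2pp:
    print(sample_size_per_group(0.20, 0.22))   # about 6,500 users per arm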


PS - if you are looking for more of the academic side (cutting edge, much harder statistics), you can start to look at recent work people are doing with A/B tests like this paper -> https://arxiv.org/abs/2002.05670


Even more!

Have you seen this video - https://www.nber.org/lecture/2024-methods-lecture-susan-athe...

Might be interesting to you.


I’ll second Trustworthy Online Controlled Experiments. Fantastic read and Ron Kohavi is worth a follow on LinkedIn as he’s quite active there and usually sharing some interesting insights (or politely pointing out poor practices).


Speaking of Georgi Georgiev, I can't recommend his A/B testing tools at https://www.analytics-toolkit.com enough.

Being able to tell when an experiment has entered the Zone of Futility has been super valuable.


Early in the A-B craze (optimal shade of blue nonsense), I was talking to someone high up with an online hotel reservation company who was telling me how great A-B testing had been for them. I asked him how they chose stopping point/sample size. He told me experiments continued until they observed a statistically significant difference between the two conditions.

The arithmetic is simple and cheap. Understanding basic intro stats principles, priceless.


> He told me experiments continued until they observed a statistically significant difference between the two conditions.

Apparently, if you do the observing the right way, that can be a sound approach. From https://en.wikipedia.org/wiki/E-values:

“We say that testing based on e-values remains safe (Type-I valid) under optional continuation.”


This is correct. There's been a lot of interest in e-values and non-parametric confidence sequences in recent literature. It's usually referred to as anytime-valid inference [1]. Evan Miller explored a similar idea in [2]. For some practical examples, see my Python library [3] implementing multinomial and time-inhomogeneous Bernoulli / Poisson process tests based on [4]. See [5] for linear models / t-tests. A toy sketch of the core idea follows the references.

[1] https://arxiv.org/abs/2210.0194

[2] https://www.evanmiller.org/sequential-ab-testing.html

[3] https://github.com/assuncaolfi/savvi/

[4] https://openreview.net/forum?id=a4zg0jiuVi

[5] https://arxiv.org/abs/2210.08589
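
To give a flavour of the core idea (a toy sketch, far simpler than anything in the library above): the running likelihood ratio against a point null is an e-process, so by Ville's inequality you can peek after every observation and stop whenever it crosses 1/alpha without inflating the Type-I error.

    # Toy e-process: test H0: p = 0.5 on a Bernoulli stream against a fixed
    # alternative q. Under H0 the running likelihood ratio is a nonnegative
    # martingale with mean 1, so P(sup_n E_n >= 1/alpha) <= alpha -- you may
    # stop the first time E_n crosses 1/alpha.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha, q = 0.05, 0.6                    # level and fixed alternative
    e_value, n = 1.0, 0

    for x in rng.binomial(1, 0.62, size=10_000):   # simulated true rate 0.62
        n += 1
        e_value *= (q ** x * (1 - q) ** (1 - x)) / 0.5   # per-observation LR
        if e_value >= 1 / alpha:
            print(f"reject H0 after {n} observations, e-value {e_value:.1f}")
            break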


Did you link the thing that you intended to for [1]? I can't find anything about "anytime-valid inference" there.


Thanks for noting! This is the right link for [1]: https://arxiv.org/abs/2210.01948


Sounds like you already know this, but that's not great and will give a lot of false positives. In science this is called p-hacking. The rigorous way to use hypothesis testing is to calculate the sample size for the expected effect size and run only one test, once that sample size is reached. But this requires knowing the effect size.

If you are doing a lot of significance tests you need to adjust the p-level by dividing by the number of implicit comparisons (a Bonferroni correction), so e.g. only accept p < 0.001 if running one test per day.

Alternatively, just do Thompson sampling until one variant dominates.
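
To make the false-positive point concrete, here's a quick simulation sketch (traffic numbers are hypothetical): run A/A tests with no real difference, peek daily, and stop at the first p < 0.05. The "hit rate" comes out far above the nominal 5%.

    # Simulated A/A experiments with a daily peek at a two-proportion z-test.
    # Declaring a winner the first time p < 0.05 shows how optional stopping
    # inflates the nominal 5% false-positive rate.
    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    rng = np.random.default_rng(42)
    days, users_per_day, rate = 30, 1_000, 0.10
    false_positives, n_sims = 0, 500

    for _ in range(n_sims):
        a = rng.binomial(users_per_day, rate, size=days).cumsum()
        b = rng.binomial(users_per_day, rate, size=days).cumsum()
        n = users_per_day * np.arange(1, days + 1)
        for day in range(days):                       # peek at the end of each day
            _, p = proportions_ztest([a[day], b[day]], [n[day], n[day]])
            if p < 0.05:
                false_positives += 1
                break

    print(f"false-positive rate: {false_positives / n_sims:.0%}")  # well above 5%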


To expand: the p-value tells you significance (more precisely, the likelihood of the effect if there were no underlying difference). But if you observe it over and over again and pay attention to one value, you've subverted the measure.

Thompson/multi-armed bandit optimizes for outcome over the duration of the test, by progressively altering the treatment %. The test runs longer, but yields better outcomes while doing it.

It's objectively a better way to optimize, unless there is time-based overhead to the existence of the A/B test itself. (E.g. maintaining two code paths.)
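
For anyone curious what that looks like in practice, here's a minimal Thompson-sampling sketch for two variants with Bernoulli conversions (made-up rates, Beta(1,1) priors):

    # Minimal Thompson sampling over two variants. Each arm keeps a Beta
    # posterior over its conversion rate; each user goes to the arm whose
    # sampled rate is highest, so traffic drifts toward the better variant.
    import numpy as np

    rng = np.random.default_rng(1)
    true_rates = [0.10, 0.12]              # hypothetical conversion rates
    successes = np.ones(2)                 # Beta(1, 1) priors
    failures = np.ones(2)

    for _ in range(50_000):
        sampled = rng.beta(successes, failures)     # one posterior draw per arm
        arm = int(np.argmax(sampled))               # assign this user
        converted = rng.random() < true_rates[arm]
        successes[arm] += converted
        failures[arm] += 1 - converted

    pulls = successes + failures - 2
    print("traffic share:", pulls / pulls.sum())
    print("posterior means:", successes / (successes + failures))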


I just wanted to affirm what you are doing here.

A key point here is that p-values optimize for detection of effects if you do everything right, which, as you point out, is not common.

> Thompson/multi-armed bandit optimizes for outcome over the duration of the test.

Exactly.


The p value is the risk of getting an effect specifically due to sampling error, under the assumption of perfectly random sampling with no real effect. It says very little.

In particular, if you aren't doing perfectly random sampling it is meaningless. If you are concerned about other types of error than sampling error it is meaningless.

A significant p-value is nowhere near proof of effect. All it does is suggestively wiggle its eyebrows in the direction of further research.


> likelihood of the effect if there were no underlying difference

By "effect" I mean "observed effect"; i.e. how likely are those results, assuming the null hypothesis.


Many years ago I was working for a large gaming company and I was the one who developed an efficient and cheap way to split any cluster of users into A/B groups. The company was extremely happy with how well that worked. However, I did some investigation on my own a year later to see how the business development people were using it and... yeah, pretty much what you said. They were literally brute-forcing different configurations until they (more or less) got the desired results.


Microsoft has a seed finder specifically aimed at avoiding a priori bias in experiment groups, but IMO the main effect is pushing whales (which are possibly bots) into different groups until the bias evens out.

I find it hard to imagine obtaining much bias from a random hash seed in a large group of small-scale users, but I haven't looked at the problem closely.


We definitely saw bias, and it made experiments hard to launch until the system started pre-identifying unbiased population samples ahead of time, so the experiment could just pull pre-vetted users.


This is a form of "interim analysis" [1].

[1] https://en.wikipedia.org/wiki/Interim_analysis


And yet this is the default. As commonly implemented, a/b testing is an excellent way to look busy, and people will actively resist changing processes to make them more reliable.

I think this is not unrelated to the fact that if you wait long enough you can get a positive signal from a neutral intervention, so you can literally shuffle chairs on the Titanic and claim success. The incentives are against accuracy because nobody wants to be told that the feature they've just had the team building for 3 months had no effect whatsoever.


This is surely more optimal if you do the statistics right? I mean I'm sure they didn't but the intuition that you can stop once there's sufficient evidence is correct.


Bear in mind many people aren’t doing the statistics right.

I’m not an expert but my understanding is that it’s doable if you’re calculating the correct MDE based on the observed sample size, though not ideal (because sometimes the observed sample is too small and there’s no way round that).

I suspect the problem comes when people don’t adjust the MDE properly for the smaller sample. Tools help but you’ve gotta know about them and use them ;)

Personally I’d prefer to avoid this and be a bit more strict due to something a PM once said: “If you torture the data long enough, it’ll show you what you want to see.”


Perhaps he was using a sequential test.


Which company was this? was it by chance SnapTravel?


Experimentation for Engineers: From A/B Testing to Bayesian Optimization, by David Sweet

This book is really great and I highly recommend it. It goes broader than A/B, but covers everything quite well from a first-principles perspective.

https://www.manning.com/books/experimentation-for-engineers



I am a novice in this domain but loved the interactive format of TigYog and found it really helpful.

Also, this reminds me I need to finish the course!


My blog has tons of articles about A/B testing, with math and Python code to illustrate. Good starting point:

https://bytepawn.com/five-ways-to-reduce-variance-in-ab-test...


Just as some basic context, there are two related approaches to A/B testing. The first comes from statistics, and is going to look like standard hypothesis testing of differences of means or medians. The second is from Machine Learning and is going to discuss multi-armed bandit problems. They are both good and have different tradeoffs. I just wanted you to know that there are two different approaches that are both valid.
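
For the first flavour, the workhorse is something like a two-proportion z-test; a sketch with made-up counts:

    # The "statistics" flavour in its simplest form: a two-proportion z-test
    # comparing conversion rates between control and treatment (made-up counts).
    from statsmodels.stats.proportion import (confint_proportions_2indep,
                                              proportions_ztest)

    conversions = [530, 584]       # control, treatment
    users = [10_000, 10_000]

    z, p_value = proportions_ztest(conversions, users)
    low, high = confint_proportions_2indep(conversions[1], users[1],
                                           conversions[0], users[0])
    print(f"z = {z:.2f}, p = {p_value:.3f}")
    print(f"95% CI for the lift (treatment - control): [{low:.4f}, {high:.4f}]")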


I once wanted a structured approach before I had access to large amounts of traffic. Once I had traffic available, the learning naturally happened (background in engineering with advanced math). If you are lucky enough to start learning through hands on experience, I’d check out: https://goodui.org/

I was lucky to get trained well by 100m+ users over the years. If you have a problem you are trying to solve, I’m happy to go over my approach to designing optimization winners repeatedly.

Alex, I will shoot you an email shortly. Also, sebg's comment is good if you are looking for more of the academic route to learning.


A/B Testing

An interactive look at Thompson sampling

https://everyday-data-science.tigyog.app/a-b-testing


I'd also like to mention the classic book "Reinforcement Learning" by Sutton & Barto, which goes into some relevant mathematical aspects of choosing the "best" among a set of options. The full PDF is available for free on their website [1]. Chapter 2, on "Multi-Armed Bandits", is where to start.

[1] http://incompleteideas.net/book/the-book-2nd.html
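
As a taste of what's in chapter 2, the simplest algorithm there is epsilon-greedy action selection; a sketch (my own, with made-up arm payoffs):

    # Epsilon-greedy action selection on a toy Bernoulli bandit, in the spirit
    # of Sutton & Barto ch. 2: exploit the best-looking arm most of the time,
    # explore a random arm with probability epsilon.
    import numpy as np

    rng = np.random.default_rng(2)
    true_rates = [0.10, 0.11, 0.13]        # hypothetical arm payoffs
    epsilon = 0.1
    counts = np.zeros(3)
    estimates = np.zeros(3)                # sample-average value estimates

    for _ in range(100_000):
        if rng.random() < epsilon:
            arm = int(rng.integers(3))               # explore
        else:
            arm = int(np.argmax(estimates))          # exploit
        reward = float(rng.random() < true_rates[arm])
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

    print("pulls per arm:", counts)
    print("estimated rates:", estimates.round(3))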


If you'd rather go through some of this live, we have a section on Stats for Growth Engineers in the Growth Engineering Course on Reforge (course.alexeymk.com). We talk through stat sig, power analysis, common experimentation footguns and alternate methodologies such as Bayesian, Sequential, and Bandits (which are typically Bayesian). Running next in October.

Other than that, Evan's stuff is great, and the Ron Kohavi book gets a +1, though it is definitely dense.


Do you have another link? That one's not working for me.


For learning about the basics of statistics, my go-to resource is "Discovering Statistics using [R/SPSS]" (Andy Field). "Improving Your Statistical Inferences" (Daniel Lakens) needs some basics, but covers a lot of interesting topics, including sequential testing and equivalence tests (sometimes you want to know if a new thing is equivalent to the old).


When I used to do A/B testing, all results per traffic funnel were averaged over time into cumulative results. The tests would run as long as they needed to attain statistical confidence between the funnels, where confidence was the ratio of differentiation between results over time after discounting for noise and variance.

Only at test completion were financial projections attributed to test results. Don’t sugar coat it. Let people know up front just how damaging their wonderful business ideas are.

The biggest learning from this is that the financial projections from the tests were always far too optimistic compared to future development in production. The tests were always correct. The cause of the discrepancies was shitty development. If a new initiative to production is defective or slow it will not perform as well as the tests projected. Web development is full of shitty developers who cannot program for the web, and our tests were generally ideal in their execution.


In my experience the most helpful and generalizable resources have been resources on “experimental design” in biology, and textbooks on linear regression in the social sciences. (Why these fields is actually an interesting question but I don’t feel like getting into it.)

A/B tests are just a narrow special case of these.


A/B testing misses the point of statistical design of experiments, which is that your variables can interact. One-factor-at-a-time experiments are pretty much guaranteed to stick you in a local maximum.
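
To make the interaction point concrete, a full factorial layout fits both factors and their interaction in one model. A sketch with simulated data (the factor names and effect sizes are made up):

    # 2x2 factorial sketch: two factors tested jointly with an interaction
    # term, instead of one-factor-at-a-time A/B tests. Data is simulated;
    # in this toy setup each change helps alone but they clash together.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 20_000
    button = rng.integers(0, 2, n)         # 0 = old button, 1 = new button
    headline = rng.integers(0, 2, n)       # 0 = old headline, 1 = new headline
    rate = 0.10 + 0.01 * button + 0.01 * headline - 0.03 * button * headline
    df = pd.DataFrame({
        "converted": (rng.random(n) < rate).astype(int),
        "button": button,
        "headline": headline,
    })

    model = smf.logit("converted ~ button * headline", data=df).fit()
    print(model.summary())   # button:headline is the interaction term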


There's that, and also all of the common pitfalls highlighted in this famous paper: https://journals.plos.org/plosmedicine/article?id=10.1371/jo...

I do believe "doing A/B testing" is probably better than "not doing A/B testing", more often than not, but I think non-statisticians are usually way too comfortable with their knowledge (or lack thereof). And I have very little faith in the vast majority of A/B experiments run by people who don't know much about stats.


I really liked Richard McElreath’s Statistical rethinking https://youtube.com/playlist?list=PLDcUM9US4XdPz-KxHM4XHt7uU...


In my experience, there’s just not much depth to the math behind A/B testing. It all comes down to whether A or B affects parameter X without negatively affecting Y. This is all basic analysis stuff.

The harsh reality is A/B testing is only an optimization technique. It’s not going to fix fundamental problems with your product or app. In nearly everything I’ve done, it’s been a far better investment to focus on delivering more features and more value. It’s much easier to build a new feature that moves the needle by 1% than it is to polish a turd for 0.5% improvement.

That being said, there are massive exceptions to this. When you’re at scale, fractions of percents can mean multiple millions of dollars of improvements.


I worked for Ron Kohavi - he has a couple books. "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO", and "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing". I haven't read the second, but the first is easy to find and peruse.


No specific literature to recommend, but understanding sample size and margin of error/confidence interval calculations will really help you understand A/B testing. Beyond A/B, this will help with multivariate testing as well, which has mostly replaced A/B in orgs that are serious about testing.


When seeking to both explore better treatments and exploit good ones, the mathematical formalism often used is a “bandit”.

https://en.m.wikipedia.org/wiki/Multi-armed_bandit


One of my fav resources for binomial experiment evaluation + a lot of explanation: https://thumbtack.github.io/abba/demo/abba.html


It's maybe not what most people would recommend, but I'd suggest you read up on regret minimization and best-arm identification for multi-armed bandit problems. That way it'll probably be useful and fun! =)


Growthbook wrote a short paper on how they evaluate test results continuously.

https://docs.growthbook.io/GrowthBookStatsEngine.pdf


Bradley Efron, The Jackknife, the Bootstrap, and Other Resampling Plans, ISBN 0-89871-179-7, SIAM, Philadelphia, 1982.


Anyone have fun examples of A/B tests you’ve run where the results were surprising or hugely lopsided?


So the thing I always ctrl-F for, to see if a paper or course really knows what it's talking about, is the “multi-armed bandit” problem. Just ctrl-F "bandit": if an A/B tutorial is long enough, it will usually mention them.

This is not a foolproof method; I'd call it only ±5 dB of evidence, so it would shift a 50% likelihood that they know what they're talking about to roughly 75% if present or 25% if absent, but obviously look at the rest of it and see if that's borne out. And to be clear: even mentioning it, if only to dismiss it, counts!

So e.g. I remember reading a whitepaper about “A/B Tests are Leading You Astray” and thinking “hey that's a fun idea, yeah, effect size is too often accidentally conditioned on whether the result was judged statistically significant, which would be a source of bias” ...and sure enough a sentence came up, just innocently, like, “you might even have a bandit algorithm! But you had to use your judgment to discern that that was appropriate in context.” And it’s like “OK, you know about bandits but you are explicitly interested in human discernment and human decision making, great.” So, +5 dB to you.

And on the flip-side if it makes reference to A/B testing but it's decently long and never mentions bandits then there's only maybe a 25% chance they know what they are talking about. It can still happen, you might see e.g. χ² instead of the t-test [because usually you don't have just “converted” vs “did not convert”... can your analytics grab “thought about it for more than 10s but did not convert” etc.?] or something else that piques interest. Or it's a very short article where it just didn't come up, but that's fine because we are, when reading, performing a secret cost-benefit analysis and short articles have very low cost.
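
(On the χ² point: once the outcome has more than two categories per variant, you hand the whole contingency table to a χ² test. A sketch with made-up counts:)

    # Chi-squared test on a variant x outcome contingency table, for when the
    # outcome isn't just converted / didn't convert. Counts are made up.
    from scipy.stats import chi2_contingency

    #           converted  considered >10s  bounced
    table = [[     520,          1900,        7580],   # variant A
             [     585,          2100,        7315]]   # variant B

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4f}")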

For a non-technical thing you can give to your coworkers, consider https://medium.com/jonathans-musings/ab-testing-101-5576de64...

Researching this comment led me to this video, which looks interesting (I'll need to watch it later), about how you have to pin down the time needed to properly make the choices in A/B testing: https://youtu.be/Fs8mTrkNpfM?si=ghsOgDEpp43yRmd8

Some more academic looking discussions of bandit algorithms that I can't vouch for personally, but would be my first stops:

- https://courses.cs.washington.edu/courses/cse599i/21wi/resou...

- https://tor-lattimore.com/downloads/book/book.pdf

- http://proceedings.mlr.press/v35/kaufmann14.pdf



