Statistical significance on a shoestring budget (alexeymk.com)
71 points by AlexeyMK on Sept 14, 2023 | 34 comments



First, you really should move away from frequentist statistical testing and use Bayesian statistics instead. It is a perfect fit for occasions like this, where you want to adjust your beliefs about which UX is best based on empirical data. As you collect data you are increasing confidence in your decision, rather than trying to meet an arbitrary criterion such as a specific p-value.

Second, the “run-in-parallel” approach has a well-defined name in experimental design: factorial design. The diagram shown is an example of a full factorial design, in which each level of each factor is combined with each level of every other factor. The advantage of such a design is that interactions between factors can be tested as well. If there are good reasons to believe that there are no interactions between the different factors, you could use a fractional (partial) factorial design, which has the advantage of fewer total combinations of levels while still allowing estimation of the effects of individual factors.
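
As a rough illustration, enumerating the cells of a full factorial design is just a cross product of the factor levels; the factor names and levels below are invented and may not match the post's diagram:

    # Sketch: enumerate the cells of a full factorial design (factor names invented).
    from itertools import product

    factors = {
        "headline": ["control", "variant"],
        "cta_color": ["blue", "green"],
        "pricing_page": ["old", "new"],
    }

    cells = list(product(*factors.values()))
    print(len(cells))  # 2 * 2 * 2 = 8 combinations; a fractional design would run only a subset
    for cell in cells:
        print(dict(zip(factors, cell)))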


Disagree on using Bayesian statistics. Frequentist statistics are perfect for A/B testing.

There are so many strong biases people have about different parts of UI/UX. One of the significant benefits of A/B testing is that it lets you move ahead as a team and make decisions even when there are strongly differing opinions on your team. In those cases you can just "A/B test" and let the data decide.

But if you are using Bayesian approaches, you'll shift those internal arguments to what the prior should be, and it will be harder to get alignment based on the data.


Not necessarily.

You can present your Bayesian approaches in such a way that it's almost independent of the prior. Your output will be 'this experiment should shift your odds-ratio by so-and-so-many logits in this or that direction' instead of an absolute probability.


You have to make almost exactly the same choices when you rely on frequentist tools. The main difference is that they're pre-made during the development of the tool, so you don't get insight into what they are without studying the theory behind the test.


We can agree to disagree. My claim is actually quite the reverse. For A/B testing specifically, Bayes is much better suited to address the practical questions you would usually have when running A/B experiments. See my response to AlexeyMK below.


Fixing dysfunctional decision making by delegating it to "data", what could go wrong? Might as well flip a coin and save the money.


Thanks for "factorial design"! I'll update the post to use the proper nomenclature.

The frequentist/bayesian debate is not one I understand well enough to opine - do you have any reading you'd recommend for this topic?


I myself am a rather recent convert to Bayesian statistics, for the simple reason that I was trained in and have used frequentist statistics extensively in the past and had no experience with Bayesian statistics. Once you take the time to master the basic tools, it becomes quite straightforward to use. I am currently away from my computer and resources, which makes it difficult to suggest specific readings. As a somewhat shameless plug, you could check the https://www.frontiersin.org/articles/10.3389/fpsyg.2020.0094... paper and the related R package https://cran.r-project.org/web/packages/bayes4psy/index.html and GitHub repository https://github.com/bstatcomp/bayes4psy, which were made to be accessible to users with frequentist statistics experience.

To brutally simplify the distinction: using frequentist statistics and testing, you are addressing the question of whether, based on the results, you can reject the hypothesis that there is no difference between two conditions (e.g., A and B in A/B testing). The p-value broadly gives you the probability of observing data at least as extreme as yours if A and B were sampled from the same distribution. If this is really low, then you can reject the null hypothesis and claim that there is a statistically significant difference between the two conditions.
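
In code, the frequentist version of a conversion A/B test might look roughly like this (the counts are invented for illustration):

    # Frequentist two-proportion z-test on conversion counts (numbers invented).
    from statsmodels.stats.proportion import proportions_ztest

    conversions = [120, 145]    # conversions in A and B
    exposures = [2400, 2380]    # users exposed to A and B
    stat, p_value = proportions_ztest(conversions, exposures)
    print(p_value)  # a low p-value would let you reject "A and B convert at the same rate"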

In comparison, using Bayesian statistics, you can estimate the probability of a specific hypothesis, e.g. the hypothesis that A is better than B. You start with a prior belief (prior) in your hypothesis and then compute the posterior probability, which is the prior adjusted for the additional empirical evidence that you have collected. The results that you get can help you address a number of questions. For instance, (i) what is the probability that in general A leads to better results than B? Or, related (but substantially different), (ii) what is the probability that in any specific case using A gives you a higher chance of success than using B? To illustrate the difference: the probability that men in general are taller than women approaches 100%. However, if you randomly pick a man and a woman, the probability that the man will be taller than the woman is substantially lower.

In your A/B testing, if the cost of A is higher, addressing question (ii) would be more informative than question (i). You can be quite sure that A is in general better than B; however, is the difference big enough to offset the higher cost?

Related to that, in Bayesian statistics you can define a Region of Practical Equivalence (ROPE) - in short, the range of differences between A and B that could be due to measurement error, or that would in practice make no difference. You can then check in what proportion of cases the difference falls within the ROPE. If that proportion is high enough (e.g. 90%), you can conclude that in practice it makes no difference whether you use A or B. In frequentist terms, Bayes allows you to confirm a null hypothesis, something that is impossible using frequentist significance testing.
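
A back-of-the-envelope sketch of this with a conjugate Beta-Binomial model (the counts, the flat Beta(1,1) prior, and the 0.5-percentage-point ROPE are all invented for illustration; this is not the bayes4psy approach specifically):

    # Bayesian A/B on invented conversion counts: Beta(1,1) prior + Binomial likelihood.
    import numpy as np

    rng = np.random.default_rng(0)
    post_a = rng.beta(1 + 120, 1 + 2400 - 120, size=100_000)  # posterior of A's rate
    post_b = rng.beta(1 + 145, 1 + 2380 - 145, size=100_000)  # posterior of B's rate

    diff = post_b - post_a
    print((diff > 0).mean())             # P(B's conversion rate is higher than A's)
    rope = 0.005                         # +/- 0.5pp counts as "no practical difference"
    print((np.abs(diff) < rope).mean())  # share of the posterior inside the ROPE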

In regards to priors - which another person has mentioned - if you do not have a specific reason to believe beforehand that A might be better than B or vice versa, you can use a relatively uninformative prior, basically saying, "I don't really have a clue which might be better." So the issue of priors should not discourage you from using Bayesian statistics.


Any similar GitHub for Python you recommend?


Building your own Bayesian model with something like pymc3 is also a very reasonable approach to take with small data, or data with too much variance to detect effects in a timely manner. It also forces you to think about the underlying distributions that generate your data, which is an exercise that can yield interesting insights in itself.
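
A minimal PyMC (v3-style) sketch of that kind of model, with placeholder data and flat priors chosen only for illustration:

    # Two-arm conversion model in pymc3 (placeholder data, flat priors).
    import pymc3 as pm

    n_a, n_b = 2400, 2380
    conv_a, conv_b = 120, 145

    with pm.Model():
        p_a = pm.Beta("p_a", alpha=1, beta=1)
        p_b = pm.Beta("p_b", alpha=1, beta=1)
        pm.Binomial("obs_a", n=n_a, p=p_a, observed=conv_a)
        pm.Binomial("obs_b", n=n_b, p=p_b, observed=conv_b)
        pm.Deterministic("uplift", p_b - p_a)
        trace = pm.sample(2000, tune=1000, return_inferencedata=True)
    # trace.posterior["uplift"] then holds the full posterior of B's lift over A.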


[Author here] Heh - yes but don't, though...

Yes: you could use bayesian priors and a custom model to give yourself more confidence from less data. But...

Don't: for most businesses that are so early they can't get enough users to hit stat-sig, you're likely better off pointing your engineering effort at making the product better instead of building custom statistical models. This is nerd-sniping-adjacent (https://xkcd.com/356/), a common trap engineers fall into: it's more fun to solve the novel technical problem than the actual business problem.

Though: there is a small set of companies with large scale but small data, for whom the custom stats approaches _do_ make sense. When I was at Opendoor, even though we had billions of dollars of GMV, we only bought a few thousand homes a month, so the Data Science folks used fun statistical approaches like pair matching (https://www.rockstepsolutions.com/blog/pair-matching/) and CUPED (now available off the shelf - https://www.geteppo.com/features/cuped) to squeeze a bit more signal out of less data.
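
For the curious, the core of CUPED is a one-line covariate adjustment; the sketch below uses simulated data and a made-up pre-period metric purely to show the mechanics:

    # CUPED: reduce metric variance using a pre-experiment covariate (simulated data).
    import numpy as np

    def cuped_adjust(y, x):
        # y: in-experiment metric per user; x: same metric for the same user, pre-experiment
        theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        return y - theta * (x - x.mean())

    rng = np.random.default_rng(1)
    x = rng.normal(100, 20, size=5000)           # pre-period value per user
    y = 0.8 * x + rng.normal(0, 10, size=5000)   # in-experiment value, correlated with x
    y_adj = cuped_adjust(y, x)
    print(np.var(y), np.var(y_adj))  # same mean, much lower variance -> faster stat-sig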


I love fitting models.

I always say that in my profession I will fit models for free; it's having to clean data and "finish" a project that I demand payment for.


...and pictures in a format the journal likes....


That works for a website. Doesn't work as well for direct mail


> Gut Check: Especially if you’re off by quite a bit, this is a chance to take a step back and ask whether the company has reached growth scale or not. It could be that there are plenty of obvious 0-1 tactics left. Not everything has to be an experiment.

This is a key point, imo. I have a sneaking suspicion that a lot of companies are running "growth teams" that don't have the scale where it actually makes sense to do so.


Everything has to be a test early on, but not every test has to rely on random-split-based statistical significance to make a decision. “Would you pay $20 for this?” is a classic way to judge whether your service has product-market fit, and it’s not about sample sizes, not initially.

Some growth teams use simpler, more exploratory approaches to find something that resonates. Others rely on A/B tests. Different profiles, but both are "growth teams".


How are A/B tests for non-logged-in users doing these days, with all the cookie banners, privacy and ad blockers, etc.? Is this stuff even still working?


Yes, but it is _rough_. What actually hurts is "browse on mobile, buy on desktop" type behavior.

Still worth doing, but you end up needing more black magic than you'd like (IP-based assignment, Ad Network-sourced assignment, CDN proxies for Analytics tools, etc).

Working on a separate post about that.


There's an argument to be made that, so long as your testing fully encompasses all visitors to your site, you aren't sampling the population, you're fully observing it, and statistical significance is irrelevant.


Sites are always getting new visitors, losing old ones and the ones they’ve observed return irregularly (or commonly, or somewhere in between). So it’s not realistic to assume a given sample of visitors is the population.


If future users are markedly different than past users, a p-value isn't going to help you here.

It's not unreasonable to assume it's a sample; I just don't think it's worth getting paralyzed by worrying about whether or not you have power, or resorting to hacky tricks to try to fix it.

...but most power calculations are also sort of bullshit.


It may be that after analyzing the data we still have substantial uncertainty; it just depends on the process's inherent variability and what the data provides in terms of information (a function of the skill of the person determining what should be collected).


That argument is missing that you are using past users’ behaviour as representative of future users’ preferences. You are not sampling marbles in a jar, but making a lot of assumptions, notably about continuity.


This is also an assumption of any approach that uses statistical analysis.


“Using modern experiment frameworks, all 3 of ideas can be safely tested at once, using parallel A/B tests (see chart).”

Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments, so your significance calculation is now off. Second, xkcd 882.


> Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments

I get that a bunch from some of my clients. It's a common misconception. Let's say experiment B is 10% better than control, but we're also running experiment C at the same time. Since C's participants are evenly distributed across B's branches, by default they should have no impact on the other experiment.

If you do a pre/post comparison, you'll notice that for whatever reason, both branches of C are doing 5% better than prior time periods, and this is because half of them are in the winner branch of B.

NOW - imagine that the C variant is only an improvement _if_ you also include the B variant. That's where you need to be careful about monitoring experiment interactions, as I called out in the guide. But better to spend a half day writing an "experiment interaction" query than two weeks waiting for the experiments to run in sequence.
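
The check itself can be quick. Roughly speaking, with invented column names and pandas standing in for whatever query layer you use:

    # Rough experiment-interaction check (column names invented).
    import pandas as pd

    df = pd.read_csv("assignments_with_outcomes.csv")  # user_id, exp_b, exp_c, converted
    print(pd.crosstab(df["exp_b"], df["exp_c"], normalize="index"))  # are buckets evenly mixed?
    print(df.groupby(["exp_b", "exp_c"])["converted"].mean())        # does B's lift hold in each C branch?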

> Second, xkcd 882 (https://xkcd.com/882/)

I think you're referencing p-hacking, right?

That is a valid concern to be vigilant about. In this case, the xkcd is calling out the "find a subgroup that happens to be positive" hack (also here: https://xkcd.com/1478/). However, here we're testing (a) 3 different ideas and (b) each of them only once, on the entire population. No p-hacking here (as far as I can tell; happy to learn otherwise), but good that you're keeping an eye out for it.


The more experiments you run in parallel, the more likely it becomes that at least one experiment's branches do not have an even distribution across all branches of all (combinations of) other experiments.

And the more experiments you run, whether in parallel or sequentially, the more likely you are to get at least one false positive, which is the multiple-comparisons problem behind p-hacking. The xkcd uses "find a subgroup that happens to be positive" to make it funnier, but it's simply "find an experiment that happens to be positive". To correct for it, you would have to lower your significance threshold for each experiment, requiring a larger sample size and negating the benefits you thought you were getting by running more experiments on the same samples.


... and one such correction is the (simple, conservative, underused) Bonferroni Correction.
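
The correction itself is one line: test each of the m hypotheses at alpha / m instead of alpha. A tiny sketch with made-up p-values:

    # Bonferroni: with m tests, require p < alpha / m instead of p < alpha.
    p_values = [0.04, 0.01, 0.32]  # made-up results from 3 parallel experiments
    alpha = 0.05
    m = len(p_values)
    print([p < alpha / m for p in p_values])  # only the 0.01 result survives 0.05 / 3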


Super helpful - looked it up, will aim to apply next time!

Curious how the Bonferroni correction applies in cases where the overlap is partial - i.e., experiment A ran from days 1 to 14, and experiment B ran (on the same group) from days 8 to 21. Do you just apply the correction as if there was full overlap?


I believe you would apply the correction for every comparison you make regardless of the conditions. It's a conservative default to avoid accidentally p-hacking.

There might be other, more specific corrections that give you more power in a particular case. I don't know about those; I went Bayesian somewhere around this point myself.


There are a bunch of procedures under the label of family-wise error rate correction; some have issues in situations with non-independence (Bonferroni can handle any dependency structure, I think).

If there are a lot of tests/comparisons, you could also look at controlling the false discovery rate instead (usually increases power at the expense of more type I errors).
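
Both families are a one-liner with statsmodels, if that helps; the p-values below are made up:

    # Family-wise corrections vs. false-discovery-rate control (made-up p-values).
    from statsmodels.stats.multitest import multipletests

    p_values = [0.04, 0.01, 0.32]
    for method in ("bonferroni", "holm", "fdr_bh"):
        reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
        print(method, reject, p_adj.round(3))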


Thanks, that is a well reasoned argument!

My take is that for small n (say 5 experiments at once), lots of subjects (>10k participants per branch), and a decent hashing algorithm, the risk of uneven bucketing remains negligible. Is my intuition off?

False positives for experiments are definitely something to keep an eye on. The question to ask is what our comfort level is for trading off false positives against velocity. This feels similar to the IRB debate to me, where being too restrictive hurts progress more than it prevents harm.


No, the risk of uneven bucketing of more than 1% is minimal, and even when it happens, the contamination is much smaller than other factors. It's also trivial to monitor at small scales.
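
To put a rough number on it (assuming an unbiased hash, so each participant lands in either branch of the other experiment independently with probability 1/2):

    # Probability that 20,000 coin-flip assignments end up more than 1 point off 50/50.
    from scipy.stats import binom

    n = 20_000
    p_skew = binom.cdf(int(0.49 * n), n, 0.5) + binom.sf(int(0.51 * n), n, 0.5)
    print(p_skew)  # on the order of half a percent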

False positives do happen (Twyman's law is the most common way to describe the problem: an underpowered experiment with spectacular results). The best solution is to ask whether the results make sense using product intuition, and to continue running the experiment if not.

They are more likely to happen with very skewed observations (like how much people spend on a luxury brand), so if you have a goal metric that is skewed at the unit level, maybe think about statistical correction, or bootstrapping confidence intervals.


You are confusing:

a. the family-wise error rate (FWER, which is what xkcd 882 is about) and the many solutions for multiple comparison correction (MCC: Bonferroni, Holm-Sidak, Benjamini-Hochberg, etc.) with

b. contamination or interaction: your two variants are not equivalent because one has 52% of its members in the control group of another experiment, while the other variant has 48%.

FWER is a common concern among statisticians when testing, but one with simple solutions. Contamination is a frequent concern among stakeholders, but it is very rare to observe even with a small sample size, and it even more rarely has a meaningful impact on results. Let's say you have a 4% overhang and the other experiment has a remarkably large 2% impact on a key metric: the contamination is only 4% * 2% = 0.08%.

It is a common concern and, therefore, needs to be discussed, but as Lukas Vermeer explained here [0], the solutions are simple and not frequently needed.

[0] https://www.lukasvermeer.nl/publications/2023/04/04/avoiding...



