
I disagree. If a person has to make a decision under uncertainty, and a priori favors neither group A nor B, then they might as well use any visitor information available to them to guide their choice.

They just shouldn't be too confident they've made the correct choice.



You are just using noise then. It's not a matter of opinion, it's statistics.


If you are waiting for N observations, so that an NHST will have some level of power, and you assume each observation is drawn from the same distribution (as your test likely does), then you do not see each observation as noise.

You will just be acting under reduced certainty, but if you have to act, any information is better than no information.

(I'd be very interested to hear your statistical explanation).
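(For reference, the sort of power calculation I have in mind above might look roughly like this; the baseline rate, lift, alpha and power are all made-up numbers, just to show where N comes from:)

    # Rough sketch: how many visitors per variant an NHST would need
    # for a given power, using statsmodels. All numbers are illustrative.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.010   # assumed control conversion rate
    variant  = 0.012   # assumed (hoped-for) variant conversion rate
    effect = proportion_effectsize(variant, baseline)   # Cohen's h

    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, ratio=1.0,
        alternative='two-sided')
    print(round(n_per_variant))   # on the order of tens of thousands per arm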


The trouble is disproving the null hypothesis. In your test, if one variant beats another, you take that as a weak signal that one may be better than the other. The data doesn't support this. Without applying a standard to your p-value, you cannot disprove the null hypothesis: that your variant is no better or worse than the other.

I'm not a statistician, but I've run a lot of A/B tests.
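(By "applying a standard to your p-value" I mean something like this sketch, with made-up conversion counts and a plain two-proportion z-test:)

    # Sketch of the usual NHST workflow with made-up counts.
    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    conversions = np.array([30, 45])     # variant A, variant B (illustrative)
    visitors    = np.array([1000, 1000])

    stat, p_value = proportions_ztest(conversions, visitors)
    if p_value < 0.05:                   # the pre-chosen standard
        print("reject the null: the variants likely differ")
    else:
        print("cannot reject the null: the difference may just be noise")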


You're ignoring closed's point that "a priori favors neither group A or B".

If you are starting from a neutral position, considering two possible alternatives with neither presumed to be more favourable than the other, then any statistical test based on using one outcome as null and the other as alternative hypothesis is fundamentally inappropriate. Any such test inherently favours one outcome over the other, rather than starting from a neutral position.

As closed is trying to explain, if you really do start from neutral then even a tiny number of data points is still better than no data at all. You shouldn't have too much confidence in whether you're really making the right decision, but if you have to make a decision, you are still more likely to make the right one if you go with what the data tells you, even if it's only telling you by a very small margin.
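(If it helps, here is a crude simulation of that claim. The true rates, sample size and number of trials are all assumptions; the point is only that picking the observed winner, with a coin flip on ties, does better than a pure coin flip:)

    # Crude Monte Carlo: does "pick the variant with more conversions"
    # beat a coin flip when samples are tiny? All parameters are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    p_a, p_b = 0.012, 0.010        # assume A is truly (slightly) better
    n, trials = 500, 200_000

    conv_a = rng.binomial(n, p_a, trials)
    conv_b = rng.binomial(n, p_b, trials)

    picked_a = np.where(conv_a == conv_b,
                        rng.random(trials) < 0.5,   # coin flip on ties
                        conv_a > conv_b)
    print(picked_a.mean())   # a bit above 0.5, i.e. better than a coin flip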


Ok, so walk me through this in practice.

The way I see it, you need to prove that A is better than B by a sufficient margin to be distinguishable from pure noise.

So, imagine you put up a landing page with 2 variants. Each one gets 500 visitors. You have a conversion on one, but not the other. It's your suggestion here that there is some significance to that single conversion?

I think the problem is, you have no idea if that user would've converted had she landed on the opposite variant. That is, you can't disprove the idea that your test makes no impact at all.
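(For what it's worth, here is that exact scenario run through Fisher's exact test, just to show the test itself gives you nothing to go on:)

    # The 500-vs-500 scenario above: one conversion in A, none in B.
    from scipy.stats import fisher_exact

    table = [[1, 499],    # variant A: 1 conversion, 499 non-conversions
             [0, 500]]    # variant B: 0 conversions, 500 non-conversions
    _, p_value = fisher_exact(table)
    print(p_value)        # ~1.0: nowhere near disproving the null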


You're still thinking in terms of one version being the default and the other an alternative that must be positively proven to be better. If you are in a situation where you have cases A and B and no particular reason to believe a priori that either is more likely to be better than the other, that's a fundamentally different situation.

And in that situation, yes, if you run both versions with randomised visitors and you observe a small but non-zero sample where one converted and the other did not, that is evidence that one version may be better than the other. It's not particularly strong evidence, but it is a non-zero amount of evidence in one direction over the other, and that's better than having nothing at all to separate the cases, which is where you started.

Therefore, if you must make a choice about whether to adopt one version or the other at that stage, then in the absence of any better evidence, it is more likely that the version that has converted performs better than the version that has not, and logically you should adopt the one that converted.

Of course in reality you would probably prefer to collect stronger evidence before making a decision if that is possible. But if it's not then, as closed wrote before, any information is better than no information at all.
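(To put a rough number on "more likely": with flat Beta(1, 1) priors on both conversion rates, the posterior probability that the converting version really is better comes out around 0.75 for the 1-in-500 vs 0-in-500 scenario. A quick Monte Carlo sketch, with the flat priors being my assumption:)

    # Posterior probability that A's true rate beats B's, assuming flat
    # Beta(1, 1) priors and the 1/500 vs 0/500 data from the example above.
    import numpy as np

    rng = np.random.default_rng(0)
    samples_a = rng.beta(1 + 1, 1 + 499, 1_000_000)   # A: 1 conversion / 500
    samples_b = rng.beta(1 + 0, 1 + 500, 1_000_000)   # B: 0 conversions / 500
    print((samples_a > samples_b).mean())             # roughly 0.75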


Have you ever watched a test against a lot of traffic? With 50k visitors a day in the test variant and 50k in the control, you can see wild swings from one day to the next until you reach statistical significance.

I think you and the other guy want that single conversion to be evidence, but in reality, it's statistical noise.

A coin flip assigned that user to that variant. If they were going to convert anyway, you will be deriving meaning from pure coin flip chance, and you have no way of knowing with a single conversion whether this is true.

Again, it's not about going in with an assumption of which is better, it's about realizing that in split testing the biggest challenge is disproving the null hypothesis.


>> I think you and the other guy want that single conversion to be evidence, but in reality, it's statistical noise.

It is evidence, just like any other properly collected data point. It's just very weak evidence, is what we're saying.

Of course in real world situations there may be a lot of variance and the correct answer may well turn out to be the other one. But in the absence of additional information, that is true for literally any number of samples that is less than whatever proportion of the population would give you absolute proof that your chosen answer is correct. If you have 50%-1 samples and every single one went with option A, you're still wrong if the other 50%+1 would have gone for option B.

What you're calling "noise" is an ill-defined concept. Qualitatively there is no difference for a result in a two-way test between a single sample and 50%-1. You still don't know for sure which answer is the right one. However, you're going to be much more confident about having the right answer in the latter case, which is what I think closed was trying to explain to you.

>> Again, it's not about going in with an assumption of which is better, it's about realizing that in split testing the biggest challenge is disproving the null hypothesis.

But if you're running a test with null and alternative hypotheses, you are going in with an a priori preference for one outcome over the other. You are literally saying that if the result is close enough, you will prefer not to reject the null hypothesis, and therefore whichever variation you have arbitrarily chosen to be your null hypothesis will be the answer.

That is self-evidently not a neutral assessment of option A vs. option B, and therefore there will be some cases where your test is more likely than not to make the wrong decision. In short, you are using an inappropriate test for the situation that closed was describing.


Alright, last comment from my side, just to clarify:

>> You are literally saying that if the result is close enough, you will prefer not to reject the null hypothesis, and therefore whichever variation you have arbitrarily chosen to be your null hypothesis will be the answer.

This is a misunderstanding. The null hypothesis is that your two variants have no statistical impact on conversion and any edge you see is just random. That is the hurdle you have to overcome to gain any useful direction from A/B testing.

GL!


Fair enough, my phrasing before was a little casual, but the underlying point is sound. A hypothesis test might tell you that there is no significant impact on conversion at your chosen level. However, you still have to make a choice between option A and option B. If you have no a priori reason to favour one as the default and no additional data to consider -- which, again, is a crucial detail in the situation closed was talking about -- you should logically still choose whichever option was most successful during your experiment.

This is simply because if your conclusion was correct and there is no impact on conversion, then which you pick doesn't matter, but if your result was a false negative, then it is more likely that the more successful option during the experiment is the better choice. Given that you're going to pick that one anyway, your hypothesis test hasn't actually provided any useful information to help inform your decision in this scenario.
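(A quick way to convince yourself of the false-negative point: simulate experiments where the true rates do differ, throw away the runs where a significance test happens to reject, and check how often the observed winner is still the truly better variant. The rates, sample size and threshold below are, of course, made up:)

    # Among experiments that FAIL to reach significance, how often is the
    # observed winner still the truly better variant? Parameters are made up.
    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    rng = np.random.default_rng(0)
    p_a, p_b, n = 0.012, 0.010, 2000   # assume A is truly better
    wins, runs = 0, 0

    for _ in range(2000):
        conv_a = rng.binomial(n, p_a)
        conv_b = rng.binomial(n, p_b)
        _, p = proportions_ztest([conv_a, conv_b], [n, n])
        if p < 0.05 or conv_a == conv_b:
            continue                    # keep only inconclusive, untied runs
        runs += 1
        wins += conv_a > conv_b
    print(wins / runs)                  # comfortably above 0.5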

In any case, we seem to be talking at cross-purposes here, so perhaps we'll have to agree to disagree on this one.


If one variant beats another, even with very few observations, the data DOES support that one is better. It's just that you might not be very confident that one is better.

The key to understanding this situation statistically is to reframe the way you think about tests away from an all-or-nothing NHST, and toward either confidence intervals or Bayesian estimation.

That is, some kind of measure of (loosely) uncertainty around a parameter (or entire model) of interest.
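(Concretely, for the 1/500 vs 0/500 example that keeps coming up, the Bayesian version of this is just two Beta posteriors and their credible intervals; flat priors are again my assumption:)

    # Beta posteriors (flat priors) and 95% credible intervals for the
    # 1/500 vs 0/500 example: wide, overlapping intervals show the
    # uncertainty, but the point estimates still favour the converting variant.
    from scipy.stats import beta

    post_a = beta(1 + 1, 1 + 499)   # variant A: 1 conversion out of 500
    post_b = beta(1 + 0, 1 + 500)   # variant B: 0 conversions out of 500

    print(post_a.mean(), post_a.interval(0.95))   # ~0.004, wide interval
    print(post_b.mean(), post_b.interval(0.95))   # ~0.002, wide interval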


The question then is:

Is the available data more useful than a coin flip, which would be the alternative method of making a decision?

On the other hand, a coin-flip is probably the better tool. If you can't generate enough data for a statistical sample, then you're probably wasting your time creating an alternative version and setting up an A/B test.



