Statisticians Find They Can Agree: It’s Time to Stop Misusing P-Values (fivethirtyeight.com)
302 points by tokenadult on March 7, 2016 | 120 comments



The article submitted here leads to the American Statistical Association statement on the meaning of p-values,[1] the first such methodological statement ever formally issued by the association. It's free to read and download. The statement boils down to these main points, with further explanation in the text of the statement.

"What is a p-value?

"Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

"Principles

"1. P-values can indicate how incompatible the data are with a specified statistical model.

"2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

"3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

"4. Proper inference requires full reporting and transparency.

"5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

"6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis."

[1] "The ASA's statement on p-values: context, process, and purpose"

http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016....
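
As a rough illustration of the informal definition above (my own sketch, not part of the ASA statement), here is how you might approximate a p-value by simulation, using a permutation test on a hypothetical two-group mean difference:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: two groups of 50, with a small real difference in means.
    group_a = rng.normal(0.3, 1.0, 50)
    group_b = rng.normal(0.0, 1.0, 50)
    observed_diff = group_a.mean() - group_b.mean()

    # The "specified statistical model" here is the null model that both groups
    # come from the same population. How often is the statistical summary (the
    # mean difference) equal to or more extreme than its observed value?
    pooled = np.concatenate([group_a, group_b])
    n_sims, count = 10_000, 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        diff = pooled[:50].mean() - pooled[50:].mean()
        count += abs(diff) >= abs(observed_diff)
    print(count / n_sims)   # approximate two-sided p-value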


I remember the feeling after my first undergraduate course in statistics being that we stated these principles, then spent the remaining weeks essentially invalidating them without offering any real alternatives. My professor may have been more careful than I remember, but the subtlety was lost on me at the time if that was the case.

The testing of statistical hypotheses always seemed like an odd area of the mathematical sciences to me, even after later taking a graduate mathematical statistics sequence. It felt like an academic squabble between giants in the field of frequentist inference (Fisher vs. Neyman and Pearson) that ended suddenly without resolution, with the scientific community deciding to sloppily merge the two positions for the purposes of publication and forge onward.


> 2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Sure, but is there not a significant correlation between the two in practice? Or would you trust something that gives a 1% p-value just as much as something that gives a 99% p-value?

(Yes, I realize it's easy to construct counterexamples, hence why I asked "in practice".)


> "3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

There is no lower threshold at which the data becomes non-predictive?


More likely they mean the opposite: that a statistically significant P value (by whatever threshold you decide to use) should not be used, by itself, to drive policy decisions. Internally, the effect size still matters. Externally, there are numerous other factors that should drive decision-making.


Right, but I guess what I'm getting at is that frequently we see people doing even worse: making policy decisions based on data that doesn't even hit a minimum threshold for acceptability. I totally agree that people shouldn't make policy decisions solely because the data supports it, but frequently you see people make decisions based on data indicating something without it actually being statistically significant. That feels like a bigger problem to me than people actually getting good data but then using it too confidently, which I have rarely if ever seen.

TLDR: People tend to make policy decisions based on data even when the data is basically useless. I'm more worried about that than about blindly following good data.


> That feels like a bigger problem to me than people actually getting good data but then using it too confidently

P value alone does not tell you whether you have good data. It just tells you how well a particular model fits the data that you have. It won't (and can't) tell you if your data set is missing data that would alter the P value were it to be included.

P value alone is not enough to say "good" or "bad." That's why this statement matters.


I'm not sure I agree. If the P-value is really low, then it implies either the existing data set is bad or your new data set is bad. Either way you haven't used data to get to a confident decision.

Put another way: tons of stuff in prod dev gets done without p-values entirely because it's the first time you have any data at all about something. I'm questioning whether this is hugely valuable and obviously better than using my intuition or not, since we really don't have a good understanding of the baseline.


One issue is that if you have a large effect that's consistently and easily reproduced, you don't actually need very accurate measurements or a statistical analysis at all. So any minimum standard would need to take that into account.

Another issue is that science is expensive and we need to make decisions all the time whether there is any science backing them or not. So what do you do if there's no science that meets the minimum standards?


> One issue is that if you have a large effect that's consistently and easily reproduced, you don't actually need very accurate measurements or a statistical analysis at all.

I mostly agree with this, but you never have that in a difficult decision.

> Another issue is that science is expensive and we need to make decisions all the time whether there is any science backing them or not. So what do you do if there's no science that meets the minimum standards?

Right! My entire point is that maybe data-driven decision making isn't as useful as we think, because it mostly doesn't hit scientific levels of accuracy. Maybe there are other, more intuitive systems that make as much if not more sense in the absence of good data. How would we even know either way? We can't!

Regardless I'd love for the discussion to be "this is the best we can do" rather than "yes we did A/B testing so we know it's true!!!"


>> One issue is that if you have a large effect that's consistently and easily reproduced, you don't actually need very accurate measurements or a statistical analysis at all.

> I mostly agree with this, but you never have that in a difficult decision.

Consider the problem "should we cure congenital deafness in infants?"

To Deaf Community Leaders, this isn't a difficult decision at all; they are strongly against curing deaf children because it makes their power base smaller.

To deaf parents of deaf children, this is a tricky choice. Curing the child's deafness dramatically improves its prospects in life, but it also inevitably cuts the child off from the parents' community. You have to choose between how good you want your child's future to be, and how close you'd like your relationship with them to be.

To hearing parents of deaf children, this is again a trivial choice; obviously you'd cure the child.

BUT, curing deafness is definitely (1) a large effect that is (2) easily reproduced.


Having thought a bit more about this, I want to disagree more strongly with the sentiment "you never have [large, reproducible effects] in a difficult decision". I think difficult decisions necessarily involve that kind of effect.

A couple of things might make a decision difficult:

- Some course of action will produce a big effect. Would that be good or bad?

-- Ok, assume there's a good effect out there to be achieved. Is it worth the cost of obtaining it?

Those questions, variously applied, occupy a lot of people's time and brainpower. In the curing-a-deaf-child example, the parents are making a tradeoff between quality of their child's life (which is good), and closeness to their child (also good). Paying one for the other means giving up something good, which makes the decision nontrivial. But this is a difficult decision because the effects are large, not because they're small.

In contrast, if you're dealing with an effect of very small size, or one that can't be reproduced ("lose weight by following our new diet!")... this might feel like a difficult decision, but it shouldn't. It does not matter what you decide, because (by hypothesis!) your decision will have no effect on the outcome! (Or, for the "very small effect size" case, at most a very small effect.)


This is why the phrase 'An anecdote is not data' chaps my hide. If you have good observability and theory to back it up, a single measurement can tell you everything. Whereas when you don't have that, take all the data points you want: all you have is garbage.


Implied in these responses, and a view inherently held by many: P-value <-> Science. That's not true. A P-value is just one way, and usually not a very good one. For most business decisions, median/mean (depending on the data) absolute error on a hold-out set is great.


> should not be based only on whether a p-value passes a specific threshold.

It seems as though you've missed the "only" in the clause. Further, they say nothing w.r.t. "does not pass" (aka not hitting a minimum threshold). Your complaint and their point are orthogonal.


I agree as well!

Here is what probability theory teaches us. The proper role of data is to adjust our prior beliefs about probabilities to posterior beliefs through Bayes' theorem. The challenge is how to best communicate this result to people who may have had a wide range of prior beliefs.

p-values capture a degree of surprise in the result. Naively, a surprising result should catch our attention and cause us to rethink things. This is not a valid statistical procedure, but it IS how we naively think. And the substitution of a complex question for a simpler one is exactly how our brains are set up to handle complex questions about our environment. (I'm currently working through Thinking Fast and Slow which has a lot to say about this.)

Simple Bayesian approaches take the opposite approach. You generally start with some relatively naive prior, and then treat the posterior as being the conclusion. Which is not very realistic if the real prior was something quite different.

Both approaches share a fundamental mistake. The mistake is that we are taking a data set and asking what it TELLS us about the world, when in probability theory the real role of data is to UPDATE our views about the world.

This is why I have come to believe that for simple A/B testing, thinking about p-values is a mistake. The only three pieces of information that you need are how much data you are willing to collect, how much you have collected, and how big the performance difference is. Stop either when you have hit the maximum amount of data you're willing to throw at the test, or when the difference exceeds the square root of that maximum amount. This is about as good as any simple rule can do.
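
To make that concrete, here is a minimal sketch of the rule as described (treating "data" as conversions and assuming evenly split traffic; those details are illustrative assumptions, not the only way to do it):

    import random

    def sequential_ab_test(p_a, p_b, max_conversions):
        """Stop when total conversions reach max_conversions, or as soon as the
        difference in conversions between the arms exceeds sqrt(max_conversions)."""
        threshold = max_conversions ** 0.5
        conv_a = conv_b = 0
        send_to_a = True
        while conv_a + conv_b < max_conversions:
            if send_to_a:
                conv_a += random.random() < p_a   # did this visitor convert on A?
            else:
                conv_b += random.random() < p_b   # did this visitor convert on B?
            send_to_a = not send_to_a
            if abs(conv_a - conv_b) > threshold:
                break                             # the difference is already decisive
        return "A" if conv_a >= conv_b else "B"

    # Example: a 10,000-conversion budget, B slightly better than A.
    # print(sequential_ab_test(0.050, 0.051, 10_000))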

If you try to be clever with p-values you will generally wind up saving yourself some effort in return for a small risk per test of making very bad mistakes. Accepting a small risk per test over many tests for a long time puts you at high odds of eventually making a catastrophic mistake. This is a very bad tradeoff.

I've personally seen a bad A/B test with a low p-value rolled out that produced a 15% loss in business for a company whose revenues were in the tens of millions annually. It. Was. Not. Pretty. (The problem was eventually found and fixed..a year later and after considerable turnover among the executive team.)


> Simple Bayesian approaches take the opposite approach. You generally start with some relatively naive prior, and then treat the posterior as being the conclusion. Which is not very realistic if the real prior was something quite different.

I don't think this is a completely accurate portrayal of Bayesian stats. In Bayesian stats, there is no "real prior". Probability distributions are all subjective representations of belief. The prior is just what you believe prior to evidence, and the posterior is what you believe after you've taken evidence into account.

That said, moving away from p-values and towards something more robust is something the A/B testing industry needs. (Obviously I have my own opinion of what that something should be, and it's a bit different from what you are advocating.) There are far too many consultancies and agencies p-hacking their way to positive results ("hey unsophisticated client - guess what I made your conversion rate go up 25%!") and I'd love to see every one of them die.


There are different approaches a Bayesian might take. The one that I described is certainly among them, though it is not the only one.


I think the word "naive" is problematic here. Have you seen instances where Bayesians choose a prior that isn't at least somewhat informed by exploratory analysis?


It's quite common to choose a conjugate prior, which aids computation, and which is readily comprehensible within the discipline ("...and a Wishart prior for the covariance, of course..."). But which is, in effect, not at all informed by the data.

It's also common to have a complex, multi-level model setup which has a few hyperparameters (say, gamma distribution shapes) which are set basically arbitrarily. The idea being that the posteriors for the lowest levels (closest to data) will be learned/fitted, but the hyper parameters at the top-level of the model are fixed. This is very common in spatial statistics.

It's also common to make lots of (conditional) independence assumptions, just because they are convenient. Such as diagonal covariances, or conditional independence between "separate" elements of a model. But these independence assumptions are often just convenient or based on intuition, but not re-checked. Or, if they are checked and found wanting, it's "left for future work".

These practices are defensible, and attackable. But choosing convenient priors out of a bag of standard priors, without reference to the problem at hand, is very common.


There is a notion of an uninformative prior [0].

What would you do if you were studying a process that had not been previously studied, so you had no previous experience/data on which to base your prior? If you wait and choose a prior based on the data you observe, then you are using an empirical Bayes method [1].

[0]: https://en.wikipedia.org/wiki/Prior_probability#Uninformativ...

[1]: https://en.wikipedia.org/wiki/Empirical_Bayes_method


A uniform prior would be one example. That's just saying "I've got no idea what the value is".



EDIT: I forgot about the prevalence of uninformed priors. Thanks HN!


There is nothing more robust than p-values. The problem is that people call things "p-value of this experiment" when they aren't (for example, they may actually be talking about the "p-value of this other experiment" [which they possibly didn't even perform]).


Thanks for the very interesting thoughts.

Could you explain a bit how your rule of thumb works and why it's better than p-values? Why is the difference vs. the square root of the max available sample size a meaningful measure?


The idea is that you will decide when you've either expended as much energy as you are willing to, or when you're convinced that you won't make a different decision.

There is a simple symmetry argument that shows that the odds of a random walk reaching sqrt(N) in one direction and then ending up in the opposite direction by N observations are exactly the same as the odds of a random walk reaching 2 sqrt(N) by the time you reach N observations. So the sqrt(N) rule is a simplified version of, "Call off the test when you've reached 95% certainty that you have your final answer."

You can easily make a million minor improvements. But I can't come up with a simpler rule meeting the business needs which is less likely to lead to dangerous misunderstandings.

"Decide when you pass this threshold, or when you've put out as much effort as you're willing to."

And the effort required for testing is easily understood as well.

"The cost of a bad A/B test is at most 100 lost sales and remembering it is there for 2 months." (This would be for an organization that expects 10,000 sales in 2 months, and is willing to leave tests up for that long.)

And if memory serves me (I'd have to redo the calculation) the rule of thumb in this case is that wins below 1% are called randomly, above 2% are usually called right, you're confident of getting the right answer above 5%, and even over many tests you are confident of never making a 10% mistake.

That last rule of thumb is important. Small organizations can't benefit from A/B tests. Large ones are stupid not to use them heavily. And very few people have a good sense of whether it makes sense for them.


Did you know that your rule of "do the tests until significance is reached" invalidates the use of chain probability and most tests based on it?

You can show that in the limit, your rule will always reach any given threshold just by pure luck.

You really need to set the number of trials or samples before performing them.

Here's a more in depth explanation: http://www.evanmiller.org/how-not-to-run-an-ab-test.html


The procedure I suggested starts with setting the maximum number of conversions before beginning the test. It is emphatically not the procedure that Evan Miller criticizes.

In fact see http://www.evanmiller.org/sequential-ab-testing.html for Evan Miller suggesting a similar procedure to mine, based on past conversations that I have had with him. The difference between that one and mine is that I focus on "make the best decision available, even though it may be only little better than a coin flip" while he tries to only decide when you can decide with confidence. In past conversations we have agreed that each other's procedures are reasonable, but they achieve different goals.


Thanks! That's very helpful.


The way I see it, a p-value is a mathematical way to share how certain you are that something is in the realm of reality (not truth). It is not a formal conclusion of the results, and is nothing more than a "spoiler" indicating what you might also conclude.


This point of view is not supported by the mathematics. If you think otherwise, then you do not understand the mathematics.


It's not just p-values. Some people just don't understand even very basic statistics.

I remember talking to one person in marketing who ran surveys of the company's users. They would send out a survey to all registered users, get back responses from 1% of them or something, and then proceed to report findings based on the responses. They were really happy, since a 1% response rate is great for surveys like this.

I tried to explain to them that all of this statistical machinery relies on having a random sample, and a self-selected sample is not that. No effect whatsoever. Surveys like this are standard practice in the industry. Why are you making trouble, geek-boy?


" Surveys like this are standard practice in the industry."

So, how does one exploit this apparent bad practice? I.e., how does one make a profit from others making this mistake? The answer is, of course, one doesn't - otherwise others would've done so. So what does this tell us about the state of affairs? Is it that bad sampling doesn't matter for practical purposes, or that marketing research is useless? I don't know, and I'm not trying to be belligerent here. A situation similar to this shows up in dozens of places every day - more often than not, it doesn't matter if things are done 'right', there are large margins within which 'good enough' is indistinguishable from 'right'.

This bothers me greatly, but it's hard to argue against this conclusion, empirically. How to deal with this cognitive dissonance? I mean, this is fundamentally the exact topic that at least half of the blog posts that make it to the HN front page are about.


It took an undergrad ecology course to teach me this. The first 4 labs were plug-and-chug equations sampling a square with computer-generated randomly placed dots. All we had to do was get a sample (we could choose the size) and figure out a range that contained the correct number with 95% likelihood. No one could do it for 3 labs, before he told us how. Good samples are REALLY important, even under basically perfect conditions.


P-values are so weird. Studies should instead report a likelihood ratio. A likelihood ratio is mathematically correct, and tells you exactly how much to update a hypothesis.

You can convert p-values to likelihood ratios, and they are quite similar. But it's not perfect. A p-value of 0.05 becomes 100:5, or 20:1. Which means it increases the odds of a hypothesis by a factor of 20. So a probability of 1% updates to about 17%, which is still quite small.

But that assumes that the hypothesis has a 100% chance of producing the same or greater result, which is unlikely. Instead it might only be 50% which is half as much evidence.

In the extreme case, it could be only 5% likely to produce the result, which means the likelihood ratio is 5:5 and is literally no evidence, but still has a p value of 0.05.

Anyway likelihood ratios accumulate exponentially, since they multiply together. As long as there is no publication bias, you can take a few weak studies and produce a single very strong likelihood update.
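
A minimal sketch of that update (treating the p-value as P(result at least this extreme | null), which is the rough conversion described above):

    def posterior_prob(prior_prob, p_value, p_given_hypothesis=1.0):
        """Update prior_prob using the likelihood ratio
        P(result | hypothesis) / P(result | null)."""
        likelihood_ratio = p_given_hypothesis / p_value
        prior_odds = prior_prob / (1 - prior_prob)
        posterior_odds = prior_odds * likelihood_ratio
        return posterior_odds / (1 + posterior_odds)

    print(posterior_prob(0.01, 0.05))        # ~0.17: the 1% -> 17% example above
    print(posterior_prob(0.01, 0.05, 0.5))   # half the evidence: ~0.09
    print(posterior_prob(0.01, 0.05, 0.05))  # likelihood ratio 1: no evidence, stays at 1%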


For context, I have a degree in statistics, and I did research with Andrew Gelman (one of the statisticians quoted in the article).

Glad to see this is gaining traction! I've been saying this for years: the world would actually be in a better place if we just abandoned p-values altogether.

Hypothesis testing is taught in introductory statistics courses because the calculations involved are deceptively easy, whereas more sophisticated statistical techniques would be difficult without any background in linear algebra or calculus.

Unfortunately, this enables researchers in all sorts of fields to make incredibly spurious statistical analyses that look convincing, because "all the calculations are right", even though they're using completely the wrong tool.

Andrew Gelman, quoted in the article, feels very strongly that F-tests are always unnecessary[0]. I'd go as far as to extend that logic to the Student's t-test and any other related test as well.

You can get into all sorts of confusing "paradoxes" with p-values. One of my favorites:

Alice wants to figure out the average height of a population. Her null hypothesis is 65 inches. She conducts a simple random sample, performs a t-test, and determines that the sample mean is 70 inches, with a p-value of .01.

In an alternate universe, Bob does the same thing, with the same null hypothesis (65 inches). He determines that the sample mean is 90 inches, with a p-value of .000001.

Some questions:

A) Does Bob's experiment provide stronger evidence for rejecting the null hypothesis than Alice's does?

B) In Bob's universe, is the true population mean higher than it is in Alice's universe?

By pure hypothesis testing alone, the correct answer to both questions is "no", even though the intuitive answer to both questions is "yes"[1].

[0] http://andrewgelman.com/2009/05/18/noooooooooooooo/

[1] Part of the problem is that we do expect that, in Bob's universe, the true population mean is highly likely to be higher, and this is supported by the data. Trouble is, the reason we expect that is not formally related to hypothesis testing and t-tests/p-values.
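
For anyone who wants to play with the Alice/Bob example, here is a small sketch that computes a two-sided one-sample t-test p-value from summary statistics (the sample sizes and standard deviations below are hypothetical, not taken from the example above):

    from math import sqrt
    from scipy import stats

    def one_sample_t_pvalue(sample_mean, sample_sd, n, mu0):
        """Two-sided one-sample t-test p-value from summary statistics."""
        t = (sample_mean - mu0) / (sample_sd / sqrt(n))
        return 2 * stats.t.sf(abs(t), df=n - 1)

    # Null hypothesis: the mean height is 65 inches.
    print(one_sample_t_pvalue(70, 10, 30, 65))   # Alice-like: roughly 0.01
    print(one_sample_t_pvalue(90, 10, 30, 65))   # Bob-like: vastly smaller p-value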


If you're doing a measurement, why have a null hypothesis? Alice should sample the population at random, take the height measurements, calculate the average, plot the distribution, calculate the variance. If the distribution is not sufficiently smooth then continue to take measurements until it's smooth, or unchanging. Then Alice is done discovering all there is to know about the height distribution of the population she sampled. Same with Bob.


> Alice should sample the population at random, take the height measurements, calculate the average, plot the distribution, calculate the variance

A Student's t-test is basically a lossy encoding of the information you describe (the sample mean, the variance, and the sample size).

> If the distribution is not sufficiently smooth then continue to take measurements until it's smooth, or unchanging.

From a frequentist standpoint, this would be considered sloppy and bad methodology. You don't keep sampling until you get the results you want (or until the null hypothesis is invalidated).

That said, as I mention in a comment above, the fact that you can keep sampling until the null hypothesis is invalidated (and that it is mathematically always guaranteed to happen eventually) is a big problem with the concept of hypothesis testing in the first place.


Creating a 100(1-a)% interval estimate of the population mean based on the sample's mean and variance is equivalent to examining the set of all values which produce a p-value of at least a. Your view of the way Alice and Bob should be making inference about the true mean is actually exactly what hypothesis testing is. It's just that there is often a "status quo" belief about the parameter which has important theoretical implications. And academic papers are usually about that belief specifically, so it makes sense that they talk about it as a "hypothesis test". And that's the logical headline, the statement about the specific value. But standard deviations and distributions are often given, making it trivial to create the interval as a reader.
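
Here is a small sketch of that equivalence (the summary statistics are hypothetical): the 95% interval is exactly the set of null values whose two-sided p-value is at least 0.05.

    from math import sqrt
    from scipy import stats

    xbar, s, n, alpha = 70.0, 10.0, 30, 0.05     # hypothetical sample summary

    # 100(1-alpha)% confidence interval for the mean...
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / sqrt(n)
    ci = (xbar - half_width, xbar + half_width)

    # ...contains exactly the null values mu0 that would not be rejected.
    def pvalue(mu0):
        t = (xbar - mu0) / (s / sqrt(n))
        return 2 * stats.t.sf(abs(t), df=n - 1)

    print(ci)                              # roughly (66.27, 73.73)
    print(pvalue(ci[0]), pvalue(ci[1]))    # both exactly 0.05, the boundary
    print(pvalue(65.0) < alpha)            # True: 65 lies outside the interval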


This paradox is a great example of why I'm uncomfortable with the idea of coming up with a p-value when you're comparing a sample mean to a point value in the first place.

If your null hypothesis amounts to little more than a scalar value that exists in a vacuum, you haven't really got a null hypothesis. If your null hypothesis is that your results will match what was seen in some other data set, collected by some other person who may or may not have been using the same protocol as you, then your null hypothesis describes a parallel universe, and there's no way to draw an apples-to-apples comparison between it and your alternative hypothesis, which concerns data that did not come from that parallel universe.

So I guess my answer to all questions would be, "Mu."


I can't understand what the problem with p-values is. You should never talk about the results; the essential point is that the method provides valid conclusions 95% of the time and invalid conclusions 5% of the time. If your conclusion is wrong (you are in the 5% part) that is not a paradox. Also you should not try to prove things, since your conclusion can be wrong; you should be glad to have a procedure that gives you useful information but is not infallible. By being humble you solve the problem.


For giggles and grins, my aunt and uncle are an actual instance of one of the classic frequentist vs. Bayesian examples where frequentist statistics says that something utterly irrelevant should matter.

Scenario 1 (true). Bill and Lorena had 7 children. 6 were boys, 1 was a girl. Are they biased towards having one gender? A 2-sided p-value says that there are 16 possibilities of this strength or more, each of which has probability 1/2^7, for a p-value of 2^4/2^7 = 1/2^3 = 0.125. We therefore fail to reject the null hypothesis at a p-value of 0.05.

Scenario 2 (also true): Bill and Lorena decided to have children until they had a boy and a girl. They had 6 boys then a girl. Are they biased towards having one gender? An event this unlikely could only happen if they had 6 of the same gender in a row, which is 2 possibilities of probability 1/2^6, for a probability of 2/2^6 = 1/2^5 = 0.03125. We therefore reject the null hypothesis at a p-value of 0.05.

Now the Bayesian gotcha. According to Bayes' theorem, the intent of Bill and Lorena can have absolutely NO impact on ANY calculation of posterior probabilities from prior expectations. There is no logical way in which this fact should matter at all. And yet it did!
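
For anyone who wants to check the two calculations, a small sketch (my own restatement of the scenarios above):

    from scipy.stats import binom

    # Scenario 1: the number of children (7) was fixed in advance.
    # Two-sided p-value for a 6-1 split under p = 0.5.
    p1 = 2 * binom.sf(5, 7, 0.5)    # P(>= 6 boys) + P(>= 6 girls) = 0.125

    # Scenario 2: they kept having children until both genders appeared, so the
    # statistic is "how many children it took"; 7 children means the first 6
    # were all the same gender.
    p2 = 2 * 0.5 ** 6               # = 0.03125

    print(p1, p2)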


Is this actually true for the Bayesian model? Throw in a parameter for whether or not Bill and Lorena were trying to have children until they had a boy and a girl. Now your answer will depend quite heavily on your prior!


This is entirely true.

The posterior odds are computed ENTIRELY from the odds of the observed events under the prior beliefs. There is NO WAY in which might-have-beens and didn't-happens enter in. Therefore the posterior probabilities cannot depend on the knowledge of what they would have done if something different had happened.

Of course frequentist statistics are heavily affected by what would have happened if something different had happened.


No, even as a Bayesian I could use either of two models to understand my data:

In world one, I use bayesian update on the model P(# boys | Bias) = Multinomial(n = 7, p = Bias)

In world two, I use bayesian update on the model P(# children | Bias) = Geometric(p = Bias)

Might-have-beens and didn't happens do play in, in my choice of model. I should choose the one that I believe, and if I'm not certain, I should use an even more complicated model that incorporates my beliefs about what models might be appropriate.


Not if you follow Bayes' theorem. If you start with a prior distribution of beliefs about the likelihood of various ratios of boy vs girl births, the posterior distribution only depends on the observed outcomes. And the posterior distribution is exactly given by Bayes' theorem.

One possible source of confusion for you is that Bayesian ideas have been a source of inspiration for a lot of ad hoc techniques (eg naive Bayes) which do NOT follow Bayesian rules of inference. The reason is that exact inference in Bayes nets is NP-hard. So you're used to hearing "Bayesian" applied to things that have nothing to do with Bayes' formula.


Our goal is to calculate P(pb > 0.5 | "Six girls and one boy"), where pb is the probability of having a boy. (Ignoring that we have already assumed p is fixed), by applying Bayes' Theorem, we have:

P(pb > 0.5 | "Six girls and one boy") = (P("Six girls and one boy" | pb > 0.5) * P(pb > 0.5)) / P("Six girls and one boy")

Applying Bayes theorem thus requires us to calculate P("Six girls and one boy" | pb > 0.5). How do you suggest that we do this? Why is your answer the unique correct solution?


Well first you have to start with a prior distribution of beliefs, which that is not. And to reduce confusion I'll switch back to the actual genders (6 boys then a girl).

Suppose our prior distribution of beliefs is 0.5 that the probability is exactly 1/2, versus 0.5 that the probability of a boy is some value P which is equally likely to be any value from 0 to 1.

In the first case, the probability of 6 boys and 1 girl is 0.5^7 = 1/2^7. In the second case the probability of 6 boys and 1 girl is P^6(1-P) = P^6 - P^7. The integral from 0 to 1 of P^6 - P^7 is 1/7 - 1/8 = 1/56. Each case also has a priori odds of 1/2 of holding true.

After observing 6 boys and 1 girl, the first case now has probability (0.5/2^7)/(0.5/2^7 + 0.5/56) = 1/(1 + 128/56) = 7/23 ≈ 0.304. The second case now has probability 1 - this, which is 16/23 ≈ 0.696. Furthermore, if the second case is true, P is no longer uniformly distributed. In fact its density is now proportional to P^6 - P^7.

So the posterior distribution is now going to be:

With probability 7/23, exactly 0.5. And otherwise any value P from 0 to 1 with a probability density of (16/23) * 56 * (P^6 - P^7).

Given this prior and this set of observations, any other answer is wrong. Given a different prior you would get a different posterior, but as long as the prior gives a constant probability of male/female, the only fact that matters is how many boys and girls there are.

The order of births can only start to matter if you start with a prior that gives different probabilities of different genders based on prior events. Even then it is hard to come up with a realistic scenario in which the plans of the parents would make an order of magnitude difference in the posterior distribution.
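
As a quick sanity check of those numbers, here's a sketch using exact fractions (same mixture prior as above):

    from fractions import Fraction

    # Marginal likelihood of the specific sequence BBBBBBG under each case.
    lik_fair = Fraction(1, 2) ** 7               # p fixed at 1/2: (1/2)^7
    lik_unif = Fraction(1, 7) - Fraction(1, 8)   # integral of P^6(1-P) dP = 1/56

    prior = Fraction(1, 2)                       # equal prior weight on each case
    post_fair = prior * lik_fair / (prior * lik_fair + prior * lik_unif)

    print(post_fair)        # 7/23
    print(1 - post_fair)    # 16/23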


But if they are having children until they have one of each, the probability of the different observations does depend on the prior events!


The absolute probability of the observation is irrelevant. Only the RELATIVE probabilities of said observation under the different possible theories which are part of the prior.

If the set of prior theories does not include anything that depends on birth order, then birth order and the experimental design are irrelevant to the posterior conclusions.


Right, but this is exactly my point: the correct answer depends on the set of prior theories, which is exactly what the frequentist scenarios you consider are varying.


If you are arguing THAT point, then you shouldn't have disagreed with me anywhere!

We have 3 types of facts.

1. What is the set of prior theories? Probability theory says this should matter. Bayesians are explicit about its involvement. Frequentists ignore it.

2. Observed data. Probability theory says that this should matter. Everyone takes this into account.

3. Experimental design for what would have happened had something different than the observed actually happened. This matters a lot to frequentist approaches and does not matter at all to Bayes' theorem. Bayesian approaches generally do not care about it at all.

The difference between scenario 1 and scenario 2 is a fact of type 3, the conditions under which Bill and Lorena would have stopped having children. Unless you believe that Bill and Lorena's desire for one gender has a material impact on the probability of boys vs girls, this fact is irrelevant to any calculation of posterior probabilities. And is irrelevant in classical Bayesian approaches.

Yet, despite being irrelevant, it matters a lot for frequentist approaches.


1 and 3 are really the same thing. If the complaint is that the frequentist test can't tell you anything if the assumed distribution wasn't the right one (which is what's happening if you would have done something different), consider the bayesian case. There one might argue you at least still have the probability of each hypothesis given the data. But that forgets that it is only the probability of each hypothesis given the data, given that those were the only possible hypotheses. And if you admit the possibility of there being other possible hypotheses, then these probabilities don't really tell you anything either.

Anyway, not sure if we're on the same page, but thanks for the discussion.


We are clearly not on the same page. Because I think that 1 and 3 are rather different things, and you don't.

In particular 1 consists of exact statements about the likelihood of 7 births in a row being mmmmmmf.

By contrast 3 consists of statements about what Bill and Lorena's childbearing plans would have been if something different had happened.

Those are very different types of statement. There is no connection between statements of type 3 and statements of type 1 unless Bill and Lorena's state of mind makes a significant difference to the odds of the next child being a boy.


Okay, my last try.

For Bayes' theorem, we need a theory of how the data is produced given the parameter of interest. Bill and Lorena's plans certainly influence what data I observe: in scenario two, I can never observe the data BBBBGGG, but in scenario one, I can. My point is that your first category is not "exact statements about the likelihood of 7 births in a row being mmmmmmf", it is "exact statements about the likelihood of observing mmmmmmf", which is, in fact, quite different, if you admit the possibility of Bill and Lorena having particular childbearing plans.


If you include in the prior information about the likelihood of the next child being born, you will indeed get different absolute probabilities. But you will not get different relative probabilities unless your available priors create a correlation between birth order and the likelihood of different genders for the next child if it comes. And therefore the probability of having the next birth cancels out of Bayes' formula and you wind up with the exact same conclusions from the observed data.

You certainly DON'T wind up with anything like the factor of 8 difference that frequentist techniques will see!


Is this oddity just because scenario 1 fails to account for order?


Exactly. There are more ways to get this unusual outcome in scenario 1 than scenario 2.


If Alice just wants to know the average height of the population, why is she doing a hypothesis test that the height isn't 65 inches?

Since her hypothesis test is designed to help answer the question of whether or not the mean population height is 65 inches, why should we expect it to tell us anything about the mean population height other than whether or not it being 65 is consistent with the data observed?


Where is the "paradox"? I assume that you mean that by pure hypothesis testing both Alice and Bob reject the null hypothesis at the alpha=0.05 level, for example. The fact that the p-value is irrelevant is by design: you perform the test and either you reject the null hypothesis or you don't. But in fact most people won't do "pure" hypothesis testing and will conclude (somewhat incorrectly) that the evidence against the null hypothesis is indeed stronger in Bob's experiment.

Edit: and why would be the answer to the second question "no"? The hypothesis testing procedure doesn't provide any point estimate at all so the question doesn't really mean anything in that setting.


Thanks! This is one of the best explanations of the problem with using p-values.

Furthermore, I can recall many times when my stats professors would make it very clear what the "right" answer was, despite it being counter-intuitive, but without explaining why.


Is the p-value really not the probability of your results being due to chance? Is that not a perfectly valid definition of it?

I suppose 'chance' is a little hand-wavy, but isn't a p-value just the probability of your data given that your hypothesis is false? Isn't that literally and precisely the probability that they occurred by chance?


> Is the p-value really not the probability of your results being due to chance?

No, it's the probability of a particular observation, given that we assume the result is due to chance. This sounds similar, but the difference is that it doesn't say anything about the probability of your result outside the context of the study's hypotheses.


No it's not.

It's the probability of you seeing your results due to chance if there is no effect.

This differs from the probability of the results being due to chance because it does not take into account the probability of your hypothesis being true or false.

If you observe something that would disprove e.g. General Relativity with a P of 0.001 it is much more likely to be due to chance than if you observe something that is consistent with known science with a P of 0.001, as the weight of all the evidence for General Relativity is very strong.


I'm probably jumping into shark filled waters considering the point of the article is that defining p-values intelligibly is incredibly difficult even for experts, but here goes....

> Is the p-value really not the probability of your results being due to chance?

No, it is not. It is the probability that the statistical attributes of the data would be equal to or more extreme than observed, assuming the Null Hypothesis is true.

In your definition, there is no assumption of what model the data should conform to...so what does "by random" mean in that context? Also, "by random" doesn't mean 'not predictable' or useless. If I roll a 100-sided die an infinite number of times, I'd expect the number of observations of '5' to approach .01 of the total distribution. So, the probability of rolling a 5 by random chance is 1 in 100. However, I would not reject my Null Hypothesis, since my model predicts exactly this random behavior.

Now, if I rolled a die 1000 times and rolled a 5 every time, the mean of that distribution (5) would be very, very far from the expected mean of my model if the Null is assumed to be true. And I may be tempted (very) to reject the Null that I am rolling a 100-sided fair die.

I will now sit back and wait for my definition and analogy to be torn to shreds. :)


You have to be very specific about "probability that they occurred by chance." If p = 0.05, you can't say "there's a 95% chance this result is real and a 5% chance it's just a fluke." You can say "if chance is the only thing operating, we'd see a result like this only 5% of the time."

In conditional probability notation, it's the difference between P(result | it's just chance) and P(it's just chance | result).


Imagine I handed you a 20-sided die. I claim it says 7 on every side, but I might be lying. You roll a 7. What are the chances it actually has 7 on every side?

You can't actually say unless you either (1) roll the die more times, or (2) assume something about the probability that I gave you an all-7's die to begin with.

Doing (2) is useless, because that's exactly the question we are trying to answer.

For example, suppose I perform this experiment all the time and I know that I give an all-7's die only 1% of the time. With this new information, you could actually calculate the probability of an all-7's die given a 7 roll. Of all the possible outcomes, you could add up the ones where I gave you an all-7s die and the ones where I gave you a normal die but you just rolled 7. Then you could divide that by the total number of possible outcomes.

But this would give you a totally different number than if I give you an all-7's die 99% of the time. And the problem is that you don't have any information about what kind of die you have before you roll it. You're trying to figure out which world we live in -- one where your hypothesis is true or one where it's not.

(I am pretty sure that what I wrote above is true. But one thing I'm not as clear on is how multiple rolls of the die actually can establish confidence percentages. How many rolls does it take to actually establish confidence? Would love to hear from any stats experts about that.)


Rolling the die more times just lowers the P value. You still can't make a definitive statement on the probability that it's an all-7's die without assuming something about the prior probability.

However, if you can bound the prior probability on the low end, you can make meaningful answers. Let's say you think there's at least a 1-in-a-billion chance that you have an all-7's die. After five rolls in a row of 7, there's at least a 0.3% chance that you were handed an all-7's die. After six rolls there's a 6% chance, and so on.

Usually this quantitative analysis isn't formally done, since the priors can always be debated, but rather a very small P value is demanded for very unlikely events.
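
A minimal sketch of the update in this die example (each 7 multiplies the odds by 20, since the all-7's die always shows 7 and a fair d20 shows it one time in 20):

    def prob_all_sevens(prior, num_sevens_rolled):
        """Posterior probability the die is all-7s after seeing only 7s."""
        odds = prior / (1 - prior) * 20 ** num_sevens_rolled
        return odds / (1 + odds)

    print(prob_all_sevens(1e-9, 5))   # ~0.003: the "at least 0.3%" figure above
    print(prob_all_sevens(1e-9, 6))   # ~0.06: roughly 6%
    print(prob_all_sevens(0.99, 1))   # the prior matters: ~0.9995 if such dice are common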


> I suppose 'chance' is a little hand-wavy, but isn't a p-value just the probability of your data given that your hypothesis is false?

It is the probability of achieving a result at least as far from the hypothesized value exclusively due to random variation that is uncorrelated with the explanatory variable(s) at hand [and subject to a number of other assumptions].


Ya, but is that not equivalent to the statement 'due to chance'? I think the common understanding of 'chance' would include 'due to random variation uncorrelated with the explanatory variables'.


> I think the common understanding of 'chance' would include 'due to random variation uncorrelated with the explanatory variables'.

Yes, but the problem is that it also includes more than that. One of the many problems with p-values is that people assume that the p-value encodes a lot more information than it actually does.

Another problem: it doesn't really tell you any "new" information. All equality-based null hypotheses are false, and we know this due to continuity theory[0]. Really, all a hypothesis test does is tell us if the sample size is large enough to reflect this knowledge. Literally any null hypothesis can be rejected, as long as the sample size is sufficiently large.

It's worse than that, though. Yes, hypothesis testing doesn't tell us anything about the practical significance of the results. But it also doesn't actually tell us what most people think it does: a measure of how wrong our hypothesis is. Rejection is a binary state: there's no concept of "strongly rejecting" a null hypothesis[1].

[0] We could reject continuity on the basis that the real world is actually discrete at the quantum level, but a lot of the math underpinning hypothesis testing falls apart if you don't assume continuity, so it's broken either way.

[1] Many people - including statisticians - will sometimes imply this colloquially ("the p-value was .00001, so our hypothesis was completely wrong"). It's not always "wrong", because largely you can measure this concept in other ways, but you cannot do so with a p-value.
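
To illustrate the point above about large samples, a sketch with made-up numbers: give the data a tiny, practically meaningless true effect, and the point null gets rejected once n is big enough.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mean = 0.01    # 0.01 standard deviations away from the null of 0

    for n in (100, 10_000, 1_000_000):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n)
        t, p = stats.ttest_1samp(sample, popmean=0.0)
        print(n, p)     # with enough data, the exact null "mean = 0" is rejected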


That's not what a p-value is. A p-value is the probability of getting by random chance a result at least as extreme as the measurement. This is not the same as the probability that the effect you measured is due to chance. The latter isn't even well defined without additional assumptions.


Could you please explain what you mean? What is the difference you're talking about?

Roll a die 100 times, and on average 5 rolls have the pattern I want.

vs. I have found a pattern and there is a 5% chance it was due to randomness.

Is that not the same?


The things you're asking to compare aren't particularly clear, but I'm fairly certain they're not the same. The English phrase "due to" describes a causal connection between events, and p-values don't.


Nope! It's "the probability of chance* leading to something at least as extreme as your results". They're related, but they're not the same.

* Ok, a chosen null hypothesis, often chosen to be something like "chance".


Next step is explaining to social science students the meaning of 'randomness' :)

Really, the amazing amount of bullshit social studies I have seen 'proven' by statistics. Amazing new insights like 'if children wear green shirts, while the teacher has a blue shirt the cognitive attention span is 12.3% higher than children wearing purple shirts. The effect was measured with a significance of bla bla bla.'

Software like SPSS facilitates this even more. People with no notion of random effects or probability theory click the 'prove my research' button and even get it published.

So a lot more work in this area!


In undergrad, I learned about p-values but never quite understood how they were actually useful. Now, as a bioinformatics graduate student, I've come to understand that my original instinct was right all along.


Statistical significance is difficult to ensure. Certainly, one should be suspicious if 0.05 is ever used as a significance threshold, because it's unlikely that exactly one hypothesis under one regime was tested in any given paper.

I am glad that the article's headline is clear that it's time to stop misusing P-values. Tests for statistical significance should still be used, and to abandon them would be foolish. In a sense, though, they are the beginning, not the end, of assessment.


p-value analysis has its big caveats, like multiple comparisons, but Bayesian analysis has its own, such as the fact that it's extremely hard to calculate priors. Both are challenging to use in difficult analyses and both can be abused.


> such as it's extremely hard to calculate priors.

I think you mean it is difficult to formulate priors? Typically, calculating a prior only involves sampling from a distribution.


Let's use this example: http://betterexplained.com/articles/an-intuitive-and-short-e...

- 1% of women have breast cancer (and therefore 99% do not).

- 80% of mammograms detect breast cancer when it is there (and therefore 20% miss it).

- 9.6% of mammograms detect breast cancer when it's not there (and therefore 90.4% correctly return a negative result).

For generating a prior, I posit it's really hard to determine that 1% of women have breast cancer, and the posterior likelihood that you actually have cancer is almost linearly sensitive to this prior:

prior | % likelihood of cancer given a positive test

2% | 14.5%

1% | 7.8%

0.5% | 4%

Why do I think it's hard to determine the prior? Imagine you are a 34-year-old woman in SF. Do you use the rate of breast cancer for only 34 yr olds? For the range of 30-40 yr olds? 30-65 yr olds? For women only in SF? In the bay area? In California? Data from 2000-2016? 2010-2016? This becomes hard because positive cases are rare (thank god), so the population from which you calculate the prior can be noisy. Imagine if the positive result occurs at a rate of 1e-5... you are talking about tens of cases per million. Vary the prior probability and your posterior probability will vary just as much.
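
For reference, the arithmetic behind that sensitivity table is just Bayes' theorem (a sketch using the numbers from the example above):

    def p_cancer_given_positive(prior, sensitivity=0.80, false_positive_rate=0.096):
        true_pos = sensitivity * prior
        false_pos = false_positive_rate * (1 - prior)
        return true_pos / (true_pos + false_pos)

    for prior in (0.02, 0.01, 0.005):
        print(prior, round(p_cancer_given_positive(prior), 3))
    # 0.02 -> ~0.145, 0.01 -> ~0.078, 0.005 -> ~0.040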


hmm, what you are discussing above is typically called the formulation of the prior, not the calculation of it. And I agree that this formulation can be problematic. It is almost certainly the most contentious element in Bayesian inference.

However, I will note that your examples are largely overstating the problem. The prior is not typically as subjective as you are implying. Furthermore, the Bayesian prediction converges to the frequentist result in the limit of 'large' data. If your results are so highly sensitive to your prior that the results drastically change, then you either have:

1) insufficient data, at which point the frequentist results would also certainly have been as bad or worse (the prior acts as a regularization), or

2) a misspecified prior.

Either way, it is absolutely required that you perform tests to ensure your results are not greatly sensitive to your choice of prior, or if they are, that this is clearly noted as a modeling assumption. And Bayesian methods naturally have many means to test precisely this, such as cross-validation, Bayes factors, hierarchical models, etc.

I also note that this is why I typically prefer Jeffreys Priors in my work. I mention this so that one can see that not all priors are 'subjective'. These are difficult to use in some fields, admittedly.


Though technically, obtaining that distribution, if you're doing anything above "Eh, it's probably normal..." could be described as "hard".


The #1 problem with p-values is the word "significant". We should use "detectable" instead. Significant implies meaningful to most people, but not in a statistical context. This is quite confusing. Detectable is better because the mainstream meaning aligns with the jargon.

So:

> "Discovering statistically significant biclusters in gene expression data"

becomes:

> "Discovering statistically detectable biclusters in gene expression data"

This rephrasing makes it evident that "statistically detectable" adds little to the title. So the title becomes

> "Discovering biclusters in gene expression data"

A better title.


This excellent and amusing article by Gerd Gigerenzer discusses the history of p-values and their (mis)use:

"Mindless Statistics"

http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf

or

http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JS...


The article tends to imply that P-values should not be used at all, rather than just not misused. P-values definitely mean something. For example, if the p-value is 1e-10 (which is often possible), you know for sure that the hypothesis generating the data has been disproved. So let me rephrase the title of the article: "It's time to use P-values correctly."


Wow, I remember having these reservations about p-values when I took classes in stats but whenever I brought them up a prof. would wave their hands and be dismissive. They gave me a degree in political science, but I felt that political science was an oxymoron and it left me with no respect for the field.


> They gave me a degree in political science, but I felt that political science was an oxymoron and it left me with no respect for the field.

For what it's worth, Andrew Gelman (quoted in the article) is one of the most pre-eminent Bayesian statisticians alive, and is a professor in both the department of Statistics and Political Science!

"Political Science" need not be an oxymoron, even if a lot of self-professed political scientists use rather unscientific methods.


...and I could not find the word "correlation" anywhere in this article discussing statistics, harm, misunderstanding, and studies. Shame.


If people would just learn to resample (cross validate, use subsampling or the bootstrap), we wouldn't be having this pointless discussion at all.
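
For anyone unfamiliar, a minimal sketch of what "resample" means here, via a percentile bootstrap confidence interval (the conversion data at the bottom is made up):

    import numpy as np

    def bootstrap_ci(data, stat=np.mean, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap: resample the data with replacement many times
        and take quantiles of the recomputed statistic."""
        rng = np.random.default_rng(seed)
        data = np.asarray(data)
        boots = np.array([stat(rng.choice(data, size=len(data), replace=True))
                          for _ in range(n_boot)])
        return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

    # Hypothetical conversion data: 120 conversions out of 1000 visitors.
    print(bootstrap_ci(np.repeat([1, 0], [120, 880])))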


In theory, you are correct, but some data sets are too small, or make it too difficult to cross-validate in a way that is meaningful to the original problem.


This is one of the best conversation threads I've ever seen on HN. It's both polite and informative.

I want to toss in my own thoughts here.

Since I've spent the vast majority of my tech career in the Market Research industry (hello, bias!), I'm tempted to say that one of the most frequent intersections between statistical science and business decisions happens in that world.

Product testing, shopper marketing, A/B testing . . . these are pretty common fare these days. But I feel like the MR people are sort of their own worst enemy in many cases.

It's a fairly recent development that MR people are even allowed a seat at the table for major product or business decisions. And when the data nerds show up at the meeting, we have to make human communication decisions that are difficult.

I can't show up at the C-suite and lecture company executives about the finer points of statistical philosophy. When I'm presenting findings to stake-holders, it's my job to abstract the details and present something that makes a coherent case for a decision, based on the data we have available.

It is sinfully attractive to go tell your boss's boss's boss that we have a threshold--a number we can point to. If this number turns out to be smaller than .05, this project is a go.

Three months later, you go back to that boss and tell him the number came back and it was .0499999. The boss says, "Okay, go!" And then you are all, "Wait, wait, wait. Hang on a second. Let's talk about this."

My god, what have I done?

The practical reality of the intersection of statistics and business is a harsh one. We have to do better. In terms of leaky abstractions, the communication of data science to business decision makers is quite possibly the leaky-est of all.

Why is it so leaky? I have two points about this.

1) Statistics is one of the most existentially depressing fields of study. There is no acceptance; there is no love; there is nothing positive about it. Ever.

Statistics is always about rejection and failure. We never accept or affirm a hypothesis. We only ever reject the null hypothesis or we fail to reject it. That's it.

2) In business, we tend to be very very sloppy about formulating our hypotheses. Sometimes we don't even really think about them at all.

Take a common case for market research. New product testing. We do a rep sample with a decent size (say, 1800 potential product buyers) and we randomly show five different products, one of which is the product the person already owns/uses (because that's called control /s). The other 4 products are variations on a theme with different attributes.

What's the null hypothesis here? Does it ever get discussed?

What's the alternative hypothesis?

The implicit and never-talked-about null is that all things being equal, there is no difference between the distribution of purchase likelihood among all products. The alternative is that there is a real difference on a scale of likely to purchase.

The implicit and intuitive assumption is that there is something about that feature set that drives the difference. (I'm looking at you, Max Diff)

But that's not real. It's not a part of the test. The only test you can do in that situation is to check if those aggregate distributions are different from each other. The real null is that they are the same, and the alternative is that they are different.

All you can do with statistics is tell if two distributions are isomorphic.

Now, who wants to try to explain any of that to your CEO? No one does. Your CEO doesn't want it, you don't want it, your girlfriend doesn't want it. No one wants it.

So we try to abstract, and I feel like we mostly fail at doing a good job of that.

This is getting really long, and I don't want to rant. So to finish up, an idea for more effective uses of data science as it interacts with the business world:

I agree, let's stop talking about p values. Let's work harder and funnel the results of those MR studies into practical models of the business' future. Let's take the research and pipe it into Bayesian expected value models.
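For what it's worth, here is one toy version of what that could look like: a Beta-Binomial posterior for purchase intent feeding a simple expected-value calculation. Every number here (counts, margin, market size) is invented, and this is only one of many ways such a model could be set up:

  # Hypothetical concept test: respondents saying "definitely would buy"
  buyers_a <- 120; shown_a <- 900
  buyers_b <- 150; shown_b <- 900
  # Posterior draws for each concept's true purchase rate (flat Beta(1,1) prior)
  draws_a <- rbeta(1e5, 1 + buyers_a, 1 + shown_a - buyers_a)
  draws_b <- rbeta(1e5, 1 + buyers_b, 1 + shown_b - buyers_b)
  margin_per_unit <- 12    # invented business inputs
  market_size <- 5e6
  mean(draws_b > draws_a)                                    # probability concept B really beats A
  mean(draws_b - draws_a) * margin_per_unit * market_size    # expected incremental profit of shipping B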

Let's stop showing stacked bar charts to execs and expecting them to make good decisions based on weak evidence we got from hypotheses we didn't really think about in the first place.

Some of this might come across as a rant. I hope it is not taken that way. This is a real problem that I've been thinking about for a long time. And I don't mean to step on anyone's toes. I have certainly committed many of the data sins that I'm deriding above.

Edited to add:

The real workings of statistics are unintuitive. I'm not saying that they are wrong. But having worked with people for years now, I understand the confusion. It's a psychological problem. Hypotheses are either not really well thought out or not considered in an organized way, in my experience.

A hypothesis is not concrete in many practical cases. It's a thought. An idea, perhaps. It's often a thing that floats around in your mind, or maybe you paid some lip service and tossed it into your note-taking app.

Data seem much more real. You download a few gigabytes of data and start working on it. It's quite easy to get confused.

I have real data! This is tangible stuff. Thinking of things properly and evaluating the probability of your data given the hypothesis is hard. Your data seems much more concrete. These are real people answering real questions about X.

Even for people who are really hell-bent on statistical rigor, this is a challenge.


If they held the same meeting 20 times, would they reach the same conclusion in 19 of those meetings?

On a more serious note, I think that the use of the word "significant" to mean "the effect is reasonably likely to exist by some standard" should be abolished.

Webster's 1913 dictionary says:

> Deserving to be considered; important; momentous; as, a significant event.

Statisticians don't use "significant" to mean important at all -- they use it to mean "I could detect it". This is bad when someone publishes a paper saying "I found that some drug significantly reduces such-and-such" -- this could just mean that they did a HUGE study and found that the drug reliably had some completely unimportant effect. It's much worse when it's negated, though. Think about all the headlines that say that some treatment "did not have a significant effect". This is basically meaningless. I could do a study finding that exercise has no significant effect on fitness, for example, by making the study small enough.

A good friend of mine suggested that statisticians replace "significant" with "discernible". So next time someone does a small study, they might find that "eating fat had no discernible effect on weight gain", and perhaps readers would then ask the obvious question, which is "how hard did you look?".
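A quick way to answer "how hard did you look?" is a power calculation. A rough sketch in R, assuming (purely for illustration) a true effect of half a standard deviation:

  power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = 0.05)
  # power is only ~0.18: a 10-per-group study misses this real effect about 4 times out of 5
  power.t.test(power = 0.8, delta = 0.5, sd = 1, sig.level = 0.05)
  # ~64 per group are needed for an 80% chance of "discerning" it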

This would also help people doing very good research make less wishy-washy conclusions. For example, suppose that "vaccines have no discernible effect on autism rates". This is probably true in a number of studies, but it's the wrong analysis. If researchers who did these studies had to state the conclusions in a silly manner like that, maybe they'd find a more useful analysis to do.

Hint: doing big studies just so you can fail to find an effect is nonsensical. Instead, do big studies so you can put a tight upper bound on the effect. Don't tell me that vaccines don't have a significant (or discernible) effect on autism. Tell me that, with 99.9% confidence, you have ruled out the possibility that vaccines have caused more than ten autism cases in the entire history of vaccines, and that, most likely, they've caused no cases whatsoever (or whatever the right numbers are).
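A minimal sketch of that kind of reporting in R, with entirely made-up data (the point is the form of the conclusion, not the numbers):

  set.seed(1)
  exposed   <- rnorm(5000)   # hypothetical outcome in the exposed group
  unexposed <- rnorm(5000)   # and in the unexposed group
  t.test(exposed, unexposed, conf.level = 0.999)$conf.int
  # Instead of "no significant effect", report the interval itself:
  # "with 99.9% confidence, any true difference in means lies inside these bounds."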

Edit: fixed an insignificant typo.


There is a significant (pun) difference between statistically significant and economically significant. The conclusion in the drug paper conflates the two. In finance, we could find plenty of statistically significant results (e.g. small cap stocks outperform large cap stocks on Fridays, with a small p-value if you like), but most results were not economically significant--they were not usable for a trading system because they were too small to overcome real-world costs. In short, they weren't meaningful in a real world sense, even though the result was detectable statistically.


That's the second definition in Webster's 1913 edition. The first is:

> Fitted or designed to signify or make known something; having a meaning; standing as a sign or token; expressive or suggestive; as, a significant word or sound; a significant look.

It seems to me that this is the sense in which statisticians talk about significance. It means that the results actually signify something rather than just being meaningless noise.


Interesting. That definition seems like a bit of a stretch in this context to me. Results of trials aren't "fitted or designed" to signify -- they are or are not significant by the p-value standard, and whether they are or are not is random (which is the whole point).

In any event, I suspect that, among most currently living English speakers, the second definition is what comes to mind.


There is a difference between effect size and significance. Things can be extremely statistically significant and have very small effect sizes.
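A quick illustration in R: the 0.005-standard-deviation "effect" and the sample size are made up, and the same noise is reused in both groups just to keep the example deterministic:

  set.seed(3)
  noise <- rnorm(2e6)
  x <- noise + 0.005   # group with a trivial 0.005-SD shift
  y <- noise           # control group
  t.test(x, y)$p.value
  # p comes out around 6e-7: hugely "significant", yet a 0.005-SD shift is practically meaningless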

But the problem is less here and more that people don't care to understand the models they are using to judge statistical significance. A p-value is simply a magic wand to wave over the data and bless it. Statisticians may tend to look at the data a lot more qualitatively - a p-value might tell you something, but much more important is: "How accurately have I managed to model this system?"

This is the larger problem lurking behind "p-hacking" and other colossal statistical fuck-ups: people don't understand the mathematical models they are applying, the limitations of the data, and often don't care to, as long as the veneer of having 'done something' can be applied.

This, again, is probably a product of people cranking out shitty papers to make sure that they keep publishing, to continue eking out grants; which, again, is probably a product of research science being generally underfunded for the demands placed on it.


While I agree that 'significant' has a misleading connotation, 'discernible' is also misleading. 'Statistically significant' just isn't any everyday concept, and trying to phrase it as one will encourage people to make mistakes. It's a complex concept: if we tried this experiment under a certain null hypothesis, then it'd be at least this improbable to see a result at least this extreme. The most I'd be willing to cut it down, after so much confusion in its actual use, is "subjunctively improbable", with the null hypothesis and the threshold left implicit. "Eating fat had no subjunctively improbable effect on weight gain." This sounds technical and fiddly, which I think is a feature: if you don't like it, don't base your reporting on a technical, fiddly concept.

"Eating fat had no discernible effect on weight gain" sounds like getting evidence against such an effect, but it's compatible with getting evidence in favor, that's just not as strong as some threshold. That evidence could be useful in a meta-analysis, or for a decision when waiting for more information isn't practical or economical, or if the potential gain from trying the nonsignificant treatment is high and the potential loss low. (I've seen "no significant X" abused this way. Nobody should try X -- it's unscientific!)


>Hint: doing big studies just so you can fail to find an effect is nonsensical. Instead, do big studies so you can put a tight upper bound on the effect. Don't tell me that vaccines don't have a significant (or discernible) effect on autism. Tell me that, with 99.9% confidence, you have ruled out the possibility that vaccines have caused more than ten autism cases in the entire history of vaccines, and that, most likely, they've caused no cases whatsoever (or whatever the right numbers are).

Checking whether a 100(1-a)% confidence interval excludes the null value is exactly equivalent to checking for a p-value below a.
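That duality is easy to check in R (arbitrary simulated data; the equality holds whatever the data turn out to be):

  set.seed(42)
  x <- rnorm(30, mean = 0.3); y <- rnorm(30)
  tt <- t.test(x, y, conf.level = 0.95)
  (tt$p.value < 0.05) == (tt$conf.int[1] > 0 | tt$conf.int[2] < 0)
  # TRUE: the 95% interval excludes 0 exactly when p < 0.05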


To solve the mystery of the significance of the p-value:

If mankind ran only one experiment, the p-value would be a useful tool. But when many experiments are run and people hide the results, showing only the ones that suit their self-promotion, then presenting a selected part of the real information (purposely or not) is a lie.


This is extremely important, and I'm not surprised the media is ignorant, doesn't understand, or "stretches the truth" to fit its worldview. Statistical significance has NOTHING to do with magnitude of difference!


>"Statistical significance has NOTHING to do with magnitude of difference!"

Here is the equation for a t-statistic (used for the common t-test):

d = mean(a) - mean(b)

s = sqrt(var(a) + var(b)) / sqrt(n)    (with n observations in each group)

t = d / s

I see the magnitude of difference (d) right in the numerator. The t-value is then compared to the t-distribution to get the tail probability.
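A quick sanity check in R that this hand-computed t matches the built-in test (simulated data, equal group sizes assumed):

  set.seed(7)
  n <- 50
  a <- rnorm(n, mean = 1.0); b <- rnorm(n, mean = 0.5)
  d <- mean(a) - mean(b)
  s <- sqrt(var(a) + var(b)) / sqrt(n)
  d / s                      # hand-computed t, with the magnitude of difference d in the numerator
  t.test(a, b)$statistic     # Welch t from R; identical when both groups have n observations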


You are also normalising the difference by its standard error, so the t-statistic has no units.


The claim that significance has nothing to do with the magnitude of a difference is just wrong, though. It clearly does; this information is just merged with other information about the variance and sample size to get the p-value, which is compared to a threshold to get significance.


I think what the article meant to say was that, for the same number of samples, the p-value you get with delta=50 (the difference between groups) and stdev=10 is the same as the p-value you get with delta=0.5 and stdev=0.1.

Depending on the study, a delta whose p-value is significant is not necessarily a delta that is large enough to be useful.
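One way to see that in R is to take the same simulated data and simply rescale it by a factor of 100 (numbers invented):

  set.seed(1)
  a1 <- rnorm(30, mean = 50, sd = 10); b1 <- rnorm(30, mean = 0, sd = 10)   # delta = 50,  stdev = 10
  a2 <- a1 / 100;                      b2 <- b1 / 100                       # delta = 0.5, stdev = 0.1
  t.test(a1, b1)$p.value
  t.test(a2, b2)$p.value   # identical: the p-value says nothing about whether the effect is big enough to be useful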


it's not "just wrong". On the contrary, it is asymptotically true. In a world of abundant data, that matters.


Please expand on this.


This is a fantastic point. Sometimes when someone hands me a study, I will ask what the effect size is. In some studies, even if there were a discernible effect, there is no hope for it to be anything but small.


> Don't tell me that vaccines don't have a significant (or discernible) effect on autism. Tell me that, with 99.9% confidence, you have ruled out the possibility that vaccines have caused more than ten autism cases in the entire history of vaccines, and that, most likely, they've caused no cases whatsoever (or whatever the right numbers are).

Whoa. Are you making a statement of what you think it would take to convince a hard-line anti-vaccination activist (the most charitable interpretation I can think of for your statement), or are you making that statement on your own behalf? I am chagrined to see this statement in the top comment in this thread. Reading the fine article submitted for discussion is strongly encouraged here on Hacker News, not just reading the headline, and I think you may have missed the part of the article that says, "Indeed, many of the ASA committee’s members argue in their commentaries that the problem isn’t p-values, just the way they’re used — 'failing to adjust them for cherry picking, multiple testing, post-data subgroups and other biasing selection effects,' as Deborah Mayo, a philosopher of statistics at Virginia Tech, puts it. When p-values are treated as a way to sort results into bins labeled significant or not significant, the vast efforts to collect and analyze data are degraded into mere labels, said Kenneth Rothman, an epidemiologist at Boston University."

Discernibility of effects versus lack of discernibility of effects is not the main issue here. The main issue is what effects are even possible, plausible, or likely in the first place. One way in which the law of evidence as it has developed in legal procedure over centuries is stronger than the statistical processes used in scientific publications is that the law of evidence remembers that human beings can have biases.

I am surprised to see a participant here on Hacker News holding to the strong prior that we must assume that vaccines have caused some cases of autism. Nope. There has never, ever, ever, been any evidence that suggests a causal mechanism such that vaccines could cause even one case of autism. Moreover, there is abundant evidence that the first persons who promoted such claims were trying to abuse the tort law system in England to shake down vaccine manufacturers for payments for imaginary harms.[1]

The kind of thinking brought into the discussion here in the comment quoted above doesn't take into account the difference between "evidence-based medicine" and "science-based medicine,"[2] in which clinical trials and other statistical data are supplemented with information from basic science about what priors are plausible and which are plainly implausible. There is no effect discernible from vaccines in increasing rates of autism because there is, on the basis of multiple lines of evidence, no effect that is the tiniest bit plausible for such a claimed harm, and a very plausible line of evidence showing greed and desire for publicity on the part of the corrupt lawyers and doctors who introduced the claim.

[1] http://www.bmj.com/content/342/bmj.c5347

[2] https://www.sciencebasedmedicine.org/about-science-based-med...


I'm making an intermediate claim. I find most studies that claim "A does not cause B" extremely unconvincing, exactly because of the statistical methodologies used and the fact that they don't actually do any statistics that would imply that A does not cause B.

The upshot is that I (and I'm more scientifically minded than most) am less than fully trusting of the interpretations given to a lot of claims resulting from scientific trials. Furthermore, I think a large part of this issue originates in both awful language used in actual papers and in the fact that people aren't doing the types of analysis that they should be doing.

The fact that I so frequently see news exclaiming some clinical result and that the statistical analysis (even if done rigorously and completely honestly) does not imply what the news claims means that something is very wrong in the way that statistics is used.

At least in Physics (my field), people tend to be a bit more careful about it. Physicists may do somewhat silly p-value tests, but they mostly insist on p values very very close to zero, and papers tend to at least answer the right question (such-and-such effect is upper-bounded by some tiny number to five sigmas).

I will backtrack a bit, though. I dug up "Vaccines are not associated with autism: An evidence-based meta-analysis of case-control and cohort studies". They do establish bounds on the odds ratio. Good for them.


5% (1 in 20) is a pretty weak threshold to pass. Let's go 5 sigma (p < 3e-7) for discoveries and reserve 3e-7 < p < 0.05 for stuff we should take closer looks at.
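For reference, the 3e-7 is roughly the one-sided tail area beyond 5 standard deviations of a normal distribution:

  pnorm(-5)       # one-sided 5-sigma tail: ~2.9e-7
  2 * pnorm(-5)   # two-sided: ~5.7e-7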


> 5% (1 in 20) is a pretty weak threshold to pass. Let's go 5 sigma (p < 3e-7) for discoveries and reserve 3e-7 < p < 0.05 for stuff we should take closer looks at.

This would still end up leading to misuse of P-values. Let's say you're doing a genome-wide association study on several hundred thousand SNPs. The traditional threshold is 5e-8 (0.05 / 1,000,000 effective tests). So using 3e-7 for the threshold for "discovery", you'd count many things as discovery that shouldn't be so.
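A rough back-of-the-envelope version of that, assuming (hypothetically) 1,000,000 independent tests that are all truly null:

  m <- 1e6
  0.05 / m             # Bonferroni-style per-test threshold: 5e-8
  1 - (1 - 3e-7)^m     # chance of at least one false "discovery" at the 3e-7 cutoff: ~0.26
  1 - (1 - 5e-8)^m     # the same chance at the 5e-8 GWAS threshold: ~0.05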

On the other hand, let's say you do a study with 20 people with cancer. You give 10 of them a drug, the other 10 a placebo. All 10 with the drug survive; all 10 with the placebo die. Your P value is 0.0002. This doesn't count as discovery, but clinically I know what my judgment is going to be.
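One way to put a number on that toy example is Fisher's exact test on the 2x2 table (not necessarily the test the 0.0002 above came from, but it makes the same point):

  survived <- matrix(c(10, 0, 0, 10), nrow = 2,
                     dimnames = list(arm = c("drug", "placebo"), outcome = c("alive", "dead")))
  fisher.test(survived)$p.value
  # ~1e-5: overwhelming by the usual 0.05 standard, yet nowhere near a 5-sigma "discovery"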

This is all to say that the misuse of P-values does not just come from the threshold.


Exactly right! There is a lot of misuse.

A friend of mine is researching cancer treatment at a commercial lab. They take a lot of long shots: lots of PhD students are hired, and each one is put to work on one of those long shots.

It costs them millions of euros, but potentially earns them billions.

These studies survive on statistics. They are even tweaked by running cross-validations that leave out the worst bins, "because they are extremes."

Always be very suspicious of statistical proofs, especially from companies with great commercial interests.


Multiple comparisons? 1 million independent tests? Hello?

>Your P value is 0.0002. This doesn't count as discovery, but clinically I know what my judgment is going to be.

>Let's go 5 sigma (p < 3e-7) for discoveries and reserve 0.05 < p < 3e-7 for stuff we should take closer looks at.

^Means you should start a new trial with n > 20 given the same placebo/drug split.


In the scientific world, it is economically infeasible (cost/time) to run 1 million tests for a given hypothesis. Even n > 20 can be difficult for certain studies. Bootstrapping the results to simulate 1 million trials won't fix the aforementioned issue either.


> Even n > 20 can be difficult for certain studies.

For some expensive experiments even n=3 is difficult to reach. "Should we do another replicate and go from n=2 to n=3, or should we hire another post-doc?" is a relatively common (rhetorical) question.


That's fine, but then your power is low, and nearly every result you get will be an exaggeration. The more stringent your p value threshold, the more dramatic your results must be to be significant; if your sample size isn't adequate, you'll only get significance if you overestimate the effect.

This is an enormously common problem even with current p value thresholds. It's part of the reason why you see dramatic "A causes B!" results followed by replications saying "well, only a little."

http://www.statisticsdonewrong.com/regression.html#truth-inf...
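Here is a small simulation of that effect, assuming (purely for illustration) a true effect of 0.3 standard deviations and an underpowered 20-per-group design:

  set.seed(123)
  true_delta <- 0.3; n <- 20
  sims <- replicate(20000, {
    x <- rnorm(n, mean = true_delta); y <- rnorm(n)
    tt <- t.test(x, y)
    c(p = tt$p.value, est = mean(x) - mean(y))
  })
  mean(sims["p", ] < 0.05)                 # power: only ~15% of studies "find" the effect
  mean(sims["est", sims["p", ] < 0.05])    # but those that do report, on average, well over double the true 0.3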


With a more stringent p-value threshold, you either need more dramatic results OR less variance, i.e., more sampling.


I once obtained a p-value of zero (or more accurately, smaller than the numerical precision of a p-value in R) for a result that was, by design, meaningless.

It's a bad idea to use p-values as thresholds, regardless of where you put the threshold.
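It's easy to see how that kind of "zero" happens: the true tail probability can simply be smaller than anything a double-precision number can represent. A quick illustration in R:

  pnorm(-50)                 # true value ~1e-545, below the smallest representable double: prints 0
  pnorm(-50, log.p = TRUE)   # the log of the same probability is perfectly finite: about -1255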



