As I recall, instead of "compatibility intervals" (or confidence intervals), other gainsayers of P tests have proposed simply making the existing P criterion more selective, like a threshold value of .01 rather than .05, which equates to increasing the sample size from a minimum of about 10 per cohort to 20 or more.
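For what it's worth, here is a rough power-analysis sketch of where that kind of sample-size jump comes from. The effect size (Cohen's d = 1.2) and 80% power are my own arbitrary assumptions, not something taken from the proposals themselves, so the exact numbers are illustrative only:

```python
# Rough illustration only: the required n per cohort depends entirely on the
# assumed effect size and power, both of which are picked arbitrarily here.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.01):
    # two-sample t-test, large standardized effect (d = 1.2), 80% power
    n_per_group = analysis.solve_power(effect_size=1.2, alpha=alpha, power=0.8)
    print(f"alpha = {alpha}: ~{n_per_group:.0f} subjects per cohort")
```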
I suspect this will be the eventual revision that's adopted in most domains, since some sort of binary test will still be demanded by researchers. Nobody wants to get mired in a long debate about possible confounding variables and statistical power in every paper they publish. As scientists they want to focus on the design of the experiment and results, not the methodological subtleties of experimental assessment.
Raising the threshold will not just reduce the probability of a false positive; it will also raise the probability of a false negative. The social sciences deal with complex phenomena, and it may be that there is no simple hypothesis like A -> B that describes reality at p<0.05. In reality A does cause B, it's just that there are also C, D, ..., Z, some of which also cause B, while others work the other way and cancel some of the rest. And some of them work only when the Moon is in the right phase.
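A quick Monte Carlo sketch of that trade-off, with an arbitrary sample size and effect size (nothing here comes from a real study):

```python
# With a fixed sample size and a modest true effect, tightening alpha cuts
# false positives but inflates false negatives. All parameters are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_effect, sims = 30, 0.5, 10_000

for alpha in (0.05, 0.01):
    false_pos = false_neg = 0
    for _ in range(sims):
        # Null world: no real difference, so any rejection is a false positive.
        a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            false_pos += 1
        # Effect world: a real difference, so any non-rejection is a false negative.
        a, b = rng.normal(0, 1, n), rng.normal(true_effect, 1, n)
        if stats.ttest_ind(a, b).pvalue >= alpha:
            false_neg += 1
    print(f"alpha = {alpha}: false positives ~{false_pos/sims:.3f}, "
          f"false negatives ~{false_neg/sims:.3f}")
```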
p<0.01 is good when we have a good model of reality that generally works. When we have no good model, there is no good value for p. The trouble is that all the hypotheses are lies; they are false. We need more data to find good hypotheses. And we think along the lines of "there are useful data and there are useless data; we need to collect the useful while rejecting the useless". But we do not know which data are useful while we have no theory.
There is an example from physics I like: static electricity. Researchers described in their works what causes static electricity, and there was a lot of empirical data. But all that data was useless, because the most important part of it never got recorded. The most important part was the temporality of the phenomenon: static electricity persisted for some time after charging and then discharged. Why? Because no material is a perfect insulator, there was a process of electrical discharge; there was voltage and current. That was the link to all the other known electrical phenomena. But physicists missed it because they had no theory; they didn't know what was important and what was not. They chased what was shiny, like the sparks from static electricity, not the lack of sparks after some time.
We are modern people. We are clever. We use statistics to decide what is important and what is not. Maybe that is a key, but we need to remember that it is not a perfect one.
> Raising the threshold will not just reduce the probability of a false positive; it will also raise the probability of a false negative.
And this article emphasises that huge numbers of scientists and their audiences cannot correctly interpret negatives (false or otherwise).
The negative means: "Our data do not appear to show any correlation that meets the criteria we chose to determine statistical significance."
This negative is frequently misinterpreted as: "We have shown with acceptable confidence that there is no correlation" (which does not inherently follow).
Therefore, adjusting the common p-value threshold to P<0.01 while not correcting this widely held cognitive error could potentially even worsen the problem, because people would encounter proportionally even more of the 'negative' results that they are already demonstrably poor at interpreting.
The problem with p-values is the opposite: in a complex system, the null hypothesis is almost never true. Everything affects everything else. It's the magnitude of the effect that is important, because it can be too small to have practical implications.
If you do enough studies, eventually you'll find a result with a p-value below 0.05 that overestimates the magnitude of the effect by a lot, and publish it.
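To make that concrete, here is a small simulation sketch; the true effect and sample size are arbitrary assumptions, chosen so that individual studies are underpowered:

```python
# Among many underpowered studies of a small true effect, the ones that
# happen to reach p < 0.05 overestimate the effect. Parameters are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n, sims = 0.2, 25, 10_000

all_estimates, significant_estimates = [], []
for _ in range(sims):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    est = treated.mean() - control.mean()
    all_estimates.append(est)
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        significant_estimates.append(est)

print("true effect:", true_effect)
print(f"mean estimate, all studies:          {np.mean(all_estimates):.2f}")
print(f"mean estimate, p<0.05 studies only:  {np.mean(significant_estimates):.2f}")
```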
In social sciences if the effects aren't clear and readily apparent without quibbling over whether p<0.05 or p<0.01 is the right standard then perhaps the whole thing is a waste of time. If our experimental techniques are insufficient for dealing with multiple factors and complex webs of causality then why bother?
How clear or important the effect is has nothing at all to do with a p-value. I can have p < 10^(-10) and still have an effect that is so weak it's meaningless. The confusion you're having is pervasive and a big part of the problem. You have to use effect sizes in order to measure this.
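A short illustration of that point with made-up data: a difference of 0.01 standard deviations and a couple of million samples per group, chosen arbitrarily:

```python
# With enough data, a negligible difference yields an astronomically small
# p-value. The effect and sample size here are arbitrary, purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 2_000_000
a = rng.normal(0.00, 1.0, n)  # control
b = rng.normal(0.01, 1.0, n)  # shifted by 1% of a standard deviation

result = stats.ttest_ind(a, b)
print(f"observed difference: {b.mean() - a.mean():.4f} standard deviations")
print(f"p-value: {result.pvalue:.1e}")  # typically far below 1e-10
```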
I think you're using a strange definition of p<x here. p<0.01 does not mean that the phenomenon is perfectly understood; it merely speaks to the likelihood that some independent variable affects some dependent variable in some direction.
For example, global weather is a hugely complex system that is far from fully understood. Still, take an incredibly simplistic predictive model of temperature: predict the current temperature by just using the average temperature for the current season. That's not going to be very accurate, but if you run it for a year it will almost certainly come in at p<.01.
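Here is a toy version of that, using synthetic data: a sinusoidal annual cycle plus noise stands in for real temperatures, which is my assumption for illustration, not a claim about actual weather records:

```python
# Even a crude "seasonal average" predictor correlates with the observed
# temperatures at p far below .01 over a year of daily values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
days = np.arange(365)
observed = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 5, 365)

# "Model": predict each day's temperature as the mean of its season.
season = (days // 91) % 4
predicted = np.array([observed[season == s].mean() for s in season])

r, p = stats.pearsonr(predicted, observed)
print(f"correlation r = {r:.2f}, p = {p:.1e}")  # crude, but wildly 'significant'
```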
> Raising the threshold will not just reduce the probability of a false positive; it will also raise the probability of a false negative.
Not really. You raise the probability of an inconclusive result when you could otherwise have gotten a positive. If you interpret p > threshold as “null hypothesis is true”, then you are doing the statistics wrong.
In most cases, I think a better model would be to extract an effect size such that an effect larger than the size is ruled out by the study to some degree of confidence. Currently, I read about studies that conclude that “such-and-such had no significant effect detected by this study.” Concretely, this looks like “vaccines had no significant effect on autism risk.” This may be accurate, but it’s lousy. How about “vaccines caused no more than an 0.01% increase in autism, and a bigger study could have set an even tighter limit.”
Physicists regularly do this. For example, we know that the universe has no overall curvature to a very good approximation.
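For the vaccine-style statement above, a minimal sketch of the computation might look like the following. The counts are invented for illustration (they are not from any actual study), and a real analysis would of course adjust for confounders:

```python
# Report an upper bound on the effect instead of "no significant effect",
# using a normal approximation for the risk difference. Counts are made up.
import numpy as np
from scipy import stats

cases_exposed, n_exposed = 120, 100_000      # hypothetical exposed group
cases_unexposed, n_unexposed = 118, 100_000  # hypothetical unexposed group

p1, p0 = cases_exposed / n_exposed, cases_unexposed / n_unexposed
diff = p1 - p0
se = np.sqrt(p1 * (1 - p1) / n_exposed + p0 * (1 - p0) / n_unexposed)

upper_95 = diff + stats.norm.ppf(0.95) * se  # one-sided 95% upper bound
print(f"estimated risk difference: {diff:+.5f}")
print(f"95% upper bound: at most a {upper_95:.3%} absolute increase")
```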
We already see this happening in the clinical literature, though. Effect estimates that show a positive clinical outcome but have confidence intervals that barely brush the null are described as having "no impact".
The example from the article about the two drug studies seems to indicate that would not be useful.
> For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.
> Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).
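Incidentally, those two p-values can be recovered, at least approximately, from the quoted risk ratios and intervals using the usual log-scale normal approximation. A sketch of that back-calculation (my reconstruction, not necessarily the authors' exact method):

```python
# Recover a two-sided p-value from a risk ratio and its 95% CI, assuming the
# CI was built on the log scale with a normal approximation.
import numpy as np
from scipy import stats

def p_from_rr_ci(rr, lo, hi):
    se_log_rr = (np.log(hi) - np.log(lo)) / (2 * 1.96)
    z = np.log(rr) / se_log_rr
    return 2 * stats.norm.sf(abs(z))

print(p_from_rr_ci(1.20, 0.97, 1.48))  # "non-significant" study -> ~0.09
print(p_from_rr_ci(1.20, 1.09, 1.33))  # earlier, more precise study -> ~0.0003
```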
The OP spends some time on a point that the threshold is fairly arbitrary, and the problem is misinterpreting what it actually _means_ for validity and other conclusions.
I suspect just changing the threshold (especially as a new universal threshold, rather than related to the nature of the experiment) wouldn't even strike the authors as an improvement.
> Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision.
Part of the problem is that p-values are not the best indicator in all applications. One question in my work is whether a process change affects the yield. A confidence interval of (-1%,+1%) is much different than (-20%,+20%), even though they would look the same if I was just interested in the p-value. We might also accept changes with a (-1%,+10%) confidence interval. We can't 'prove' that yields would increase, but there is significantly more upside than downside.
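To sketch that point in code: the p-value only sees the estimate relative to its uncertainty, so on its own it can't distinguish those intervals (the numbers below are invented, and a symmetric normal approximation is assumed):

```python
# A (-1%, +1%) interval and a (-20%, +20%) interval around a zero estimate
# are indistinguishable by p-value alone; the (-1%, +10%) case shows the
# asymmetric-upside situation described above.
from scipy import stats

def p_from_estimate_and_ci(est, lo, hi):
    se = (hi - lo) / (2 * 1.96)  # standard error implied by a 95% CI
    return 2 * stats.norm.sf(abs(est) / se)

print(p_from_estimate_and_ci(0.0, -1.0, 1.0))    # tight interval -> p = 1.0
print(p_from_estimate_and_ci(0.0, -20.0, 20.0))  # wide interval  -> p = 1.0
print(p_from_estimate_and_ci(4.5, -1.0, 10.0))   # mostly upside  -> p ~ 0.11
```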
Exactly, sample sizes will have to increase with the stricter threshold.
I'm surprised the authors don't talk about 'practical significance' vs 'statistical significance'. Statistical significance can be easily gamed, especially if you're relying on one study. I think the real problem is the reliance on one study to make broad generalizations. The 'replication crisis' is everywhere.
I hope that you're wrong because it's solving the wrong problem.
A compatibility interval (at an agreed upon arbitrary level) communicates the magnitude of the difference as well as the uncertainty, which makes it much better for comparing options.
If my medication alleviates symptoms for 94-95% of patients above the current gold standard of 92-93% you could say it's "statistically significantly better", but the marginal improvement may not be worth the investment. Conversely if my medication alleviates symptoms for 50-80% of patients and the gold standard is 45-55% it would at least warrant future research (and if my medication has fewer severe adverse events it might be a better bet overall).
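In code, that comparison might look something like this; the patient counts are invented assumptions purely for illustration:

```python
# Two-proportion comparison: difference, 95% CI, and p-value via a normal
# approximation. Illustrates statistical vs. practical significance.
import numpy as np
from scipy import stats

def compare(successes_new, n_new, successes_std, n_std):
    p_new, p_std = successes_new / n_new, successes_std / n_std
    diff = p_new - p_std
    se = np.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    p_value = 2 * stats.norm.sf(abs(diff / se))
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, ci, p_value

# 94.5% vs 92.5% in huge trials: statistically significant, marginal ~2 pp gain.
print(compare(9450, 10000, 9250, 10000))
# 66% vs 50% in small trials: not significant at .05, but the interval spans
# a potentially much larger benefit, so further research seems warranted.
print(compare(33, 50, 25, 50))
```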
But this is just one small part of the whole picture: ideally we'd have preregistered experiments, experimental data published (or available to researchers where that's not possible due to confidentiality), and incentives for replication. Maybe this is too much for every field of science, but in fields where a wrong decision could have a severely detrimental impact, these measures would create much more value than moving the P-value threshold.
'Nobody wants to get mired in a long debate about possible confounding variables and statistical power in every paper they publish.'
This is literally every paper I publish.
Also, tightening the p-value criterion has problems of its own - as mentioned, it boosts the false negative rate, which is not a consequence-free act. In the work I do, it's also a largely arbitrary threshold I can meet if I give the computer enough time.
The fundamental problem is that 0 is a privileged value of effect size. So you can replace a p-value with confidence intervals, or credibility intervals (which are the same as confidence intervals as N increases to infinity, and the data dominate the posterior), or whatever, but it will always be appropriate (in the relevant scenario) to ask "is there an effect size at all?"
This is why these calls to eliminate significance testing always seem really naive and short-sighted to me. P-values are abused, and people confuse p-values and effect size, but there will always be a need to focus on 0 as that supremely important number. An effect of size ε can be judged on its practical significance, but 0 is always less than any ε.
Anyway, I agree with you but wanted to point out that there's two sides to the coin, and both lead in the same direction.