I've been grousing about the state of science lately in some other comments; this is basically what I was talking about. It's still the best method to the truth we've got, but it needs some upgrades.
One other big error I'm reasonably confident about (though I'd welcome corrections from those who know more stats) is that the p-value, in addition to its other faults and misinterpretations, is usually used in a manner that assumes a Gaussian distribution. While the Central Limit Theorem does mean we tend to see that more often than some other distributions, that doesn't make it safe to simply assume your data is Gaussian. You really need to demonstrate that it is first; only then can you start using Gaussian-based tools on the data.
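A minimal sketch (assuming numpy/scipy, with made-up data) of what "check first" might look like in practice: test the normality assumption, and fall back to a rank-based test if it looks shaky.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical skewed measurements from two groups (illustrative only)
    group_a = rng.lognormal(mean=0.0, sigma=0.8, size=40)
    group_b = rng.lognormal(mean=0.3, sigma=0.8, size=40)

    # Check the normality assumption before reaching for a t-test
    for name, x in [("A", group_a), ("B", group_b)]:
        w, p = stats.shapiro(x)
        print(f"group {name}: Shapiro-Wilk p = {p:.3f}")

    # If normality looks doubtful, a rank-based test avoids the assumption
    print("t-test p:        ", stats.ttest_ind(group_a, group_b).pvalue)
    print("Mann-Whitney U p:", stats.mannwhitneyu(group_a, group_b).pvalue)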
I am wary of trying to contribute to this (a little knowledge is a dangerous thing) but hopefully someone can correct my mistakes. (And what better way for me to learn...)
A typical controlled science experiment is designed to take measurements of multiple groups where one variable is different between the groups and others are controlled. We wish to see if the variable of interest has an effect.
Therefore the commonest statistical test is to determine whether the mean values of the groups are different (t-test for two groups, ANOVA for more), e.g. does mean blood pressure increase on a high-salt diet.
If I understand the CLT (big if) then the distribution of the _mean_ of a sample is, by the CLT, going to be Gaussian, regardless of the distribution from which the actual measurements are drawn. i.e. for comparing group means it doesn't matter if my data is sampled from Gaussians or not.
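Here's a quick simulation of that point (made-up numbers, assuming numpy/scipy): even when the individual measurements are exponentially distributed, the sample means come out close to Gaussian at n = 50.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Measurements drawn from a clearly non-Gaussian (exponential) distribution
    n_per_group = 50          # measurements per "experiment"
    n_experiments = 10_000    # repeated experiments
    samples = rng.exponential(scale=1.0, size=(n_experiments, n_per_group))
    means = samples.mean(axis=1)

    # The raw data are strongly skewed, but the sample means are nearly symmetric
    print("skewness of raw data:    ", stats.skew(samples.ravel()))
    print("skewness of sample means:", stats.skew(means))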
Of course that leads to the question of if a significant difference in group means is really relevant in a given context.
"If I understand the CLT (big if) then the distribution of the _mean_ of a sample is, by the CLT, going to be Gaussian"
Yes, in the limit of infinite sample size. The rate at which it converges to a Gaussian, though, depends strongly on the distribution from which the measurements are drawn.
This is also only true if each measurement is an independent draw from the same distribution. If you don't understand the system you're sampling from, or where the variability comes from (maybe it's systematic), then all bets are off.
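A rough illustration of that convergence point (illustrative distributions only, assuming numpy/scipy): at the same sample size, means of uniform draws are already nearly symmetric, while means of heavy-tailed lognormal draws are still badly skewed.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps = 10, 20_000

    # Means of n=10 draws from a light-tailed vs. a heavy-tailed distribution
    uniform_means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)
    lognorm_means = rng.lognormal(mean=0.0, sigma=2.0, size=(reps, n)).mean(axis=1)

    # A Gaussian has skewness 0; the heavy-tailed case is nowhere near that at n=10
    print("skew of uniform means:  ", stats.skew(uniform_means))
    print("skew of lognormal means:", stats.skew(lognorm_means))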
But moreover, in most situations, the other variables cannot be absolutely controlled but rather are only roughly controlled. How roughly depends on the field and is often the rub.
Absolutely true, but this is more a criticism of experiment design than statistical methodology.
Also, it is bias in uncontrolled variables that is most concerning (more so than uncontrolled noise, although that will reduce the power to detect real differences).
Over the years, hundreds of published papers have warned that science’s love affair with statistics has spawned countless illegitimate findings. In fact, if you believe what you read in the scientific literature, you shouldn’t believe what you read in the scientific literature.
"There is increasing concern," declared epidemiologist John Ioannidis in a highly cited 2005 paper in PLoS Medicine, "that in modern research, false findings may be the majority or even the vast majority of published research claims."
I'm not sure that it's fair to blame this fact on statistics. Incorrect statistics can give rise to illegitimate findings, just like improper measurement technique can. That's not the fault of statistics, it's the fault of people incorrectly applying tools and techniques.
That said, I agree with the conclusion that published research is full of incorrect results. I basically don't believe anything that comes out of a single paper unless it's been confirmed by unrelated work. That's the fault not just of incorrect statistics but of many other things, not the least of which is that doing research is hard and there is no way of knowing your end result is right.
Some is just misuse of statistics, but there are some stronger claims from various corners that the general meta-procedure is unsound. In particular, scientists get to pick among a number of different statistical tests, and scientific results using very different choices of statistical frameworks then end up aggregated: someone finds out X about heart attacks using method y, and someone else finds out X' using method z, and now medicine is said to know both X and X'.
More formally, most attempts to "lift" statistical inference into logical inference are unsound. A lot of scientists have, at least informally, a mixed model of: we acquire facts statistically (via e.g. hypothesis tests), then once we have those facts, we reason about them logically (using the usual rules of logical argument). But if you try to formalize this, e.g. building a system that derives facts from data by statistical hypothesis testing, and then uses first-order logical inference on the resulting fact base, you quickly get paradoxes.
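A back-of-the-envelope way to see the problem (plain union-bound arithmetic, not anyone's actual formal system): treat each accepted "fact" as having, loosely, a 5% chance of being wrong, and watch the guarantee on their conjunction evaporate as you chain them.

    # Each "fact" passes a test with a 5% false-positive rate.  Treat that,
    # loosely, as a 5% chance the fact is wrong.  Downstream logic then uses
    # the facts as if they were simply true, but the union bound on the
    # conjunction degrades fast:
    alpha = 0.05
    for k in (1, 5, 20, 100):
        print(f"{k:3d} chained facts: P(all correct) >= {max(0.0, 1 - k * alpha):.2f}")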
The obvious fix is to (a) require raw data to be published; (b) require journals to accept papers before the experiment is performed, with the advance paper including a specification of what statistics were selected in advance to be run on the results; (c) raise the standard "significance" level to p<0.0001; and (d) junk all the damned overcomplicated status-seeking impressive nonsense of classical statistics and go to simple understandable Bayesian likelihoods.
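For (d), here's a toy of what reporting a likelihood ratio instead of a p-value might look like (made-up data, assuming scipy; a sketch, not a spec of what Yudkowsky has in mind):

    import numpy as np
    from scipy import stats

    # Toy "Bayesian likelihood" report: how much better does "effect = 1.0"
    # explain the data than "no effect"?  (Made-up measurements.)
    rng = np.random.default_rng(3)
    data = rng.normal(loc=0.8, scale=1.0, size=30)

    loglik_null   = stats.norm.logpdf(data, loc=0.0, scale=1.0).sum()
    loglik_effect = stats.norm.logpdf(data, loc=1.0, scale=1.0).sum()
    print("likelihood ratio (effect vs. no effect):",
          np.exp(loglik_effect - loglik_null))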
Suggestion b is both radical and very thought provoking. Which, at this point, I expect from Yudkowsky.
I think in some fields achieving (c) is almost impossible - the n number needed would be too expensive for most labs working with animal models, for instance. (b) would be interesting, especially if it led to a culture where "failed" results are still worth publishing because they disprove something. I think (a) is the one worth focusing on first, since that seems the most achievable and would let you back-apply some of the other improvements at a later date.
Interesting. The idea that null results are not worth publishing has since the very beginning struck me as one of the most fundamentally flawed ideas in science. Interestingly, it seems to be very domain-dependent. In my field (astrophysics), null results are fairly frequently published, but I've heard that in other fields it's totally impossible to do.
I've run across one or two compiler optimization papers where the conclusion given was that the proposed technique didn't work out so well, but on the whole, it seems it applies there as well. I agree that it's a problem -- if a null result is not published, other people will probably waste years making the same mistakes.
Yeah, the wasted repeated effort was what I initially thought about, too. The argument about how compilations of results will be systematically skewed by certain results not being published is perhaps even more persuasive, because it doesn't just lead to wasted effort but to incorrect results.
I'm not sure that it's fair to blame this fact on statistics.
That's certainly true. The only thing you can say here about statistics is that it's hard, much harder than it looks.
Popular media reports with phrases like "Over 80% of..." create the impression that statistics can be encapsulated in a simple number, but that's far, far from true.
Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition.
This is important, but not quite accurate. It would be more correct to say that this is what the P value means if the model is correctly specified. And in the social sciences (or, for that matter, biology), this is almost never the case.
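To see what the p-value does promise when the model is right, here's a quick sanity-check simulation (assuming numpy/scipy): with no real effect and a correctly specified model, p < .05 shows up about 5% of the time.

    import numpy as np
    from scipy import stats

    # No real effect, correctly specified model: p < .05 should occur ~5%
    # of the time, and that's all the p-value promises.
    rng = np.random.default_rng(4)
    trials, hits = 10_000, 0
    for _ in range(trials):
        a = rng.normal(size=30)   # two groups drawn from the same distribution
        b = rng.normal(size=30)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    print("fraction of experiments with p < .05:", hits / trials)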
When I was an undergrad taking econometrics, this was incredibly frustrating. I swore there had to be something I just wasn't getting; why did scientists put so much credence in numbers that rely on assumptions that they know to be false? Of course, at the same time, I love microeconomic theory, which lies on a similarly fictitious basis.
Over time, I relaxed a bit in my attitude toward statistics. While I don't mean to diminish the importance of proper, rigorous methodology, the fact is that statistical methods are just a narrative device. They give us a way of telling plausible stories and discarding implausible ones. We'd be foolish to believe that we can always tell correlation, causality and coincidence apart, but we do a better job by using statistics than we would without.
Indeed. I had a similar realization when I observed that the estimated parameter error on a chi-square fit does not depend on the actual chi-square value itself. This seemed preposterous to me: shouldn't the parameters be more uncertain if the fit is bad? Then I came across this passage in Numerical Recipes that said something like "remember that all of this is under the assumption that the model being fit to is actually the one from which the data points are drawn. If the reduced chi-square value is >>1, then that indicates that this is not the case and then the entire procedure is suspect."
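Here's a toy version of exactly that caveat (made-up data, assuming scipy): fit a straight line to data that really is a line and to data that isn't. The formal parameter errors (with absolute_sigma=True) come out essentially the same, while the reduced chi-square screams that the second model is wrong.

    import numpy as np
    from scipy.optimize import curve_fit

    def line(x, a, b):
        return a * x + b

    rng = np.random.default_rng(5)
    x = np.linspace(0, 10, 50)
    sigma = np.full_like(x, 0.5)

    y_good = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)                # really a line
    y_bad  = 2.0 * x + 1.0 + 0.3 * x**2 + rng.normal(0, 0.5, x.size)   # really isn't

    for label, y in [("good model", y_good), ("bad model ", y_bad)]:
        popt, pcov = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
        red_chi2 = np.sum(((y - line(x, *popt)) / sigma) ** 2) / (x.size - 2)
        # The formal errors come from pcov alone and ignore how bad the fit is
        print(label, "param errors:", np.sqrt(np.diag(pcov)),
              " reduced chi2:", round(red_chi2, 1))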
I remember recently reading a study that looked at possible correlations between political beliefs and squeamishness. They asked about 20 questions in each area, and found that 3 of the political questions correlated to 8 of the squeamishness questions at the 95% significance level.
Recall that 95% significance means a 5% chance of seeing a correlation that strong when there's no real relationship. Correlating 20 questions with 20 others gives you 400 possibilities; they found 24 correlations. Given 400 trials, if there were no real correlations at all, they should've found about 20 flukes.
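A quick simulation of that study design (all numbers made up, assuming numpy/scipy) shows roughly the expected number of flukes with purely random answers:

    import numpy as np
    from scipy import stats

    # 20 "political" and 20 "squeamishness" answers per respondent, with no
    # real relationship between them, tested pairwise at the 95% level.
    rng = np.random.default_rng(6)
    political = rng.normal(size=(200, 20))
    squeamish = rng.normal(size=(200, 20))

    flukes = 0
    for i in range(20):
        for j in range(20):
            r, p = stats.pearsonr(political[:, i], squeamish[:, j])
            if p < 0.05:
                flukes += 1
    print("significant correlations out of 400:", flukes)   # expect ~20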
Needless to say, I was overall unimpressed by their results.
The usual rule of thumb that I hear is this: don't trust any single medical study. Look at the broader literature in the area and see if there's a general consensus. Until there's some consensus, be wary.
"The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation"
If you have only studied first year statistics this is what you will learn as "science". As soon as you get to subsequent study, you realise why hypothesis testing is almost always the wrong approach.
I'm not sure who this guy has been talking to, but it sure hasn't been statisticians!
I must have entered a time portal - that article isn't published until the 27th March!
I totally agree that translation of results usually makes the jump from 'statistically significant at p < .05' to 'fact' - and that's where the problem is.
I would split theoretical ability and statistical ability into separate skillsets. Having the understanding in the field and the ability to posit reasonable hypotheses to test is very different (at least in most fields!) from having the statistical knowledge and ability to properly generate results.
And I think larger labs do frequently separate out a few of these. I'd love to have a better understanding of how industry labs differ from academic labs in that respect.
I'm doing a master's degree in science right now. I've previously worked as a telecomm engineer. I agree that there is relatively little division of labour in science, compared to engineering; or at least the labour isn't divided in the same way. I'm not convinced that this is bad, and in fact it's probably necessary.
Most of the useful scientific results synthesize previous results from various fields. There are very few incoming observations that actually add to or question the prevailing theories of a given field. It seems to me from my short experience that most of the work is in reducing theories, and experimentally confirming the reductions. To be useful as a scientist, there's a good chance that you'll need to be an expert in at least two things.
I don't think it should be surprising that generating novel knowledge requires substantially more overhead than does acquiring and exploiting it (e.g. engineering, or any practical pursuit where division of labour works so well). This may not apply to the medical research that was mentioned in the article (the system is too complex for theoretical reduction to be useful), but I doubt that physical, ecological, mathematical, etc. sciences would progress as quickly without their practitioners having broad knowledge of their subjects.
I think that's a bit unfair. Researchers work in teams and divide labor just like everyone else. The blunt truth of reality is there are more ways to be wrong than there are to be right in both science and business.