It's always interesting to look at past ages through the lens of the modern age. I wonder if, decades or centuries from now, the misuse of statistics and inadequate up-front documentation of research methods in science will be seen as a great scientific scandal, or scientific failure, of the 20th and 21st centuries. We should have known better, but many of us failed in these basic ways.
An NPR episode of Planet Money covered the "Experiment Experiment": http://www.npr.org/sections/money/2016/01/15/463237871/episo... and described how the Reproducibility Project was unable to successfully reproduce a significant percentage of the studies it set out to replicate. There is legitimate concern that data-dredging leads to misunderstanding or misrepresenting results; and even if some of the studies are valid science, it's clear that the documentation of the studies does not suffice to allow others to replicate them, which is the bedrock of science.
The episode above also discussed a number of psychological fallacies and errors that experimenters make. Imagine that you begin a study of 200 people, measuring some variable. You see promising results, so you decide to extend the study and add more people; you test another 100 people.
By adding 100 more people to your study, have you increased or decreased the likelihood that your results were due to statistical chance? Counter-intuitively, the answer is that you have increased the odds that the result was due to chance. You cannot add or remove people to a study while it is in progress, based on preliminary results in the study, without impacting its statistical validity in important ways.
These and other changes that experimenters make during the course of an experiment create a great risk of conducting bad science. I for one support any effort to encourage all research studies and methods to be declared ahead of time.
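To make the point concrete, here is a minimal simulation sketch (my own illustration, not from the episode; it assumes normally distributed measurements, a two-sample t-test, and an arbitrary "promising" threshold) of how this kind of optional stopping inflates the false-positive rate even when there is no real effect:

    # Monte Carlo sketch of optional stopping. Both groups are drawn from the
    # SAME distribution, so every "significant" result is a false positive.
    # Group sizes and thresholds below are my own illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    trials = 20_000
    false_pos_fixed = 0      # single t-test at the final sample size
    false_pos_optional = 0   # peek at n=100/group, maybe extend to 150/group

    for _ in range(trials):
        a = rng.normal(size=150)
        b = rng.normal(size=150)

        # Fixed design: one test on the full sample.
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_pos_fixed += 1

        # Optional stopping: test the first 100 per group; if the result
        # looks "promising" (p < 0.10), add 50 more per group and re-test.
        p_interim = stats.ttest_ind(a[:100], b[:100]).pvalue
        if p_interim < 0.05:
            false_pos_optional += 1
        elif p_interim < 0.10:
            if stats.ttest_ind(a, b).pvalue < 0.05:
                false_pos_optional += 1

    print("fixed design false-positive rate:     ", false_pos_fixed / trials)
    print("optional-stopping false-positive rate:", false_pos_optional / trials)

Both designs nominally target a 5% false-positive rate, but the arm that peeks and then extends the study comes out noticeably above it; that excess is the "increased odds that the result was due to chance".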
>"Imagine that you begin a study of 200 people, measuring some variable. You see promising results, so you decide to extend the study and add more people; you test another 100 people.
By adding 100 more people to your study, have you increased or decreased the likelihood that your results were due to statistical chance? Counter-intuitively, the answer is that you have increased the odds that the result was due to chance."
I'll assume you are talking about using a t-test to see if two groups are samples from the same population. The problem you point out has nothing to do with chance. It is that you started with a null hypothesis that your groups were independent samples from the same distribution, but by having the second 100 people sampled conditional on the results of the first 200, you have ensured this is not true.
Such research designs are just a way of making sure the null hypothesis is false. If you get rid of the incorrect attribution to "chance", I don't see what is counter-intuitive about it.
> but by having the second 100 people sampled conditional on the results of the first 200, you have ensured this is not true.
I still think this is counter-intuitive, and you must have a lot of practice in the field to get an intuitive feel for such cases.
The next 100 people are sampled just as randomly as the first 200. So by adding 100 people, we are essentially re-doing the experiment with a larger sample.
So how could the original experiment be valid if it had had 300 samples to begin with, when the now-augmented experiment, which for all intents and purposes is the same experiment, isn't valid?
I am not defending the validity of the above argument, but I am saying that it sounds pretty damn obvious.
You have to think about the hypothesis you are testing. This would be that groups A and B (including both the first set of 200 and second set of 100 people) have been independently sampled from the same distribution. This hypothesis is used to calculate a prediction of the expected results.
Instead your data is forced to consist of a sample where mean(A)-mean(B)=delta, where delta>0, for the first 200 people. Knowing that, would you make the same prediction about the final result (after the data from all 300 people is in)? It would be the same as getting data from all 300 people at once and adding delta to the first 200 in Group A. Clearly you have specifically created a deviation from the original hypothesis by design.
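A quick way to see this is to simulate it (my own sketch, with made-up numbers: two groups of 150 drawn from the same distribution, and the study only "extended" when the first-stage difference clears a threshold):

    # Both groups come from the same population, but we keep only the runs
    # where the first 100 people per group show mean(A) - mean(B) > 0.2,
    # i.e. the runs where the experimenter would have extended the study.
    # Among those runs the full 300-person comparison is no longer centred
    # on zero, even though the true difference is exactly zero.
    import numpy as np

    rng = np.random.default_rng(1)
    final_diffs = []
    while len(final_diffs) < 5_000:
        a = rng.normal(size=150)
        b = rng.normal(size=150)
        if a[:100].mean() - b[:100].mean() > 0.2:   # "promising" interim result
            final_diffs.append(a.mean() - b.mean())

    print("mean final difference, given a promising interim result:",
          round(float(np.mean(final_diffs)), 3))

The conditioning alone produces a positive expected difference in the final data, which is exactly the deviation "created by design" described above.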
Also, let me note I said nothing about validity. I don't consider testing a hypothesis different than the research hypothesis to be a valid scientific activity. Just because the null hypothesis is false does not mean your research hypothesis is accurate or useful, so it is pointless from a scientific perspective. To me, this is on the level of arguing whether the Holy Spirit proceeds from the Father, or the Father and the Son. The entire premise behind the discussion is flawed, but we can still discuss the proper application of reason/logic to the arguments flowing from this flawed premise.
It is counter-intuitive because the vast majority of people misunderstand statistics and probability. Casinos make money, people play the lottery, people are nervous about flying but not about motorway driving, and people think smoking or eating too much cake won't harm them.
>By adding 100 more people to your study, have you increased or decreased the likelihood that your results were due to statistical chance? Counter-intuitively, the answer is that you have increased the odds that the result was due to chance. You cannot add or remove people to a study while it is in progress, based on preliminary results in the study, without impacting its statistical validity in important ways.
You might have a specific experimental setup and statistical test in mind where this statement is accurate, but in general it is false.
For example, imagine that you have a collection of 200 marbles, all of which are either red or blue, and you wish to conduct an experiment to estimate whether the total number of blue marbles is greater than or equal to the number of red marbles. If you randomly sample 100 marbles without replacement you can arrive at some estimate of the proportion of each color (the exact methodology doesn't matter for this discussion). Having gained some confidence that your preferred hypothesis was true, you sample another 100 marbles without replacement, at which point you know the exact answer to the posed problem (the statistical argument is trivial). It is also worth pointing out that you don't necessarily need to count every marble to arrive at this kind of result (if at some point you have determined that there are at least 101 red marbles, you know the exact answer to the proposition).
The experiment described above is obviously academic, but it serves as a basic example of adaptive experimental designs that home in on exact answers by way of pigeonhole principles. There are also statistically flavored extensions of these designs that allow one to assign subjective degrees of belief to different outcomes.
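A toy version of that adaptive design, as a sketch (the 120/80 split below is invented purely for illustration):

    # 200 marbles sampled without replacement. The question "are there at
    # least as many blue marbles as red?" is settled with certainty as soon
    # as either 100 blue or 101 red marbles have been seen (pigeonhole),
    # usually well before every marble is counted.
    import random

    marbles = ["blue"] * 120 + ["red"] * 80
    random.shuffle(marbles)

    blue_seen = red_seen = 0
    for drawn, color in enumerate(marbles, start=1):
        blue_seen += color == "blue"
        red_seen += color == "red"
        if blue_seen >= 100:
            print(f"after {drawn} draws: blue >= red, with certainty")
            break
        if red_seen >= 101:
            print(f"after {drawn} draws: red > blue, with certainty")
            break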
Uri Simonsohn's website[1] is full of links to interesting papers by Simonsohn and his colleagues that illustrate the dangers of "researcher degrees of freedom" and back up the points you are making here.
I would like to eventually see a "science results build system", where experimenters can register (preferably ahead of time) the analytical methods they will use to analyze their data, expressed as runnable code. Any data sets already available, on which they plan to rely, will be similarly registered and made available to the "build system".
When the study is complete, the researchers upload their data, in as raw a form as technology permits. All analysis performed on the data to achieve the results should be expressed through runnable code evaluated by the build system. Ideally 100% of this code would be registered in advance, though realistically some of it would end up being written along the way.
Whether the code is registered in advance or written and registered along the way, in both cases the build system will have the complete code and data set, and will be able to reproduce the results. Ideally the experimenters simply take the output of the build system as their actual results. The scientific community will have complete visibility into the analytical methods of the study, since all data and analysis will have been captured by the build system.
The build system would ideally be agnostic to tooling and data format. Perhaps it is something like a data store alongside a runtime platform, capable of taking a virtual machine image and a data set, and executing the VM with the data set as input.
Tagline: "GitHub for scientific data and runnable experimental analysis"
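As a rough sketch of what the registration and replay steps might look like (every name and file here is hypothetical; the real thing would presumably involve a container or VM image rather than a bare script):

    # Hypothetical flow: hash and register the analysis code before data
    # collection, then later verify the code is unchanged and re-run it
    # against the uploaded raw data. All paths and names are invented.
    import hashlib
    import json
    import subprocess
    from pathlib import Path

    def sha256(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def register(analysis_script: Path, registry: Path) -> None:
        """Record the analysis code's hash before any data exists."""
        record = {"analysis": analysis_script.name, "sha256": sha256(analysis_script)}
        registry.write_text(json.dumps(record, indent=2))

    def reproduce(analysis_script: Path, data_file: Path, registry: Path) -> None:
        """Check the code against its pre-registered hash, then run it on the raw data."""
        record = json.loads(registry.read_text())
        if sha256(analysis_script) != record["sha256"]:
            raise RuntimeError("analysis code differs from the pre-registered version")
        subprocess.run(["python", str(analysis_script), str(data_file)], check=True)

    # register(Path("analysis.py"), Path("registration.json"))                    # before the study
    # reproduce(Path("analysis.py"), Path("data.csv"), Path("registration.json")) # after the study

Anything the build system runs this way is, by construction, reproducible by anyone else holding the same registry entry and data set.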
This reminds me of the recent Planet Money episode "The Experiment Experiment", which pointed out that some journals are requesting pre-registered experiments (pre-committed experiment methods and analysis submitted somewhere online before the experiment is run); it's at the 18-ish minute mark.
This sounds like a pretty great idea if I understand it correctly. My cynical view of the business of science says it'll never get traction but the base idea sounds like a great way to get rid of a lot of cheating methods. It would also result in research that I'd immediately trust.
Alas, it's probably still possible to game this theoretical system: you could upload faulty data that leads to the desired results, since the registered code could probably be reverse-engineered fairly easily (and would be open anyway).
I think there are two issues (which can also be combined).
1) Fake/false data.
2) Fiddling with methods until you get the results you want given a fixed data set.
Your solution would be excellent for getting rid of problem #2. I think #1 is quite a bit harder, and I haven't come up with anything better than "as open as possible".