Hacker News
Is Economics Research Replicable? Sixty Published Papers Say “Usually Not” [pdf] (federalreserve.gov)
73 points by Gimpei on Oct 11, 2015 | 17 comments



> The most common reason we are unable to replicate the remaining 45 papers is that the authors do not provide data and code replication files.

> We define a successful replication as when the authors or journal provide data and code files that allow us to qualitatively reproduce the key results of the paper.

Well, this is underwhelming. I mean, sure, they're talking about papers in journals for which sharing data and code on request is required, so they have definitely exposed widespread ignorance of the rules, maybe even a refusal to adhere to them... but the replication being talked about is "can we download the data, press a button and get the same results the original authors did?" and not "can we run the experiment again, or run the analysis independently, and get similar or the same results?"

Personally I like the distinction between replication (a new experiment but with the same setup), reproduction (corroborate using different methods) and re-analysis (download the data, run the code, maybe do some additional analysis). This paper is entirely about re-analysis, not about replication or reproduction. (Cf. http://sequoia.cs.byu.edu/lab/files/reser2010/proceedings/Go...)

In one sense, failed re-analysis means the research cannot even clear the lowest possible bar: you can't even check whether the analysis produces the numbers mentioned in the paper. But in another sense, whether or not researchers manage to release their code is only very weakly associated with how good that research is. Research might "fail" re-analysis because no code was provided, yet survive both replication and reproduction.

The authors compare their work with the Open Science Collaboration, which recently pointed out so many unreplicable studies in psychology, but this is not a fair comparison at all. The Open Science Collaboration was a huge endeavor that redid a bunch of experiments from scratch. This is just asking authors "give me your data" and marking "did not replicate" if they didn't.


While I agree with the bulk of what you said, I do think it's important to understand that the difference with the Open Science Collaboration isn't really what you said. In my opinion it may be worse.

Being a (somewhat disillusioned) economist, I've read some of these papers but certainly not all of them. What I can tell you is that there isn't an "experiment" to replicate. These papers seem to be mostly or entirely computational macroeconomics and econometrics. In work like this, they design a model (a simple example is the real business cycle model), pump random data into it, and see how different changes affect the model (e.g., the relationship between the volatility of unemployment and the volatility of output) and whether those relationships match what we see in the data.

Replication as you define it (and I agree with the definition) should mean pumping new random data into the model and still getting the same results. However, that still leaves a few big issues, such as: are these relationships really in the data? Some of those relationships change depending on the time frame, so the results may actually explain what occurred, but they shouldn't be used to explain what will occur later.

For this kind of research, reproduction and re-analysis probably need to go together. If we've defined a mathematical model, then it should be possible to program that model across platforms and software and still yield consistent results with different sources of random input data. And I think verification is really important here, because I know I've messed up my programs before and gotten completely reasonable output that turned out to be incorrect. The authors didn't exactly describe how much they verified that the programs were doing what they were supposed to do.
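To make that concrete, here is a minimal sketch (a toy AR(1) model with made-up parameters, not taken from any of the papers) of what a cross-implementation consistency check could look like: two independent implementations of the same stochastic process should agree exactly under the same random inputs, and the reported statistic should not hinge on one lucky random draw.

```python
import numpy as np
from scipy.signal import lfilter

def ar1_loop(rho, sigma, T, seed):
    """Reference implementation: explicit recursion y[t] = rho*y[t-1] + eps[t]."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, T)
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = rho * y[t - 1] + eps[t]
    return y

def ar1_filter(rho, sigma, T, seed):
    """Independent re-implementation of the same process as a linear filter."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, T)
    eps[0] = 0.0  # the loop version starts from y[0] = 0 and never uses eps[0]
    return lfilter([1.0], [1.0, -rho], eps)

# Same random inputs: the two implementations should agree to machine precision.
assert np.allclose(ar1_loop(0.9, 0.01, 2000, seed=7), ar1_filter(0.9, 0.01, 2000, seed=7))

# Fresh random inputs: the summary statistic should be stable across draws.
vols = [ar1_loop(0.9, 0.01, 2000, seed=s).std() for s in range(30)]
print(np.mean(vols), np.std(vols))  # roughly sigma / sqrt(1 - rho^2), with modest spread
```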

Honestly, I don't know that I can make myself care too much about the output results from the model until we can agree on what things are important for the model to show in the past, present, and future. And in academic economics, these important characteristics are almost canon and untouchable.


"<snip> then it should be possible to program that model across platforms and software and still yield consistent results with different sources of random input data."

This hits too close to home for me not to comment on. I do basically exactly this - redevelop models into production-quality code for broader deployment. I do this for 'closed' models as well (code that researchers do not have available for download from a website, for whatever reasons - mostly because they don't care, which is fine). Models being 'closed' this way does not make them 'black boxes' or 'not reproducible' - whatever the code does needs to be described in the paper(s) anyway (the concepts, not the implementation details).

The way to do a baseline verification of a model implementation is to have minimal synthetic data sets and run unit tests on them. Usually people develop their model against their full 20,000-observation (or whatever) data set, with 15-digit numbers and so on - the only way to spot mistakes in such an implementation is if they are several orders of magnitude off.

I once found a calculation in some Fortran code that mistook kilometers for meters (or the other way around, I can't remember; either way, one component of the model was off by a factor of 1000). This hadn't been discovered in 10+ years, by many users, some of whom (much to my horror) actually used this model to advise on subsidies for certain sectors. Now, it's not that the results were completely unreasonable, because then someone would have noticed; it's the small mistakes that are the worst, especially when they are non-linear. Despite that, and many examples like it that I have encountered, it proves to be nigh impossible to change the software development hygiene of most researchers.
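As an illustration of that approach (hypothetical function and numbers, not the actual Fortran model): a unit test on a tiny, hand-checkable synthetic case makes a kilometers-vs-meters slip impossible to miss, where a full-size data set with 15-digit numbers would hide it.

```python
def travel_time_hours(distance_km, speed_km_per_h):
    """Toy model component. A buggy variant that silently received meters
    instead of kilometers would be off by exactly a factor of 1000."""
    return distance_km / speed_km_per_h

def test_travel_time_on_minimal_synthetic_case():
    # 100 km at 50 km/h must be 2 hours; any unit mixup produces 0.002 or 2000
    # and fails immediately, instead of lurking in the model for a decade.
    assert abs(travel_time_hours(100.0, 50.0) - 2.0) < 1e-9

test_travel_time_on_minimal_synthetic_case()
```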


> they design a model... pump random data into it and see how different changes affect the model

?!?!

This is such an odd way to demonstrate results about a model. For hypothesis testing or preliminary research, sure. But as a result?

> If we've defined a mathematical model

...then the way that we establish properties about the model is by writing a proof.

What am I missing?


> This is such an odd way to demonstrate results about a model. For hypothesis testing or preliminary research, sure. But as a result?

So these are typically some combination of dynamic and stochastic difference equations that are supposed to mimic the real economy and may not have any analytical solution. The goal is to include different types of shocks, different inputs, and different functional forms and run simulations to see if the output shows the same patterns as the real economy.

What we tend to find in these models is that we can easily get some of the characteristics to match while others do not: e.g., matching the volatility of unemployment but not, at the same time, the volatility of wages, output, consumption, etc.
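For readers who haven't seen this kind of exercise, here is a deliberately stripped-down sketch (a single AR(1) shock process with made-up sensitivities and "data" targets; nothing from an actual paper) of what "simulate the model and see whether the volatilities line up" means in practice:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "economy": log productivity follows an AR(1) shock process; output,
# consumption and hours respond with different (made-up) sensitivities.
rho, sigma, T = 0.95, 0.007, 2000
z = np.zeros(T)
for t in range(1, T):
    z[t] = rho * z[t - 1] + rng.normal(0.0, sigma)

output = z
consumption = 0.5 * z                          # consumption smoother than output
hours = 0.6 * z + rng.normal(0.0, 0.001, T)    # hours too smooth in this toy model

# "Calibration targets": relative volatilities measured in real data
# (placeholder numbers, not actual U.S. moments).
targets = {"consumption/output": 0.50, "hours/output": 0.95}
simulated = {
    "consumption/output": consumption.std() / output.std(),
    "hours/output": hours.std() / output.std(),
}

# Some moments come out close to the data, others don't -- which is exactly
# the pattern described above.
for moment, target in targets.items():
    print(f"{moment}: simulated {simulated[moment]:.2f} vs data {target:.2f}")
```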

This is more along the lines of trying to be engineers rather than mathematicians, which I'm fine with. My issues are: (1) the supposed relationships we are trying to match are not guaranteed across nations or time, so even if the model describes a certain period of time very accurately, it isn't useful in general; and (2) we take the parameter values that come out of a simulation that is, at best, good at describing some of what happens in the economy, and treat them as if they apply to the real world (an example parameter might be the risk aversion of consumers in the economy).

We then use these parameters to describe what changes should be made to the economy in the event of certain types of shocks.

And if I'm being honest, my biggest complaint about these models stems from the idea that they all need magical micro-foundations in order to be considered realistic. This comes from the Lucas Critique and is likely the biggest travesty in the history of economics as a science. It is along the lines of saying general relativity isn't realistic unless it can be built up from quantum mechanics.

What absolutely kills me is that many economists obviously compartmentalize academic research from what they suggest doing in the real world (as can be seen from the fact that the majority of economists polled supported the stimulus in the U.S. and felt that it helped the recovery from the recession, or at least kept it from getting worse).


One important side-effect of an author providing data and code, even though it is only good for re-analysis, is that it provides completely unambiguous documentation of what the original author "did".

For example, if the original paper made a critical mistake, then there is no point in carrying out a replication; it won't work anyway. The Reinhart and Rogoff paper is a great example of this.

We really should require that ALL papers in ALL disciplines allow for re-analysis; it is basically an extended, ongoing peer review process.


One of my favorite meta-analyses in this vein is the inverted funnel of Doucouliagos and Stanley [1] which investigates elasticity in the employment market. In particular, fractional change in employment / fractional change in minimum wage. The idea is that so many studies have been conducted on this topic that if you construct a scatterplot by putting N (study size) on the y axis and elasticity (outcome) on the x axis, then you get a funnel (wide at low-N, narrow at high-N) that can (arguably) reveal selection bias since selection bias will disproportionately affect the wide, low-N side of the funnel.

Of course, this technique can only be performed if you have not only large studies, but a large number of studies -- so large that you can resolve an empirical distribution of outcomes at several different N bands. It's therefore limited to a small number of topics. Still, it's neat to get quantitative insight into an effect that is usually unobservable.

[1] pdf page # 33 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.398...
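A rough sketch of how such a funnel plot is built, following the description above (the study records here are invented for illustration; the real data are in Doucouliagos and Stanley [1]):

```python
import matplotlib.pyplot as plt

# Hypothetical meta-analysis records: (study sample size N, estimated elasticity).
# Invented numbers, for illustration only.
studies = [(40, -0.90), (55, 0.40), (80, -0.60), (120, -0.35), (150, 0.10),
           (300, -0.25), (450, -0.15), (800, -0.08), (1500, -0.05), (3000, -0.03)]
n, elasticity = zip(*studies)

plt.scatter(elasticity, n)
plt.yscale("log")                 # large-N studies form the narrow top of the funnel
plt.axvline(0.0, linestyle="--")  # reference line at zero effect
plt.xlabel("estimated employment elasticity")
plt.ylabel("study size N")
plt.title("Funnel plot: asymmetry in the wide, low-N base suggests selection bias")
plt.show()
```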


I find it kind of funny that people are surprised by this.

Reading the paper, it also feels like the authors derive entirely the wrong recommendations from this result.

Seems like they have identified a set of reasonable problems they encountered in trying to reproduce these studies. Fair enough.

To think that the solution is just to add a lot of documentation requirements on the researcher to address those specific problems just seems totally naive.

This would inevitably act as a form of regulatory structure on top of all the grief people have to go through to get results 'published.' In doing so, it would also add cost and complexity to an already archaic system, and in a way that does not deal with the underlying root causes that are creating these problems.


The Federal Reserve study seems a little suspect. I'm not saying it is, but it's curious that it just happened to be 51% of the papers that were deficient. And their methodology section kind of glossed over their sampling strategy.

I wish the science community had better methods for documenting the entire scientific process and timeline. We live in an era where everyone has access to computers, but research results are still offered as paper documents. This provides a lot of opportunity to 'fudge the numbers' - to re-write the hypothesis and/or methodology to fit the intended outcome.

Wouldn't it be much better if research results were stored in a standardized file format that showed an unalterable timeline of all activity?
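One rough sketch of what an "unalterable timeline" could mean in practice (purely illustrative; this is not an existing standard or format): append-only records in which each entry commits to the hash of the previous one, so rewriting an earlier hypothesis or methodology breaks the chain and is detectable.

```python
import hashlib, json, time

def append_entry(log, kind, content):
    """Append a research-activity record that commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"time": time.time(), "kind": kind, "content": content, "prev": prev_hash}
    body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return log

def verify(log):
    """Recompute every hash; a retroactive edit invalidates the chain from that point on."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, "hypothesis", "minimum wage increases reduce teen employment")
append_entry(log, "methodology", "difference-in-differences on state panel data")
append_entry(log, "result", "elasticity estimate: -0.1 (made-up number)")
print(verify(log))   # True
log[0]["content"] = "revised hypothesis after seeing the results"
print(verify(log))   # False: the rewrite is detectable
```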



At the end of the day, people need what they feel is a better way to think and discuss before doing. It only comes down to what options they have at the moment.


Of course it's not replicable. There are simply too many external factors that can mess up any well-meaning experiment.


So basically you're saying those experiments had no predictive power in the first place?


They have some. But it's not about predictive power, it's about replicable experiments like you have in the sciences...

Economics is a useful social science (it's what I study), but it has challenges that the more established sciences don't encounter...


The first question is: did anyone claim their models would have predictive power, and what sort of uncertainty margins did they set? Economics and other social science modeling is not 'R-squared > 0.9, let's go to the pub!' (apart from the fact that a regression model is not even a model as it applies to most processes in these fields, but that's a different discussion...).


On the contrary, they predict the data they studied extremely well :)


dismal science




