Groundhog: Addressing the Threat That R Poses to Reproducible Research (datacolada.org)
161 points by snakeboy on Jan 7, 2021 | 120 comments



Either I'm misunderstanding or this is a non-problem. You can specify older versions of a package when you install it. You can also manage them with packrat. As long as researchers share their language and package versions, you can fully reproduce their environment. (And the base language is really stable, almost to a fault.)
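
For instance, a minimal sketch of the first point, assuming the `remotes` package (package name and version here are just placeholders):

    # Install a specific older release from the CRAN archive.
    # Note: as far as I know, dependencies are still resolved to their
    # current versions unless you pin those too.
    install.packages("remotes")
    remotes::install_version("dplyr", version = "0.5.0")
    library(dplyr)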

This is just a bad way for the author to promote their own library for dealing with this. The way their library seems to approach this (using dates instead of versions) seems horrible too - on any given date I can have a random selection of packages in my environment, some of them up-to-date, some of them not. So unless all researchers start using the author's library (and update to the latest versions of everything just before they publish), it's only making things worse and not really solving the problem it claims to solve.


The impression I get is that this tool has a forensic bent to it. You ask for the code for a paper and Joe grad student with no programming knowledge emails you a zipped folder of R scripts. He just finished his dissertation, is starting a new job somewhere on the west coast, and no longer has access to the computer in his old advisor's lab where he did the work. The implementation may (or may not) be lousy, but the use case sounds plenty valid.


Yeah, this seems half-thought-through. renv works at project level and isolates the dependencies of a project from your main library. groundhog.library() tramples over your library installing multiple versions. It also has the "cute" feature of auto-installing libraries if they aren't on your system already. Yuck. If you really wanted this script-only solution then you could go with the `versions` library, which already lets you specify an installation date.[1]

[1]: https://cran.r-project.org/web/packages/versions/index.html
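
If I read its docs right, usage with `versions` is along these lines (a sketch; package names and dates are placeholders):

    # Install packages as they stood on the MRAN snapshot of a given date...
    library(versions)
    install.dates(c("dplyr", "ggplot2"), dates = "2017-10-01")
    # ...or pin explicit version numbers instead:
    install.versions("dplyr", versions = "0.5.0")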


Fully agreed. Add `sessionInfo()` output as an appendix to your publication. Should not be too hard to rebuild from that.
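
E.g. something as simple as:

    # Dump the R version, platform, and attached package versions to a
    # text file that can be pasted into the appendix.
    writeLines(capture.output(sessionInfo()), "sessionInfo.txt")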


Correct me if I'm wrong, but specifying an older package version in R still pulls the newest packages from CRAN for any dependencies, which is a quick way to run into a load of incompatibilities.

I've not tried renv yet but packrat was a pretty poor solution.


If you need to use an older version of a package and don’t have a packrat/renv lockfile already, then packrat/renv are not going to help you. mran/checkpoint could though.

I agree packrat (which I created) was a poor solution for most users. renv is far better and more usable.


You can specify version numbers in renv: you take snapshots of your dependencies into a lock file and can always restore from there or make a new snapshot.
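
The workflow is roughly (these are renv's actual entry points; check the docs for details):

    renv::init()      # create a project-local library for this analysis
    # ...install and use packages as usual...
    renv::snapshot()  # record the exact versions in renv.lock
    renv::restore()   # later, or on another machine: reinstall exactly those versions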


I fully agree. With version numbers you can have a good sense of whether a package update breaks your code or not (as long as the package authors follow semantic versioning).

I think in Julia this problem is solved quite nicely with the Project.toml (the list of packages you directly depend on) and the Manifest.toml (the version numbers of the complete dependency tree, which is generated automatically).

It seems that in groundhog you declare only direct dependencies. Is there a way to store the full dependency tree in R?


Using renv (or packrat) generates a renv.lock file with the version of every dependency.
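
If you don't want a full renv project, here's a bare-bones sketch that at least records the versions of the attached packages (a flat listing, not a real lockfile):

    pkgs <- sessionInfo()$otherPkgs            # packages attached via library()
    if (!is.null(pkgs)) {
      vers <- vapply(pkgs, function(p) p$Version, character(1))
      write.csv(data.frame(Package = names(vers), Version = vers),
                "package-versions.csv", row.names = FALSE)
    }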


Not really relevant, but just to note that packrat has been soft-deprecated and is superseded by renv, which comes standard with RStudio.


Packrat is deprecated now. It is recommended to use renv instead.


Thanks, I just checked renv out (I haven't had to work with R in over a year). Renv looks much better than packrat at a first glance.


Look at recent RStudio videos on YouTube, where they explain why they moved away from packrat.


Apart from all the other considerations and problems with various types of package management, consider this:

"Update January 6th, 2021 A reader alerted me to a bug with the current groundhog (version 1.1.0) where you cannot set the groundhog library to be a folder containing spaces in the name."

So we are talking about software here that somehow made it to version 1.1 *without anyone ever having used it with a directory containing spaces*. This can be interpreted in two ways: either very few people have spaces in their paths, or very few people have actually ever even tried (not even really used, I'm only talking about the most basic trial use) this package. I'm not a betting man, but if I were, I know where I'd put my money...


As I can see from the researchers in our cluster and my own academic research, most people still avoid spaces in paths and files like the plague.

YMMV of course.


As a Linux user I can relate to that. I always avoid spaces in folders and filenames as they make it more annoying to manipulate them using command line tools. Years later I carried this habit to whatever OS I am using.


If my own hobby python projects are anything to go by, there aren’t even folders ;-)

I have a friend who taught herself R for her research and it was basically one big procedural codebase.


Best way to know where every bit of code is: put it all in one source file.

Sarcasm aside, I've worked with codebases like that: thousand-line Java methods and classes and the like. The problem is that there's nothing that really forces modularity on a codebase. There isn't even a consensus, objective way to modularise code; otherwise a machine could do it and we wouldn't have this kind of problem. But a machine cannot, and so we do.


Of course, and so do I. But nobody ever even encountering the situation and/or bothering to report it, that's a whole different matter.


My guess is people are encountering the situation, working around it, and calling it a day. Maybe a little note here and there, but I don't think someone would report it, for a couple of reasons.

First of all, I don't think people report this type of stuff because they don't know how to report it; and secondly, they think it doesn't need to support this use case anyway, since spaces are a latecomer to the naming-and-path game.


Don't remember the source and probably misquoting, but I like this truism: there's software that people complain about and software that nobody is using.


The original quote is from Bjarne Stroustrup, the creator of C++. The quote also doesn't apply here. (You can't just use it to excuse any problem with software that you come across). The author of the article and the library in it just seems out of their depth in many ways.


> there's software that people complain about and software that nobody is using.

> The original quote is from Bjarne Stroustrup, the creator of C++

I find this ironic, given the 'popularity' (either way) of C++


I don't think it's ironic, the quote directly addresses the many criticisms towards C++.


ah whoops- completely misread it


> This can be interpreted in two ways: either very few people have spaces in their paths

it's been years since I've seen anyone doing that - a main reason is that a very widely used dev tool, make, does not handle spaces in paths:

http://savannah.gnu.org/bugs/?712

thus leading to inertia in the whole ecosystem - if make does not support spaces in paths, why bother


> So we are talking about software here that somehow made it to version 1.1 without anyone ever having used it with a directory containing spaces.

This is extremely common, especially on Linux. Basically anything that uses things like Bash or CMake will almost certainly not work in directories containing spaces.

Developers don't use paths containing spaces because it causes so many issues with badly written Bash scripts, and as a result they don't test their code with paths containing spaces.

Bash and CMake and similar hacked together languages have very error-prone quoting rules that make it very easy to accidentally make something work with paths without spaces but fail on paths with spaces.


> Developers don't use paths containing spaces because it causes so many issues with badly written Bash scripts, and as a result they don't test their code with paths containing spaces.

It is also a PITA to use when typing in a shell, as you need two characters ( \ + space ) instead of one. So even though my scripts can handle them, I still avoid them if possible.


Some programs also use URLs, where spaces become %20.

Today I wanted to send a screenshot by mail.

Should be simple, but not with Gnome. I take the screenshot, Gnome creates a file "Screenshot from ...", but does not tell you where. Then I search for it in the file explorer, find it, copy the path. Then I paste the path into the mail program, file:///....Screenshot%20from%20. Then the mail program: "File not found"


If you start discarding software which has problems with a space in a directory name, you should start with libtool, at which point you can't build significant chunks of the Linux ecosystem.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=193163

I hit this when trying to test libgmp (as an example of an important library you would lose).

This means in practice you can't really build most software which uses configure scripts and libraries in a directory with a space -- this may well be what they are hitting.


It doesn't even seem to be on GitHub, in fact the source doesn't seem to be listed anywhere on the project website.

Which in our world would scream 'complete amateur, avoid, avoid, avoid', but perhaps it's different in the R world.


No, I think you’re correct. Incomplete source is bad in any world.

Unfortunately, it’s that world we live in for pretty much everything.

Reproducibility? What if all of the source were to depend on part of a CPU instruction set that we stop using? How long must things be reproducible? We don’t even make lab equipment exactly like we used to with the experiments our current sciences are based on.

However, I give a thumbs up to Groundhog for trying to do the right thing.


Reproducibility down to CPU bit differences is a sign that you did something wrong. Usually calculation with insufficient precision and no thought given to the range of simulation error. Simulation must be treated like a measurement, there is a maximum precision for your instrument and you have to know and apply it.

And even if you might disagree for the single-threaded case, most things running in parallel will forfeit that free lunch of bit-identical results due to timing differences.



While this specific project does have a GitHub page, the R world is 'complete amateur, avoid avoid avoid'. It's not really a 'programming language' in the way software engineers would see it. It's more a loose collection of stats functionality that is tied together with text interfaces in a way that somewhat looks like programming to the uninitiated. I mean, batch scripting is technically 'programming', and Excel (even without VBA) is technically Turing complete, but neither of those would be considered 'programming' by software engineers, at least not under an intuitive understanding of what 'programming' is. (by that I mean, it's easy to be pedantic and argue that R and batch files and Excel files are 'programming' because of [xyz] where [xyz] will probably involve real 'definitions' and selection criteria etc; but despite those tools being useful, you can't do real software engineering in them, which you sometimes want/need).


This argument seems elitist. R is more than just technically Turing complete.

It's definitely a specialized language. It's not the go-to for managing servers or anything with a lot of I/O, but it has those capabilities because they're useful for managing projects. And I'd be hard-pressed to justify using a language for statistical analysis if it doesn't focus on statistical analysis. It'd be like rolling my own cryptography.

You need to differentiate between "base R" (everything that comes with a new install) and community-contributed packages. Base R is amazingly reliable. It has detailed documentation[0].

User-package land is more of a Wild West, that's true. I would personally not use anything that's not on CRAN unless I can walk up to the maintainer's desk (in non-pandemic times).

[0] https://cran.r-project.org/manuals.html


Shrug. It's largely opinion-based, I guess. My pet peeve (which also illustrates my point, but again, in an opinion-based way): there is no documented, 'officially supported' way to get the path of the current script in R. That is not a problem for amateur programmers who don't think about things like robustness, distribution etc., and it's needlessly complicated and bolted on in SAS, too. But it's still silly and indicative of R's typical use cases. Excel is reliable and well documented too, and I still wouldn't call even complicated workbooks 'software engineering'.
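
(The usual unofficial workaround illustrates the point nicely: it only covers the Rscript case and silently fails under source(), knitr, RStudio, etc. A sketch:)

    # Recover the path of the current script when run via `Rscript script.R`.
    args <- commandArgs(trailingOnly = FALSE)
    file_arg <- grep("^--file=", args, value = TRUE)
    script_path <- if (length(file_arg) > 0) {
      normalizePath(sub("^--file=", "", file_arg[1]))
    } else {
      NA_character_  # interactive session, source(), IDE, ...: no answer
    }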

And CRAN... well... let's just say that people used to point to CPAN as a strength of Perl, too... All those sorts of archives, after the first few years (which consist mostly of contributors with deep knowledge who can produce high quality libraries), turn into dumping grounds for trivial half-assed 'libraries' under the guise of 'community contributions'. Example: try to do trivial compound interest simulations in R. So basic that it's barely worth calling 'finance'. There are (at least) three packages on CRAN that claim to do this, except that (depending on which variable in the equation you want to solve for) they all provide only part of the solution, in mostly incompatible ways. And this is because very few of the people putting code into CRAN know how to... well... write good code. This is not an indictment of those people; many of them are much more intelligent than a bunch of us combined. It's just that for them coding is a byproduct, and with good intentions they share what has been useful for them; it just leads to a situation of 'in the land of the blind, the one-eyed is king'.


> you can't do real software engineering

This is completely, 100%, absolutely wrong.

Of course you can. There's packages, with excellent software engineering structure, that are designed to include documentation and tests.

R has so much good software engineering, that clever people with no software engineering background can easily make their own packages!

And come on, the R language is a masterpiece. It's not cobbled together like JavaScript or bash. It's got an impeccable functional programming language pedigree; you can even look at the AST of a function directly inside code.

I'm not sure how you came to any of your conclusions, other than not bothering to understand the language to start. It's a beautiful language with a messy, user contributed set of stats code.


> Of course you can. There's packages, with excellent software engineering structure, that are designed to include documentation and tests.

For me, the problem with R is that the language is inconsistent. Many packages arose to address many problems, but they all feel like a hack on top of the core language. Take the whole Tidyverse: it essentially redoes R core's data frames from the ground up. Now users can choose between the core-language data frames and the Tidyverse data frames. The same holds for plotting. The core issue, I think, is that the core language misses some essential features which other languages do have nowadays, for example a type system. In R, since types are missing, everything is a table (data frame), which I find just weird.

> It's not cobbled together like JavaScript or bash.

But it's also not as good as my favorite: Julia. Comparing it to Bash is like saying that it's better than COBOL. We all know Bash is quite old, but for certain situations it just works.


The tidyverse is the benefit and the curse of metaprogramming, something that R takes from lisp, and something that has cursed (helped?) C++ since it was added.

As far as type systems go, there are really two different kinds of "types": individual objects that can have generic functions attached to them, etc. This is not as well known, and there are actually several object systems for typing:

http://adv-r.had.co.nz/OO-essentials.html

But these sorts of objects are not as commonly created by programmers, because the second kind of "type" is much more useful: data frames, which are essentially a vectorization of structs. This is what would be used in data-oriented design, which is apparently much more common in modern game design.



A further concern: the repository for this R package [1] doesn't include any test files. Am I right to think that we should be wary of R packages that don't have any unit tests?

https://github.com/CredibilityLab/groundhog


Could also be that the package manager doesn't use spaces and most people use package managers?

I.e. Maven will create a folder structure like "/home/user/.m2/repository/com/example/example.jar", which will never have spaces unless the username has spaces (can Linux usernames have spaces?).


On Unixy systems, spaces are uncommon because so little software can deal with them, so that people are trained from the very beginning to treat spaces like the plague. I do it too - I've been burned by treatment of spaces in shitty 0.x level software so many times (25+ years ago) that I now have an intuitive aversion of anything with spaces.

Spaces in filenames are a reality though, especially on Windows (where the home directory itself used to have spaces in it, and also where many home directories on corporate networks are on network drives and start with \\), and any software that can't deal with those kinds of paths has just not been exposed to much (if any) real world use. That was the point I was trying to make - software that can't handle anything but the most bog-standard path names in its core configuration is 'hey guys look at what I hacked up yesterday evening' quality at best. (yes yes it is possible to imagine exceptions, like software that is decades old and ported across platforms; I'm talking about something new that is meant to solve a general problem).


No, the R package manager can tolerate spaces in filenames.


I know of two other existing solutions to this, although I don't know enough to compare. I don't think either of these tick all the author's boxes.

Microsoft MRAN https://mran.microsoft.com/

> For the purpose of reproducibility, MRAN hosts daily snapshots of the CRAN R packages and R releases as far back as Sept. 17, 2014.

MRAN doesn't seem to be very well known or used in the R community, but I don't really know why?
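
For reference, using it is just a matter of pointing your repository at a dated snapshot URL (a sketch; the URL pattern is the one MRAN documents):

    # All subsequent installs resolve against CRAN as it looked on that date.
    options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2017-10-01"))
    install.packages("dplyr")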

Separately, Nix https://nixos.org/ also solves this problem for lots of different languages, but is difficult to get started with and still a bit rough around the edges. Probably not a good recommendation for a typical analyst or academic at this point.


The article discusses MRAN in footnote 5, when arguing against the MRAN-based 'checkpoint' approach.

Nixpkgs/NixOS is obviously a useful technology for reproducibility, but note that the output of Nix scripts can depend on the time the system was built, the contents of URLs, and the system architecture unless care is taken.


This is misleading; empirically, nixpkgs is about 99% [0] reproducible already. We know that the main variance is between language-specific behaviors; Python, Rust, and C all are prone to reproducibility problems.

In general, we want the output to depend on the system architecture and the contents of URLs. Nix uses hashes to require that URL contents don't change over time, which protects from those contents changing arbitrarily.

[0] https://r13y.com/


The current community around NixOS and Nixpkgs handles these issues just fine, but if 'just use Nix' was regarded as a magic bullet for reproducibility in science, I'm guessing it wouldn't work out so well.


Fortunately, "just use Nix" doesn't do much on its own. People usually want GCC or another complete C toolchain, a C standard library, etc. and this implies that they will use nixpkgs or one of its forks. If people try to "just use Nix" in anger, then they will almost certainly be funneled into using nixpkgs as a matter of practice.

The main problem with reproducibility in science is that most scientists are not actually interested in doing science. Of course software will not fix this problem.


So it does, I missed that!


It looks like this is much more fine grained compared to mran, i.e., with groundhog, you select the date vs with mran where you use the last (often > year old) snapshot.

mran is a great idea, and if RStudio (the de facto gatekeepers of the faith, with Hadley as the high priest) pushed to use mran, then the R community would follow suit (like they do for everything else).

This would do a lot to bring MS into the fold, which would actually be great for R.


They have their own package management library

https://rstudio.github.io/packrat/

and sell their own package management product

https://rstudio.com/products/package-manager/


Hadley works for RStudio, RStudio now have their own MRAN type mirror: https://packagemanager.rstudio.com/client/#/


MRAN takes daily snapshots, and is the repository powering this new package.


MRAN has saved my bacon more than once when I need to replicate some R environment written years ago. The package management in R really is terrible.


Reproducible? Or deterministic?

There's certainly benefits to being able to pull down research source code, and bug checking it. That's how programmers check code: tests and audits.

However I think reproducing research is more often then not done "from scratch", taking a new sample, treating it, checking results. "independent verification".

Re-using source code saves time, but I would argue not being able to shouldn't threaten reproducibility.


Ideally research does get reproduced from scratch; I think what people usually mean when they talk about the replication/reproducibility crisis in science is not being able to reproduce an experiment with new samples, independent data analysis, etc.

However, if you can't even reproduce an analysis with the authors' own data and code, that's a red flag before you even get to the starting line. Ensuring that level of reproducibility is, I think, an essential ingredient to enabling the stronger form of reproducibility.

Personally, I made the mistake during my graduate career of trying to reimplement an analysis using a certain rather complicated ML algorithm, from scratch, in a different language than the original authors had used. After struggling mightily to get it to work, I finally bothered to try to get their own code working. (I had been hesitant to do so because I wasn't proficient in the language they used, and it wasn't even clear they had released all the necessary code, aside from the core algorithm.) Once I did that, I discovered that I couldn't even get their own code working on their own data, and gave up. This was research published in Science by a group from a top-tier research university. (I don't fully blame the authors; it may well have been my own incompetence that was the issue. But it just serves as yet another illustration of how pervasive and disregarded the reproducibility issue was for a long while.)


This is why people are starting to make a difference between terms: repeatability, reproducibility, replicability.

> If you give me your code and enough information for me to produce an identical environment, or (even better) your code is insensitive to the environment, then your research is Repeatable.

> If you describe your study sufficiently well that I can re-implement your study from scratch, without looking at your code and still get the same answer, then it is Reproducible

> If I can arrive at the same conclusions as you, just from a description of its aims, then it is Replicable.

From https://academia.stackexchange.com/a/118518/15198


> Re-using source code saves time, but I would argue not being able to shouldn't threaten reproducibility.

More often than not it's not clear from a paper what exactly the authors did to achieve a specific result. Being able to exactly reproduce what previous authors did should improve reproducibility, also for new samples.


It's also standard fare for typical lab work. A good paper's methods section would contain enough detail for you to go into your lab and repeat the experiment yourself, even down to the catalog number for the reagents to order from the lab supplier. Code should be no different, that's why it's encouraged that authors submit all code used in analysis and generation of figures.


The fun thing is that there are approaches that want to go beyond this kind of methodological description of a scientific process to code [0, 1]. In general I would say that the more we can remove the human aspect and inherent ambiguity of science, the better for reproducibility. See [2] for a couple of examples.

[0] https://www.emeraldcloudlab.com/ [1] https://nextjournal.com/ [2] https://www.youtube.com/watch?v=L1UgdoP2aeg


Back when I studied this stuff there was a distinction between reproducible (rerun the analysis on the data from the original experiment and see if you get the same results; if not, there is an error in the analysis) and replicable (redo the entire experiment by taking new data and running the analysis).


That's something that eludes software people: reproducibility in science is the ability to create independent tests. Making software available, while useful, does very little for reproducibility from the scientific point of view.


It's a minimum standard, though. Of course the goal is reproducibility from a broader point of view, but that's not an excuse to do research in a one-off way where nobody is able to show how to get those numbers again, after a year or so from publication.

The coding standards are often abysmally, unexpectedly terrible. Often not even the help of the original authors is enough to be able to produce the same figures from a paper because things and settings and commands get forgotten. Some part of the analysis was done in one language, another part in Excel. Some of the code has now disappeared. Some of the libraries are no longer working. Some people left and their academic storage space was wiped and therefore the intermediate steps and results or notes are deleted. You wouldn't believe it.

Once a paper is published researchers are not really incentivized to document things or maintain the materials. They got the publication, they put it on their CV. On to the next project! No time to waste on work that's already completed. New work leads to new publications, messing around with the old code for the sake of a potential later person interested in it is a waste from the point of view of a researcher, career wise. Also most papers are never attempted to be reproduced ever.


The researcher's job is to properly do an experiment and document it in a paper. If we require more than this, then we will damage the scientific process for two reasons: (1) companies are not willing to make software developed by their researchers available, therefore they will publish even less; and (2) universities don't have the money and staff to produce and maintain software to these standards, so professors will be required to publish fewer papers.


"These standards" are pretty low. Currently it's a free-for-all chaos. Theoretically papers are reproducible from the documentation found in the paper but that is a lie. It is never reproducible just from the paper. Lots of stuff is done in the background that is not known to the reader. For all we know, they can even tweak their numbers to be 2% better and if someone can't get the results of the paper from the released code, the authors can just ignore it or say, the problem is not on their side, or that the paper numbers were generated with a slightly different code than the released version etc. I've seen this many times on Github, issues getting closed or deleted without comment etc. There is zero accountability.

It's slowly changing though but many people are grinding their teeth, because they can't torture the data as much if things are out in the open.


> "These standards" are pretty low. Currently it's a free-for-all chaos.

I disagree. It is not perfect, but it is certainly a process that enables scientific development, as it has for centuries. If we start to create more and more rules that researchers need to follow, it will become even harder to make scientific research and most institutions won't have resources to continue.


Maybe not rules, but cultural expectations among researchers. If you want your work to actually get used, make it usable. I guess it's different in different fields. I'm most familiar with CS and machine learning. There, it's more and more a community expectation to have access to the code or be very skeptical of the numerical results. It doesn't mean the other parts of the paper are also discounted, so if there are genius ideas in the explanation, it is still valuable. But people can do any number of things to squeeze out a benchmark advantage of 1% and I only trust that if I have code (sure, GPU ML code is not fully bitwise reproducible as of now, but it's being worked on. It's becoming technically possible but not everyone has heard about it or understands why). Even without bitwise repeatability, I want the code and instructions on how to run the published experiments. I don't believe benchmark claims otherwise. Simply too much noise is being pumped into the literature. Too much data torturing for career reasons, for visa reasons, for reasons of "but I must publish to not get fired and finish my PhD, and the reviewers will only let me publish if I beat the benchmark, so I'll do my analysis over and over until I get 1% better".

The culture needs to improve. Benchmarks shouldn't be everything but reviewers are inexperienced. Many reasons in many parts of the system. In other fields the issues are different. They are more about having to obtain statistically significant results or you fail your career.

It's a good thing that people are waking up to this. It's not about punishing the individual researchers, it's about our collective intellectual immune system. We can't digest this firehose of papers if it's poisoned to such an extent. It's not about charity or burden. It's about being skeptical when we know we're dealing with unreliable data. Science is a massive endeavor with massive quality differences between works and researchers and groups. Blind trust is no longer enough if you care about keeping your beliefs curated.


It's true that "reproduce" can mean different things in software and science. But having the code available to "reproduce" any plots in a paper should be a requirement for publication, imo. It certainly is the case for my papers.


There’re fields of science, like computational biology, where it’s all about the code. I wish the methods section was always 100% unambiguous, but it’s not the case. And nowadays the computational pipelines have to support analysis of up to terabytes of data. You can imagine how many dependencies such pipelines have. Sometimes I have trouble installing the software even when package manager such as anaconda is used.


This title is an exceedingly hot take for someone who wrote a new package manager.

Also, it appears that Groundhog is itself a CRAN package and the author recommends installing with install.packages(). So is the author committing to never making any backwards incompatible updates to their new package?


I think it's more like a Wayback Machine for R programs, since the author of a science paper isn't required to use groundhog. You can just provide it the date the article was published, which you already know, and it reconstructs how the program worked on that day.

Also, because groundhog isn't made for the author to use, whether or not the interface changes is irrelevant. You'll never encounter library(groundhog) in a paper.


> and it reconstructs how the program worked on that day.

It reconstructs how the fully updated version of everything worked that day which isn't necessarily the same as the researcher's environment. It's a horrible idea to use dates instead of package versions for this. The author's library doesn't solve the problem it claims to solve.


If I am understanding this correctly, the problem is that the paper authors do not provide a specific version or a package.json equivalent. In that case, using dates seem to be the only choice.


Even if that's the case, using dates isn't a solution because dates don't give you the build that the researcher used. The date of publication is different from the date when the code ran, and there is no guarantee that the researcher ran the latest version of every dependency that was available to them anyway. In fact that's very unlikely, considering that some of their libraries might require older versions. It might not even be possible to take the latest version of every package and use them in the same environment.


So, what is your better alternative then? I honestly believe using the versions available at that date is better than using the latest versions.


That's the problem. This package is very similar to Microsoft's checkpoint package which is based on Microsoft's MRAN snapshots, and this package also uses MRAN. The article explains the difference is that this package allows you to specify the date in the code itself, whereas checkpoint is used to set a whole installation to a specific date. But this is no advantage as it means code will stop working if the groundhog package changes, whereas with checkpoint a paper could just say 'use packages as of date x'.
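
For comparison, checkpoint usage is roughly a single call at the top of the project (a sketch; the date is a placeholder):

    library(checkpoint)
    checkpoint("2017-10-01")  # scans the project and installs its packages from
                              # the MRAN snapshot of that date into a dated library
    library(dplyr)            # subsequent loads resolve against that snapshot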


> So is the author committing to never making any backwards incompatible updates to their new package?

Well, yes, probably. It's not all that hard, and groundhog seems to have a fairly simple API anyways.

And groundhog still uses CRAN packages, it just brings a method of pinning them to a specific version.


Your take seems a bit 'hot' too?

How else would you install the CRAN packages without using install.packages? Unless you want them to recursively install it using groundhog, but that seems unnecessary.

As long as you have the timestamp it should work, though I assume there will be some edge case.

What you're saying is like don't use pip because you don't install it using pip? Or don't use package-lock.json because you can't install npm through npm?


> Your take seems a bit 'hot' too?

OP is not claiming that Groundhog is a threat to the R language ecosystem, whereas the author is claiming that the R language itself is a threat to science...


No, I’m saying don’t call CRAN a “threat to reproducible science” and then make your solution a CRAN package


Someone correct me if I'm wrong, but can't you copy and paste the package folder into your libpath directory, and R can load it that way without actually running install.packages()?


Usually, yes. However, it is possible for a package to have code that only runs when it is installed. If you just copy-paste, it won't be run.
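
Roughly, the copy-paste route amounts to this (a sketch; the copied folder has to be an already-installed package, e.g. taken from another machine's library):

    .libPaths()      # the directories R searches for installed packages
    # file.copy("copied/dplyr", .libPaths()[1], recursive = TRUE)
    library(dplyr)   # found via the search path, but install-time code never ran here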


While I agree with many of the negative comments here about issues with how this is implemented, the tone of some comments is... not great. To the point that I would be reluctant to share work I do in R on Hacker News, which is not helping anyone.

Just a reminder: https://news.ycombinator.com/newsguidelines.html


No kidding. Seems like an elegant solution to a potential problem to me. I've only used R at the "poke stick at numbers" level, but this would have been a useful addition to that trivial use.


I’ve yet to use it personally, but renv [1] seems to try to solve the reproducible builds problem in a way more similar to other modern package managers (e.g. by generating a lockfile).

This approach enables stricter validations against tampering with the package repositories as a hash of the package can be stored in the lockfile, however it is obviously a bit more complex to use than the groundhog approach.

[1]: https://github.com/rstudio/renv


Agreed that renv is a better solution here. Even the example code for Groundhog is not written in idiomatic R, which does not inspire confidence. Simonsohn is a legend in transparent research but not primarily a coder or software-tool contributor (take a look at the source for p-curve if you want to see what I mean), and I think a secondary threat to reproducibility is relying on tools that end up abandoned or deprecated, or for which bugs never get fixed.


>Even the example code for Groundhog is not written in idiomatic R which does not inspire confidence.

Not to mention the given example for irreproducibility in base R looks at code that would be a bug in the script for 3.6. It's only useful to keep this reproducible if I'm debugging the script.

And, in this case, anyone who's proficient with R would recognize this problem from personal experience or the many warnings in tutorials. I usually wouldn't shoot down a given example as though it disproved the existence of any example, but I don't know if there is another example. Unless old code relied on undocumented or contrary-to-documented behavior.


I came here to say this.

This seems like a non-issue given renv. And renv gives a more reproducible solution, I think, as it pins to versions, not dates.


Nothing about this is specific to R.

If you want to guarantee reproducible results you have to use a container/image with libraries added at build time. Anytime you are relying on floating versions or downloaded libraries you will have issues.


Even this isn’t enough to be reproducible for complex numeric code as switching CPU can make a big difference with small differences being amplified. Hopefully none of those cases matter but it’s hard to definitively prove that.


If the research results depend on small differences being amplified, you have a much, much bigger problem (but of course this could happen unnoticed / sloppy work).


That's true but not an excuse! It's still extremely important when assessing an anomaly. If you can say "okay this is a known-good config that gets me the numbers from the paper", it's an enormous help in uncovering what leads to issues.

If you can't even get those numbers, then you can suspect any number of things. Maybe you're not using the right data, maybe there was a typo, maybe someone fraudulently manually tweaked the numbers, maybe you forgot to do a step in the processing chain etc etc. There's no way to know what's going on if you can't even be sure how the original numbers were created.


Yeah or even just vendorize your dependencies.


There are two camps in the R world: tidyverse and base R ("tinyverse").

It's not a coincidence that the author gives an example from the tidyverse ecosystem. Authors and users of tidyverse value things like consistency and new features over API stability and backward compatibility. The base-R ecosystem is actually very stable, and so the original package manager is very simple.

With R spreading out from the academic environment, and with many new authors breaking their packages' APIs, we observe new attempts to solve the issues with dependencies (such as renv or https://rsuite.io)


Title aside, the proposed solution just:

- uses Microsoft MRAN, which did the heavy lifting of hosting archives

- uses a date instead of a version

- installs packages automatically the first time (which pacman::p_load has been doing for ages; see the sketch below) and is easier to use at script level.
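
For the last point, the pacman call in question looks like this (its documented usage, as far as I recall):

    library(pacman)
    p_load(dplyr, ggplot2)  # installs any missing package from CRAN, then attaches it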

It's no coincidence that most package-manager solutions use versions instead of dates to control the environment:

- A paper published in 2017 may use a date of 2017-10-01, but there is a high possibility that some of the dependency packages are from an earlier date, unless the author updates packages every day/week, which is not a good habit anyway because updating too frequently breaks things more frequently.

- Then how can you reproduce the environment using a date? The underlying assumption that all packages will be at their latest versions as of that date simply doesn't hold.

That's why packrat/renv etc. use a lock file to record all package versions, and why you need a project to manage libraries: you need to maintain different library environments and cannot install everything to the same location.

Yet the author treats installing all packages to a single location as a feature, since you don't need to install the same package again, and tries to avoid projects and prefer plain scripts as much as possible when doing reproducible research?


Wow, that's a lot of pessimism for a fairly elegant solution to the fact that almost no R code has package versioning defined.

I think the major sales point here is:

> A nice feature of groundhog is that it makes 'retrofitting' existing code quite easy. If you come across a script that no longer works, you can change its library() statements for groundhog.library() ones, using as the groundhog.day the date the code was probably written (say when it was posted on the internet), and it may work again.
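
So the retrofit is literally a one-line substitution per package (going by the article; the date is whatever you guess the script's vintage to be):

    # library(dplyr)                                      # original call in the old script
    groundhog::groundhog.library("dplyr", "2017-10-01")   # dated replacement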

I don't know how good packrat is nowadays. I've never met an R application that uses it, but at my old work we would take a dated snapshot of CRAN at the beginning of every new project. If we needed to update a package we could then "update CRAN" for that project. When productionising a project, it would be frozen to a date in CRAN.


>Wow, that's a lot of pessimism for a fairly elegant solution to the fact that almost no R code has package versioning defined.

This isn't true.

https://mran.microsoft.com/documents/rro/reproducibility

https://rstudio.github.io/packrat/


I'm curious whether this actually solves the problem. I understand how this assists with reproducibility of packages, but the R software itself is updated frequently, as is briefly noted in the preamble to this document. Indeed, the release notes [0] are fairly transparent about the relatively long list of changes.

Given this, it almost seems more dangerous to imply through this package that a particular date's results are reproducible, since unless the user has the same version of R, they may see different results anyway.

[0]: https://stat.ethz.ch/pipermail/r-announce/2020/000653.html


I find the miniconda docker image quite useful for making reproducible R environments.

You can install specific package versions recorded in environment.yml file.

There are probably many ways to do this but this is an approach I like.

https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-...

https://hub.docker.com/r/continuumio/miniconda


Very clickbaity headline. The problem described is real, but just as real or worse in other statistical software, so it's not 'R' as a whole that poses a threat to reproducibility.


I’ve had some brief run-ins with R, and it doesn’t surprise me that it doesn’t have a versioning story for packages, and that the patched-in system described here is based on dates rather than something like a SHA or version number...

My favorite description of the language comes from http://arrgh.tim-smith.us/:

> R is a shockingly dreadful language for an exceptionally useful data analysis environment.

I feel like this is just one more data point to support that statement.


Packrat and its successor renv are the most popular package management systems for R, and they are based on versions/SHAs and lockfiles, like most other languages today.

https://rstudio.github.io/packrat/

https://rstudio.github.io/renv/articles/renv.html


I wish the (default) utils::install.packages function could take a version number for the requested library. I also wish library() would automatically install libraries not available on the system. (Both can be achieved with custom functions that shadow the default ones, but I would like to see this functionality in the base packages.) Other than that, I think all alternatives to this "threat called R" are worse. It's telling that the author has to cite a bug from 2016 as an example of a breaking change.
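
Something like this hypothetical wrapper is what I have in mind (the name and the remotes dependency are my own choices, not base R):

    # Load a package, installing it first if missing, optionally at a pinned version.
    library2 <- function(pkg, version = NULL) {
      if (!requireNamespace(pkg, quietly = TRUE)) {
        if (is.null(version)) {
          install.packages(pkg)
        } else {
          remotes::install_version(pkg, version = version)  # assumes remotes is installed
        }
      }
      library(pkg, character.only = TRUE)
    }

    library2("dplyr", version = "0.5.0")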


> The problem is that packages are constantly being updated, and sometimes those updates are not backwards compatible.

Uh oh, someone just discovered the modern programming landscape!

Python, Node, R, Rust, and other langs/OSes with package managers are at the mercy of volunteers who keep important packages healthy. Once issues stop being fixed, y'all better have local copies. This used to be predominantly an OS issue, now it is a language issue, too.


> The problem is that packages are constantly being updated, and sometimes those updates are not backwards compatible.

> Python, Node, R, Rust,

Correct me if I'm wrong, but for binary programs, a lock file easily mitigates these issues. I know Node and Rust both support lock files.


Yes, you're right. That's what makes lock files so important. I think we're past worrying about those wheels/npms/pkgs disappearing from the internet.

My concern is more about packages going stale and not peer-matching with other packages that evolve, or major versions that change results: not so much for R packages, but there have been cases of major versions breaking existing projects, or requiring significant effort to update. (One example that zinged me is the FFI interface for Node. The "official" package hasn't been touched in years, and the "replacement", ffi-napi, still has lots of open issues. We were using in-house fixes for some time.)


His example is poor. dplyr wasn't even at 1.0 in 2016; it was only at 0.5.0: https://blog.rstudio.com/2016/06/27/dplyr-0-5-0/. Of course one might expect breaking changes in a maturing package.


Can recommend the paper "A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker" by Peikert and Brandmaier [1], which shows a much more robust approach to reproducibility.

[1] https://psyarxiv.com/8xzqy/


Thanks!


Being able to assemble a solution from parts (as in R packages) is super flexible. But complex and potentially brittle.

Reproducibility is a big problem all around. When I create releases I put the binaries as well as the source in version control, because changes in tools/libraries etc. mean that I probably won't be able to create the exact same binary several years later from the same source.

There is always a tradeoff between flexibility and simplicity. Clearly software needs to be able to change, or you are never going to be able to improve it or fix bugs. And an assembly of constantly changing parts is clearly going to come with its own challenges.

My own software product, Easy Data Transform (which competes with R to some extent) trades off some flexibility for simplicity by having a single set of binaries for each platform. You can't add any components (without hacking). So the same version of software should always give the same result.


Does R not allow you to lock into versions?

Is this person suggesting we never improve anything? :)


A very nice and useful library indeed; though I don't think the article really needed to sound so doomsdayish and apocalyptic about it.


I hope all these efforts for reproducibility and avoiding library resolution conflicts consolidate into one solution.


This language about "threat" seems a bit overblown. Especially when we ask: compared to what? Some commercial package where different versions might have different and poorly documented data storage formats? (Have you ever tried to read an old SPSS or SAS or STATA data file in any reasonable environment? It is a nightmare.) Excel??


This is not a problem that R poses; this is a problem that people pose. If you want to run the same code, you need to bundle the source code of all packages and archive it with the data. This is not an R problem at all.


seems like a fairly esoteric way to spell “lockfile with hashes”, but hey, R seems fairly esoteric to me anyway.


Adding dates to source code? No thanks. If you want reproducibility, invest in guix. Everything else is a hack.


The only working solution I've seen is using a Docker container with Jupyter Lab and all the dependencies installed. I hate pulling those huge images on my 256GB MBP, but it works. Of course, only bigger labs do that, since individual researchers are often unfamiliar with Docker.

However, if I run my software on an HPC cluster, that's no longer an option. The HPC at my university doesn't allow running Docker, only Singularity containers (which aren't supported on Mac).



