Mozilla, GitHub and Figshare team up to fix the citation of code in academia

roel_v · on March 17, 2014

"For another, the DOI is a persistent link. Broken links are a growing problem for academia, as link structures are changed and online content is edited. “If your persistent link is pointing to something on Figshare, which is ‘this GitHub repository at that version, at that release,’ then even if the GitHub repository changes or Figshare changes its link structure, that DOI will always point to that object,” Mark Hahnel, founder of Figshare said."

Yeah... well, there is the Achilles heel of this whole thing. "Always" in internet time means "2 or 3 years, or until we fail to close another round of funding" in real-world time. I'm probably being a cynical old fogey again, but let's see in 5 years' time if this whole thing still exists (Mozilla, for example, doesn't have a very solid track record of keeping projects up, to put it mildly) before I start putting URLs to this in papers that people will probably still reference every once in a while 5 years from now.

Then again, if everybody thinks like I do, it'll never get off the ground - classic catch-22.

trurl42 · on March 17, 2014

DOI has existed since 1997, which is pretty much forever in internet time.

Figshare on the other hand, not so much.

capnrefsmmat · on March 17, 2014

Figshare is mirrored by several universities in the CLOCKSS scheme:

http://figshare.com/blog/Ensuring%20persistence%20on%20figsh...

Twelve research universities keep copies of all the data and will make it publicly available should Figshare implode.

Other data repositories, like Dryad, do the same thing, and grant agencies with data deposition requirements usually require you to deposit data somewhere that has a long-term plan to ensure access.

So as long as the twelve universities manage to update all the DOIs to point at the new locations, Figshare DOIs should be reliable.
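To make the indirection concrete, here is a minimal Python sketch of why a DOI can outlive its host: citations store only the identifier, and the resolver's table is the one thing that has to be updated. The `DoiResolver` class, the DOI `10.9999/example.1` and the example.org URLs are all made up for illustration; a real resolver is a network service, not an in-memory dict.

```python
class DoiResolver:
    """Toy stand-in for a DOI resolver: maps a fixed identifier to a movable URL."""

    def __init__(self) -> None:
        self._targets: dict[str, str] = {}

    def register(self, doi: str, url: str) -> None:
        # First registration of an identifier.
        self._targets[doi] = url

    def retarget(self, doi: str, new_url: str) -> None:
        # Only existing identifiers can be moved; the DOI string itself never changes.
        if doi not in self._targets:
            raise KeyError(doi)
        self._targets[doi] = new_url

    def resolve(self, doi: str) -> str:
        return self._targets[doi]


if __name__ == "__main__":
    resolver = DoiResolver()
    doi = "10.9999/example.1"  # hypothetical DOI, not a real record
    resolver.register(doi, "https://old-host.example.org/item/1")
    # The original host goes away; a mirror takes over and the record is updated.
    resolver.retarget(doi, "https://mirror.example.org/archive/1")
    print(resolver.resolve(doi))
```

The papers that cite the DOI never need to change; the weak point is whoever is responsible for calling the equivalent of `retarget`.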

CJefferson · on March 17, 2014

Yes, I don't trust either Figshare or GitHub to be here in 10 years (there was a time when I thought SourceForge would be around forever; now I keep expecting to see it disappear any day).

toomuchtodo · on March 17, 2014

Why don't they partner with the Internet Archive? It's specifically structured to be a long-term archival/reference system.

ThePhysicist · on March 17, 2014

I agree, it always makes me cringe when privately held companies want to define an "open standard" for scientific citations that (surprise!) relies completely on their proprietary infrastructure. I still remember the case of Mendeley, which promised to build an open repository for research documents, and which is now a subsidiary of Elsevier, an organization that does not really embrace "open science", to put it mildly. I think what we really need (and the article mentions this at the end) is an open standard that can be implemented by anyone, most notably universities and research institutes.

yeukhon · on March 17, 2014

The lab I work for tried to do this for a few years. But this combination (cloud computing + storing in a remote repository + trying to do reproducible science + collaborative scientific computing) is always a tough sell. The idea is neat. Everyone likes it. But transitioning to a remote platform, letting others host and keep your data, and having data that is not always accessible to the machine is a tough sell.

That's why some scientists are going to use Google's Compute Engine. They just need the machines. The researchers have their own C++ and Python scripts. They can live with some complexity. They are happy with HTCondor which is awesome for running big computational jobs.

Sharing data results with the world is awesome. But transitioning to a new platform is again a big problem.

mcguire · on March 17, 2014

Right, sharing code is always a lot more work than simply telling people what happened when you ran it.

gjuggler · on March 17, 2014

This is a very cool technical integration between two services that are — or should be — used by most scientists working in code. But what exactly is the "problem" of citation of code that this solution fixes?

Let's say you release your project to GitHub & figshare and now have a DOI in hand. What are you supposed to do with it? Do you ask your users to cite this DOI if they use your software? If so, what text should accompany the citation? How do you track citations to your code? Will they show up in Google Scholar, Scopus, Web of Science?

And what if the journal one of your users is submitting to doesn't accept figshare / github citations? It's unfortunate but true that many publishers disallow citations to unpublished / non-academic works. This is why many scientific software projects have resorted to publishing papers on their software — it's a hack to make a software project fit into the traditional social system of scientific credit.

DOIs are a technical glue that binds together the thousands of academic publishing outlets, but they do not solve the scientific or cultural issue of what is the minimum viable citable scientific product, and how those citations are generated, propagated, or valued.

Securing a DOI only solves a small slice of the problem of scientific credit — a point most colorfully expressed by this blog post from CrossRef, the largest DOI registrar for academic work: http://crosstech.crossref.org/2013/09/dois-unambiguously-and...

Fomite · on March 17, 2014

I'm rather pleased - I had emailed GitHub a month or two ago asking about the potential to get DOIs for repos, and here they are.

Worst case, GitHub and Figshare both go under and we're back to where we started. The one hesitance I have is about the Figshare/DOI'd repo being frozen in time - I keep making arguments to myself about how this is a good idea or a bad idea.

jrochkind1 · on March 17, 2014

Well, the thing with git is that it keeps history anyway. There's no need to actually freeze the repo in time -- but the DOI can be to a URL representing (and linking to) a particular moment-in-time of the repo.

Is that what they're doing? That actually seems like a pretty good idea, if it is, I hope they are! And if they're not, it would be trivial to do.

GitHub's UI still makes it easy to see what happened after (or before) that point (including the 'latest' version), but if you're citing software used as a tool for research results, it makes sense to be able to cite the actual software that really was used, not its hypothetical future evolution.
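A rough sketch of what "a URL representing a particular moment-in-time" could look like, in Python. This only builds a GitHub tree URL pinned to a full commit SHA; whether the Figshare integration does anything like this is not something I know. The helper name `pinned_tree_url`, the owner/repo names and the SHA are placeholders.

```python
import re


def pinned_tree_url(owner: str, repo: str, commit_sha: str) -> str:
    """Build a GitHub URL that shows the repository as of one specific commit."""
    # Insist on a full 40-character SHA-1 so the link cannot silently
    # follow a moving branch name such as "master".
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character lowercase hex commit SHA")
    return f"https://github.com/{owner}/{repo}/tree/{commit_sha}"


if __name__ == "__main__":
    # Placeholder values, not a real repository or commit.
    print(pinned_tree_url("example-lab", "analysis-code",
                          "0123456789abcdef0123456789abcdef01234567"))
```

A link like this still depends on GitHub keeping the repository online, which is the part an archived copy is meant to cover.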

mmcclellan · on March 17, 2014

Well, if your aim is to reproduce the results of the paper, then you really do need the frozen version. I would probably just add a note in the README that says development is ongoing and users who just want to use the code should probably clone master or some such. Hopefully, it will become common to also include automated configuration (say Ansible playbooks) with the source.

csense · on March 17, 2014

If you use git, there is a way to produce, in a single line, a permanent immutable citation to the current state of your code's master branch. Are you ready for this revolutionary command?

    git rev-parse master

Why do three large-ish organizations feel it necessary to combine their powers for something this trivial?

rspeer · on March 18, 2014

...Are you actually under the impression that you can take a hash of some code and retrieve the code from it?
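For what it's worth, that value is a SHA-1 digest. A minimal Python sketch of how git derives the ID of a single file (a blob) shows why it works as a fingerprint but not as storage. The function name `git_blob_sha1` is my own; the commit ID on master is computed the same way, but over a commit object rather than over file contents.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the object ID git assigns to a file with these exact contents."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # An empty file always gets this well-known ID.
    print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    # Any change to the contents gives a different ID.
    print(git_blob_sha1(b"x = 1\n") != git_blob_sha1(b"x = 2\n"))  # True
```

The digest is always 40 hex characters regardless of input size, so the code itself still has to be archived somewhere for the ID to be of any use.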

T-A · on March 17, 2014

Somewhat related: http://www.webcitation.org/