I found Zenodo to be the best out there by far in terms of ease of use and a decent amount of default storage (50 GB). We have to upload data by law in the UK due to funding requirements from the research councils, and the universities' own offerings are normally pitiful.
We recently wrote a commentary describing how to share biological data, along with some of the specialist and generalist repositories where one might do so:
Do you know if there is any standardized way to serve metadata on the papers themselves? I want to serve my findings in a way that makes them easy for others to include in meta-studies, but I'm not allowed to share the raw data (privacy reasons).
All I have found is ScienceVerse [1], which aims to develop a syntax ("a Grammar of Science").
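To make the goal concrete, here is the sort of machine-readable summary I have in mind, sketched as a schema.org Dataset description in JSON-LD (one common convention, not specific to ScienceVerse; every field value below is a placeholder):

    # Minimal sketch: publishing summary metadata as schema.org JSON-LD so
    # aggregators can index the findings without needing the raw data.
    # All values below are placeholders, not a required vocabulary.
    import json

    metadata = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Summary statistics for study X",
        "description": "Aggregated effect sizes only; raw data withheld for privacy.",
        "creator": {"@type": "Person", "name": "Jane Doe"},
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "isBasedOn": "https://doi.org/10.xxxx/example",  # placeholder DOI of the paper
        "variableMeasured": ["effect_size", "standard_error", "n"],
    }

    with open("dataset.jsonld", "w") as fh:
        json.dump(metadata, fh, indent=2)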
In addition to sites such as those listed, Universities often provide institution-level data hosting. So long as the university doesn't go under, they ought to be stable. An advantage is that there are local people who can help the researchers with the process, e.g. in setting up useful metadata and so forth.
I worry a bit about people just dumping data into large repositories, without thinking much about the format or the later uses, but only focussing on a checklist that needs to be ticked off to get that precious bean (publication) for the bean-counters (deans).
I work in research in a public university in Canada. IT is basically tech support, fix-my-email. There's no chance they would support hosting our data or any other sort of service.
The university expects that researchers self-fund their own stuff using grants. Need laptops for your grad students? Grant money. Need a server, a bunch of disks and a sysadmin to care for it? Grant money. Which is only realistic if you're a huge lab with millions a year in grant money. And even then, what happens when this grant runs out? Your new grant does not pay for hosting some 10-year-old data; all the money is earmarked for your new project (literally, it would be illegal to spend the money on another project). So the old hosting quietly goes away.
I checked this out. Turns out I was aware of the one for my university, but they only host written documents. Mostly PhD theses, a few other bits and pieces. No code or huge datasets.
Zenodo is a great place for code and datasets. For every paper we put any associated code there, along with any data that doesn't have a place in a more specialized repository:
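A scripted deposit keeps this repeatable across papers. A rough sketch against Zenodo's documented REST deposition API (https://developers.zenodo.org); the access token, file name, and metadata below are all placeholders:

    # Minimal sketch of a scripted Zenodo deposit. Assumes Zenodo's REST
    # deposition API; ZENODO_TOKEN and the file/metadata values are placeholders.
    import requests

    API = "https://zenodo.org/api/deposit/depositions"
    TOKEN = {"access_token": "ZENODO_TOKEN"}  # personal access token

    # 1. Create an empty deposition and grab its file bucket URL.
    dep = requests.post(API, params=TOKEN, json={}).json()
    bucket = dep["links"]["bucket"]

    # 2. Stream the data file into the bucket.
    with open("results.csv", "rb") as fh:
        requests.put(f"{bucket}/results.csv", data=fh, params=TOKEN)

    # 3. Attach minimal metadata, then publish to mint the DOI.
    meta = {"metadata": {
        "title": "Example dataset",
        "upload_type": "dataset",
        "description": "Data accompanying our paper.",
        "creators": [{"name": "Doe, Jane"}],
    }}
    requests.put(f"{API}/{dep['id']}", params=TOKEN, json=meta)
    requests.post(f"{API}/{dep['id']}/actions/publish", params=TOKEN)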
Dalhousie used to have an Academic Computing Services department within IT, which was designed to provide computing expertise to researchers - development of software, hosting, and related services to support research projects. I was told it was fairly unique. It's been a while since I was there but AFAIK it was axed or at least cut back as being unnecessary.
I wish. We asked our university, and they essentially wanted to charge our research group ridiculous amounts. IIRC we asked about 100 TB and their estimated cost was something like 10k euros a year.
Regarding your data-dumping comment: yes, that's certainly an issue. The problem is that researchers are required to do this, but there is no real consideration of the time it takes (it makes no difference to your career or reputation whether you publish good or bad data), it's largely outside most researchers' expertise, and universities provide little help.
> In addition to sites such as those listed, Universities often provide institution-level data hosting.
(For those that haven't spotted it, these are permitted under 'Generalist repositories')
> An advantage is that there are local people who can help the researchers with the process, e.g. in setting up useful metadata and so forth.
I helped set up such a service nearly 10 years ago, and still help run it. There undoubtedly are advantages to depositing with us for the reasons you mention, plus we permit far larger publications than most services (our largest are around 1TB).
However we are a large, general university, and so have to deal with deposits ranging from theology-related images to CT scans of fossil specimens to synthetic chemistry data, and all points in between.
Being general limits our capacity for detailed help concerning metadata and format standards for researchers since we just don't have enough data librarians with these specialisms. So my advice is to use a community established repository where available (UK Data Archive is a good example).
You are right about people just dumping data. Since 2015 (IIRC) researchers have been expected by funders and publishers to plan their data storage and ultimately make the data available. That doesn't necessarily lead to quality publications, though our reviewers try their best.
To paraphrase one researcher: "I intend to give this process the minimum required." (Happily, this is not a typical response.)
None of these solutions are ideal, although Zenodo's better than most.
As far as I can tell, they're all targeted more towards the final, authoritative release, so it seems you're still out of luck during the paper writing process.
What if I'm just trying to share a dataset/pre-trained model with remote collaborators?
I ran into this when doing some OCR experiments [1], where acquiring data and pre-trained models turned out to be the most time-consuming part of the enterprise.
This ended up adding enough additional hassle that I didn't manage to get anything really interesting going, although figuring out how to containerize other people's code was educational.
Personally, I think I'll be relying on some combination of institutional repositories + torrents/IPFS for any large datasets/models I end up releasing in the future.
In my opinion, data should be mirrored on a torrent in addition to institutional servers (which can also provide checksums). Torrents offload the bandwidth burden from the institution to users, and the data stays up as long as people keep using it.
But that probably won't happen, because torrents are a dirty word due to illegal activity, and they also mean giving up control of the data.
TLDR: re3data and FAIRsharing appear to be registries of "where can you find this repository", in case they change URI, I guess. Not so much for finding specific datasets, just hosters?
I noticed many of these repos are JavaScript-walled. Is there any kind of standard API through which you can search for repositories and fetch datasets?
If you are looking for datasets I would try DataCite (https://datacite.org). While they are primarily a DOI provider, I think it is fair to say that they are the de facto standard and a good place to start searching for datasets (https://search.datacite.org; pretty sure they have a reasonable API).
There are also services like the JISC data monitor, and of course the citation databases (e.g. Scopus) now index datasets as well.
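If it helps, DataCite's public REST endpoint can be queried directly. A minimal sketch, assuming the JSON:API response shape documented at https://support.datacite.org (attribute names may differ slightly); the query string is just an example:

    # Minimal sketch: searching the DataCite REST API for records matching a query.
    import requests

    def search_datacite(query, page_size=5):
        resp = requests.get(
            "https://api.datacite.org/dois",
            params={"query": query, "page[size]": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        for record in resp.json().get("data", []):
            attrs = record.get("attributes", {})
            titles = attrs.get("titles") or [{}]
            print(record["id"], "-", titles[0].get("title", "(no title)"))

    search_datacite("ocean temperature")

Each returned id is a DOI, which resolves to the hosting repository's landing page; actually downloading the data still depends on that repository.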