I found Zenodo to be the best out there by far in terms of ease of use and a decent amount of default storage (50 GB). We have to upload data by law in the UK due to funding requirements from the research councils, and the universities' own offerings are normally pitiful.
We recently wrote a commentary describing how to share biological data, along with some of the specialist and generalist repositories where one might do so:
Do you know if there is any standardized way to serve metadata on the papers themselves? I want to serve my findings in a way that makes them easy for others to include in meta-studies, but I'm not allowed to share the raw data (privacy reasons).
All I have found is ScienceVerse [1], which aims to develop a syntax ("a Grammar of Science").
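To make the goal concrete, here is the sort of machine-readable summary I have in mind, sketched as a schema.org Dataset description in JSON-LD (one common convention, not specific to ScienceVerse; every field value below is a placeholder):

    # Minimal sketch: publishing summary metadata as schema.org JSON-LD so
    # aggregators can index the findings without needing the raw data.
    # All values below are placeholders, not a required vocabulary.
    import json

    metadata = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Summary statistics for study X",
        "description": "Aggregated effect sizes only; raw data withheld for privacy.",
        "creator": {"@type": "Person", "name": "Jane Doe"},
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "isBasedOn": "https://doi.org/10.xxxx/example",  # placeholder DOI of the paper
        "variableMeasured": ["effect_size", "standard_error", "n"],
    }

    with open("dataset.jsonld", "w") as fh:
        json.dump(metadata, fh, indent=2)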
In addition to sites such as those listed, Universities often provide institution-level data hosting. So long as the university doesn't go under, they ought to be stable. An advantage is that there are local people who can help the researchers with the process, e.g. in setting up useful metadata and so forth.
I worry a bit about people just dumping data into large repositories, without thinking much about the format or the later uses, but only focussing on a checklist that needs to be ticked off to get that precious bean (publication) for the bean-counters (deans).
I work in research in a public university in Canada. IT is basically tech support, fix-my-email. There's no chance they would support hosting our data or any other sort of service.
The university expects that researchers self-fund their own stuff using grants. Need laptops for your grad students? Grant money. Need a server, a bunch of disks and a sysadmin to care for it? Grant money. Which is only realistic if you're a huge lab with millions a year in grant money. And even then, what happens when this grant runs out? Your new grant does not pay for hosting some 10-year-old data; all the money is earmarked for your new project (literally, it would be illegal to spend the money on another project). So the old hosting quietly goes away.
I checked this out. Turns out I was aware of the one for my university, but they only host written documents. Mostly PhD theses, a few other bits and pieces. No code or huge datasets.
Zenodo is a great place for code and datasets. For every paper we put any associated code there, along with any data that doesn't have a place in a more specialized repository:
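A scripted deposit keeps this repeatable across papers. A rough sketch against Zenodo's documented REST deposition API (https://developers.zenodo.org); the access token, file name, and metadata below are all placeholders:

    # Minimal sketch of a scripted Zenodo deposit. Assumes Zenodo's REST
    # deposition API; ZENODO_TOKEN and the file/metadata values are placeholders.
    import requests

    API = "https://zenodo.org/api/deposit/depositions"
    TOKEN = {"access_token": "ZENODO_TOKEN"}  # personal access token

    # 1. Create an empty deposition and grab its file bucket URL.
    dep = requests.post(API, params=TOKEN, json={}).json()
    bucket = dep["links"]["bucket"]

    # 2. Stream the data file into the bucket.
    with open("results.csv", "rb") as fh:
        requests.put(f"{bucket}/results.csv", data=fh, params=TOKEN)

    # 3. Attach minimal metadata, then publish to mint the DOI.
    meta = {"metadata": {
        "title": "Example dataset",
        "upload_type": "dataset",
        "description": "Data accompanying our paper.",
        "creators": [{"name": "Doe, Jane"}],
    }}
    requests.put(f"{API}/{dep['id']}", params=TOKEN, json=meta)
    requests.post(f"{API}/{dep['id']}/actions/publish", params=TOKEN)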
Dalhousie used to have an Academic Computing Services department within IT, which was designed to provide computing expertise to researchers - development of software, hosting, and related services to support research projects. I was told it was fairly unique. It's been a while since I was there but AFAIK it was axed or at least cut back as being unnecessary.
I wish. We asked our university, and they essentially wanted to charge our research group ridiculous amounts. IIRC we asked about 100 TB and their estimated cost was something like 10k euros a year.
Regarding your data-dumping comment: yes, that's certainly an issue. The problem is that researchers are required to do this, but there is no real consideration of the time it takes (it makes no difference to your career or reputation whether you publish good or bad data), it's largely outside most researchers' expertise, and universities provide little help.
> In addition to sites such as those listed, Universities often provide institution-level data hosting.
(For those that haven't spotted it, these are permitted under 'Generalist repositories')
> An advantage is that there are local people who can help the researchers with the process, e.g. in setting up useful metadata and so forth.
I helped set up such a service nearly 10 years ago, and still help run it. There undoubtedly are advantages to depositing with us for the reasons you mention, plus we permit far larger publications than most services (our largest are around 1TB).
However we are a large, general university, and so have to deal with deposits ranging from theology-related images to CT scans of fossil specimens to synthetic chemistry data, and all points in between.
Being general limits our capacity for detailed help concerning metadata and format standards for researchers since we just don't have enough data librarians with these specialisms. So my advice is to use a community established repository where available (UK Data Archive is a good example).
You are right about people just dumping data. Since 2015 (IIRC) researchers have been expected by funders and publishers to plan their data storage and ultimately make the data available. That doesn't necessarily lead to quality publications, though our reviewers try their best.
To paraphrase one researcher: "I intend to give this process the minimum required." (Happily, this is not a typical response.)
None of these solutions are ideal, although Zenodo's better than most.
As far as I can tell, they're all targeted more towards the final, authoritative release, so it seems you're still out of luck during the paper writing process.
What if I'm just trying to share a dataset/pre-trained model with remote collaborators?
I ran into this when doing some OCR experiments [1], where acquiring data and pre-trained models turned out to be the most time-consuming part of the enterprise.
This ended up adding enough additional hassle that I didn't manage to get anything really interesting going, although figuring out how to containerize other people's code was educational.
Personally, I think I'll be relying on some combination of institutional repositories + torrents/IPFS for any large datasets/models I end up releasing in the future.
In my opinion, data should be mirrored on a torrent in addition to institutional servers (which can also provide checksums). Torrents offload the bandwidth burden from the institution to users, and the data stays up as long as people keep using it.
But that probably won't happen, because torrents are a dirty word due to illegal activity, and they also mean giving up control of the data.
TLDR: re3data and FAIRsharing appear to be registries of "where can you find this repository", in case they change URI, I guess. Not so much for finding specific datasets, just hosters?
I noticed many of these repos are JavaScript-walled. Is there any kind of standard API through which you can search for repositories and fetch datasets?
If you are looking for datasets I would try DataCite (https://datacite.org). While they are primarily a DOI provider, I think it is fair to say that they are the de facto standard and a good place to start searching for datasets (https://search.datacite.org; pretty sure they have a reasonable API).
There are also services like the JISC data monitor, and of course the citation databases (e.g. Scopus) now index datasets as well.
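If it helps, DataCite's public REST endpoint can be queried directly. A minimal sketch, assuming the JSON:API response shape documented at https://support.datacite.org (attribute names may differ slightly); the query string is just an example:

    # Minimal sketch: searching the DataCite REST API for records matching a query.
    import requests

    def search_datacite(query, page_size=5):
        resp = requests.get(
            "https://api.datacite.org/dois",
            params={"query": query, "page[size]": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        for record in resp.json().get("data", []):
            attrs = record.get("attributes", {})
            titles = attrs.get("titles") or [{}]
            print(record["id"], "-", titles[0].get("title", "(no title)"))

    search_datacite("ocean temperature")

Each returned id is a DOI, which resolves to the hosting repository's landing page; actually downloading the data still depends on that repository.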