Stanford AIMI releases free open-source repository of medical datasets

heybrendan · on Aug 6, 2021

TFA:

> Democratizing the Tools

> “We love that corporations are doing all this work, but we don’t love the fact that the opportunity to share information is asymmetric,” Lungren says. “If they amass data but then lock it down, they will be the only ones who can innovate, which would shut out the important contributions by computer scientists and clinicians around the world. That’s not a position we want to be in.”

Uh huh.

Let's scrutinize the "Stanford University School of Medicine MURA: MSK Xrays Dataset Research User Agreement":

> By registering for downloads [...] you are agreeing to this Research Use Agreement [...]

Hiding downloads behind a registration. We're off to a FANTASTIC start. /s

> YOU MAY NOT DISTRIBUTE, PUBLISH, OR REPRODUCE A COPY of any portion or all of the MURA: MSK Xrays Dataset to others without specific prior written permission from the School of Medicine.

> YOU MAY NOT SHARE THE DOWNLOAD LINK to the MURA: MSK Xrays dataset to others.

You know what? I'm going to stop there. The agreement becomes increasingly hostile.

Someone please inform Stanford what "Democratizing" actually means.

It's a hard pass from me.

logimame · on Aug 6, 2021

To make things worse, the site doesn't work unless you disable uBlock Origin (and even when it works it's goddamn slow), and for some reason you get a location access request from your browser right when you click an entry (is there any reason they have to collect geographic data for a download link?).

jb_s · on Aug 6, 2021

doesn't sound very "open source movement"

nl · on Aug 6, 2021

So people are pretty down on the licensing for this. I agree it is less than ideal, and most certainly isn't open source.

But it is a tremendous step forward. Previously this type of data was extremely hard to find, and even if you had a budget it was a long, slow process finding someone who would sell you something to test hypotheses about what models could work.

With this you can work on building models that work, demo them as much as you like, and find a source that lets you train for commercial outcomes. That's very useful. Less useful than properly open data, but useful nonetheless.

teddyh · on Aug 6, 2021

> most certainly isn't open source. But it is a tremendous step forward.

And if they’d called it that and nothing else, nobody would have a problem.

alphaoverlord · on Aug 6, 2021

With AIMI, we released EchoNet-Dynamic, the largest open dataset of echocardiograms (cardiac ultrasounds) and expert cardiologist labels as part of a paper published last year. The dataset went through a rigorous review to make sure no identifying information was leaked as part of the process. Happy to answer any questions.

Areading314 · on Aug 6, 2021

The data license seems to be research-only. How would people be able to build products/medical device software with this license? Or is that not a goal of releasing this data?

alphaoverlord · on Aug 6, 2021

It’s a research dataset, similar to MNIST or CIFAR. Stanford does not want to be in the business of monetizing patient data, so it restricts commercial use.

the_optimist · on Aug 6, 2021

You just stated a paradox like it makes sense. If you _didn’t_ want to be in the business of monetizing data, while providing data, you _wouldn’t_ restrict commercial use.

elmolino89 · on Aug 6, 2021

How broad is the non-commercial use clause? I can imagine, e.g., some BigPharma buying other datasets and using your datasets for, who knows, validation of the acquired ones, metadata improvement, etc. No commercial product in the area of imaging/diagnosis, but maybe some commercial drug 10-15 years down the road. Do you think that such use is also forbidden by the licence?

Areading314 · on Aug 6, 2021

The datasets are deidentified, so that doesn't seem like a plausible rationale

nl · on Aug 6, 2021

The issue of monetizing is completely separate to de-identification.

Often sources of this type of medical data will give it to universities under a no-commercial-use condition.

Areading314 · on Aug 6, 2021

The reason they do this is so that they can later monetize the dataset.

nl · on Aug 6, 2021

Often, no.

It's because if they charge for it, the admin of having to pass money back to participating people is prohibitive.

Often you can't buy these datasets at any price.

IfOnlyYouKnew · on Aug 6, 2021

Just look at how agitated people here are getting at the prospect of GitHub Copilot using tiny code snippets from their work for potentially commercial works.

Then imagine it’s not your unique way to loop over a file in Python, but your medical information.

jbarrs · on Aug 6, 2021

I for one am pretty excited about this. I took a module in medical imaging at university and went looking for real data that I could use in order to experiment with the reconstruction algorithms we were learning about, and I struggled to find much useful data. I'm sure this will be plenty useful to those looking to experiment with AI, but I hope this will also be accessible and useful to students in relevant areas.

d4rkp4ttern · on Aug 6, 2021

Sounds like only/mainly image data? EHR/EMR (Electronic Health/Medical Records) data are super valuable to have — these are histories of patient-related readings, visits, diagnoses, treatments, etc.