
I still don't understand: Is there any good reason that raw data should not be published along with each paper?



An old concern is that someone comes along, takes your data, and then

a) shows that you have an error somewhere and disproves your paper, making you look stupid, or

b) does something better with your data than your paper, making you look stupid, or

c) runs the same follow-up study you are currently running on that data and publishes before you have the chance

In my experience usually none of this happens, UNLESS you're in a really cut-throat/hotly debated/bleeding edge corner of science, which is fairly rare.

c) is actually problematic if you told your funding body that you're going to generate X studies with the data they paid for, since getting scooped on one means you can only publish X-1 studies. I've seen this happen with biologists not releasing their genome assemblies with the assembly paper so they could run basic follow-up studies using that assembly as reference before anyone else can do it.


It depends on the situation. I work with a professor who came from industry. He once worked on a project where they did massive data collection inside the military. Over $1 million was spent over 2 weeks collecting data. A couple months after, some researcher called and said "Hey, I heard about the data you collected. Would you mind passing it on to me?". Of course the answer was no, not until we publish damn near everything we can out of it.

Another consideration is the amount of data. Sometimes you're talking about dozens or hundreds of gigabytes of data. Start multiplying this by the scale of universities and you come up with a large expense to store massive amounts of data, most of which will be downloaded no more than a couple of times. On top of that, the data must be cleaned and made anonymous, and if the researchers make a mistake in the process, there could be large liability issues. See the AOL search data leak: https://en.wikipedia.org/wiki/AOL_search_data_leak


Where's the problem? If that $1 million came from public sources, then the data should have been shared from day one. If others use the data they should put the originators on the data as coauthors on the resulting papers.


> Where's the problem? If that $1 million came from public sources, then the data should have been shared from day one.

I'm not saying the data should never be shared. Oftentimes you can contact the author of a particular study and they will provide anonymized data. Most of the time people don't ask, so taking the time to prepare data for public release would offer little to no benefit.

I do believe there should be "first rights" to publish on data collected with funding from public sources. As for an exact policy by which it could be handled, I don't have one in mind. When you spend dozens of hours writing grant proposals only to have most of them turned down, it would be a sucky situation to have to make the data available for everyone before you have gotten a single paper out of it.

> If others use the data they should put the originators on the data as coauthors on the resulting papers.

It may vary by field, but generating the data alone is almost certainly not sufficient for co-authorship. [1]

[1] http://www.icmje.org/recommendations/browse/roles-and-respon...


> generating the data alone

Manus manum lavat (one hand washes the other).


In some of my research the participants are young children or are disclosing sensitive social information during the study, so we typically cannot share the data in fine-grained detail. It can be difficult to ensure confidentiality will be maintained for the participants.

I usually do my best to find a subset of variables that can be released in an anonymized form and still capture the true essence of the results and make that available. This isn't always easy to do though depending on the study details.


> I still don't understand: Is there any good reason that raw data should not be published along with each paper?

Confidentiality? What if correlations give away subjects' identities?


Most psychological researchers de-identify subject data, since otherwise you can only show it to a very small group of people, which might prevent you from, e.g., getting statistical help from someone outside the project.


Does a substantial portion of the subjects actually care about this?


Even if they don't care, researchers have a responsibility to participants to protect their identity. Your participation in a study may have no big effect right now, but down the line it could. If your project has federal funding and involves research on people, there are ethical requirements and trainings required before starting the study.

What if you completed a simple survey for a study and identified yourself as a member of a group (Christian, Jewish, Mexican, African American, etc.) and 5 years down the line, there was a big rounding up of your group? Suddenly your participation 5 years ago is leading to terrible life changes.

Please get to know participant protections more. They are very important for the integrity of the data used in academia. The NIH has strict policies to ensure the safety of participants.

The Institutional Review Board (IRB) reviews studies for risks participants could face as a result of participating. Every person who has access to the de-anonymized data must complete an NIH training course.

Depending on your institution, the process varies in seriousness. Having worked on 2 submissions and helped on ~3 other IRB-approved studies, I can say it can be tedious. My institution takes 6+ weeks from submission to approval, which can really delay things. At the same time, it's a vital part of the process to ensure validity.


> Does a substantial portion of the subjects actually care about this?

I don't think that's how laws or ethics rules work, for one. And I wasn't talking about a particular experiment, for two.


I've found I'm exceptional in being at all concerned about privacy. I think that if you asked, most people would only want minimal effort put towards confidentiality, especially since the trade-off is that other researchers won't be able to double check the analysis.


I've worked as a journalist, and my (and many other data journalists') modus operandi was to publish the data, often because the data was public record anyway. I now work in academia and the mentality is significantly different. Some of it is logistics: I would say most traditional news organizations do not have the internal incentive or habit to figure out a way to publish data, whereas at newer organizations, such as 538 [0] and Buzzfeed News [1], the data teams have editors to whom open-source and digital publishing is more the natural way of things.

But in academia, there are also set rules and precautions governing every study. I haven't proposed any research yet but my understanding is that if your study requires collecting data from participants, the Institutional Review Board requires you to be very clear to participants about privacy and confidentiality and that you follow the guidelines to the letter.

Additionally, there are datasets only available to academics that aren't available to non-academics (i.e. journalists), which speaks to the expectation that academics be very mindful about confidentiality promises.

[0] https://github.com/fivethirtyeight/data

[1] https://github.com/BuzzFeedNews/everything


To add a bit more for people who haven't been involved in an IRB:

The strictness varies by university. Ultimately the IRB is there to ensure safety of the participants and minimize the chance of negative consequences. A couple examples I've heard from around my university include...

A study was taking place outside. There was a chance of a bee sting occurring to the participant. The IRB required that the researcher have an epi-pen ready just in case along with any required training.

A paper used in a study had the wrong stamp on it (not the most recent IRB stamp). The document had not changed from the last time, but the rule was simple. Every document presented to the participant had to have the most recent stamp.

On a study in which it was expected that 1/2 of the participants would feel nauseous, they had to provide a place to sit, water, and small snacks.

And perhaps most importantly, you can quit any study you're participating in at any time, for any reason. Compensation is figured out ahead of time, and you're going to get lots of questions from IRB if you want to do anything to reduce the compensation for leaving part way through the experiment.


> I've found I'm exceptional in being at all concerned about privacy. I think that if you asked, most people would only want minimal effort put towards confidentiality, especially since the trade-off is that other researchers won't be able to double check the analysis.

Like I said, that's not how laws work.


There is also the question of how honest people will be when they are anonymous versus not. If you were studying cocaine use in America, publishing your identified data (to enable double-checking the analysis) would give the cops a list of people who admitted to illegal drug usage.

The current system with IRB allows for the collection of identified data. It also allows the data to be released, in full, as long as there is no way for an individual to be identified specifically. In some cases you can simply swap names for a participant ID (after randomly sorting the list). In other cases you have no choice but to publish only some summary statistics, because the data you collected is only applicable to a very small number of people in a small region, or there is enough data that people could reverse-engineer likely participants of the study.
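For anyone curious what the "swap names for a participant ID after randomly sorting the list" step might look like, here is a rough Python sketch. Everything in it is made up for illustration: the field names (`name`, `age_band`, `region`, `answer`), the `P0001`-style ID format, and both helper functions are my assumptions, not any IRB's required procedure. The second helper is a crude check for the "very small number of people" problem: it reports the size of the smallest group sharing the same combination of attributes, and a result of 1 means somebody is unique on those columns.

```python
import random
from collections import Counter


def pseudonymize(records, name_key="name", seed=None):
    """Return new rows with the name dropped and a participant ID added.

    The rows are shuffled first so the ID carries no information about
    the original order (e.g. alphabetical or enrollment order).
    """
    rng = random.Random(seed)
    shuffled = list(records)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    released = []
    for i, rec in enumerate(shuffled, start=1):
        row = {k: v for k, v in rec.items() if k != name_key}
        row["participant_id"] = f"P{i:04d}"  # hypothetical ID format
        released.append(row)
    return released


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values(), default=0)


# Hypothetical survey data, invented for this example.
records = [
    {"name": "Alice", "age_band": "30-39", "region": "North", "answer": "yes"},
    {"name": "Bob", "age_band": "30-39", "region": "North", "answer": "no"},
    {"name": "Carol", "age_band": "40-49", "region": "South", "answer": "no"},
    {"name": "Dave", "age_band": "30-39", "region": "South", "answer": "yes"},
]

released = pseudonymize(records, seed=42)
risk = smallest_group(released, ["age_band", "region"])
```

Swapping the names alone is not enough when `smallest_group` comes back tiny: here Carol and Dave are each the only person in their age band and region, which is exactly the case where you would probably fall back to summary statistics.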

Science is about self-correcting when errors are made. There are plenty of reasons to criticize the current incentive structure for academics, such as there being little funding to replicate other studies, but doing away with confidentiality is absolutely not the way to go about it.


Even when studying STDs, or the effect of child sexual abuse, or how couples deal with infidelity in a marriage, or...?


Why should it only matter if a substantial number of subjects care about confidentiality?

The danger for some people is death or worse. Why are you dismissing that risk?


You got the question backwards.

If enough subjects don't care about confidentiality, you could use only them and publish the data.


Doing so would be publishing non-representative data.

Imagine an identified versus anonymous survey about cocaine usage in America. People will likely answer honestly if it is anonymous.

If it is identified, most people who don't use cocaine would be fine with that and would mark no usage. People who do use cocaine would be more likely to not participate or to lie in their answer.


There are sometimes ethics rules on how long data is stored. If there's a commitment to destroy the raw data after a period, e.g., five years, often it can't be publicly released (even anonymised), or there'd be no way of ensuring it's been destroyed when the time is up.



