
Arguably, data is just as important. Academics hoard their data and try to milk every paper they can out of it. The reward system is based on publishing as many papers as possible rather than on making a meaningful contribution.


Data is much trickier because data sources in medicine, education, or even ordinary businesses don't want the added legal weight of making data freely available.

This is obviously a shame. I was working on segmentation of open wounds, and most papers include a "we are currently in talks with the hospital to make the data available" note. If you contact the authors directly, they will tell you that their committee blocked the release because the information is too sensitive.


It seems like there can be a balance between "the results are unverifiable because no one else can touch the data" and "effectively open-source the dataset"?

Something like: "To make it easier to verify the code behind this paper, we've used <accepted standard project/practice> to generate a synthetic dataset with the same fields as the original and included it with the source code. The <data-owning institution> isn't comfortable with publishing the full dataset, but they did agree to provide the same data to groups working on verification studies as long as they're willing to sign a data privacy agreement. Send a query to <blahblahblah> ..."
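To make the idea concrete, here is a minimal sketch of what "a synthetic dataset with the same fields as the original" could look like. The schema (patient ID, age, wound area, outcome) is entirely hypothetical, invented for illustration; values are drawn from simple distributions rather than from any real data, so nothing sensitive can leak.

```python
import csv
import io
import random

# Hypothetical schema standing in for the fields of the real (private) dataset.
FIELDS = ["patient_id", "age", "wound_area_cm2", "healed"]

def make_synthetic_rows(n, seed=0):
    """Draw n fake records with the same fields as the original data.

    The values carry no information about real patients; the point is only
    that a verifier can run the paper's pipeline end to end.
    """
    rng = random.Random(seed)  # fixed seed so verification runs are reproducible
    rows = []
    for i in range(n):
        rows.append({
            "patient_id": f"SYN-{i:04d}",  # clearly-synthetic identifiers
            "age": rng.randint(18, 90),
            "wound_area_cm2": round(rng.uniform(0.5, 40.0), 2),
            "healed": rng.random() < 0.6,
        })
    return rows

def write_csv(rows):
    """Serialize the synthetic rows in the same CSV layout the code expects."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

if __name__ == "__main__":
    print(write_csv(make_synthetic_rows(3)))
```

Shipping a generator like this alongside the source code, rather than a frozen synthetic file, also documents which fields and value ranges the pipeline assumes.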


> but they did agree to provide the same data to groups working on verification studies as long as they're willing to sign a data privacy agreement. Send a query to <blahblahblah> ..."

This would add administrative overhead, and it would get shut down nine times out of ten. I understand why this might seem easy, but it really is not: you can have multiple hospitals, each with its own committee, that agreed to give the researcher their data. There is no central authority you can appeal to, much less anyone who can green-light your specific access.

As for synthetic datasets, that's basically just having tests, which was advocated for elsewhere in this thread.


> This would be administrative overhead

I didn't say it was easy; I said it strikes a balance relative to trying to openly publish the whole dataset. Yes, it obviously comes with administrative overhead. So did dealing with the initial researcher. If the institution can manage the one, it can manage the other.

> As for the synthetic datasets that's basically just having tests

An appropriate synthetic dataset would inevitably be part of a great test suite, but it's also easy to write narrow unit tests that embed, rather than stretch, the same assumptions and biases that are in the code (easy enough that even people who write code for a living do it).

An independent project/practice for synthesizing sample datasets from the real dataset lowers the bar and clarifies the best practice for releasing a dataset that a verifier could actually use to spot simple bugs, edge cases, and algorithmic issues. Ideally, this practice also nudges researchers to run their programs over generated sample datasets themselves, and to pay attention to whether the results make sense.


The reward system also prevents dead ends from being identified: there is no credit for publishing approaches that did not lead to the expected results or produced null results, nor for publishing confirmations of prior papers.

Basically, the reward system is designed to be easy to measure and administer, but it is not actually useful to the advancement of science.



