
My PhD work is trying to address this by developing better ways for scientists to record and communicate their methods/protocols. Methods sections are NOT about providing the information needed to reproduce the work; often even the supplemental information is insufficient. They are a best guess at what needs to be done, and the methods section is often massively condensed due to editor demands, citing other papers that also don't contain the relevant information because some aspect of the method changed between papers. (I could go on and on about this.)

I spent 7 years doing experimental biology (from bacteria to monkeys), and trying to replicate someone else's techniques from their papers was always a complete nightmare. Every experimentalist I talk to about this relates the same experience -- sometimes unprompted. Senior faculty tell a slightly different story, that they can't interpret the data of someone who has left the lab, but it is the same issue. We must address this; we have no choice. We cannot continue as we have for the past 70+ years -- the apprenticeship system does not scale for producing communicable/replicable results (though it is still the best way to train someone to actually do something).

EDIT: An addendum. This stuff is hard even when you assume that all science is done in good faith. That said, malicious or fraudulent behaviour is much harder to hide when you have symbolic documentation/specification of what you claim you are doing, especially if it is integrated with data acquisition systems that sign their outputs. There are still many ways around this, but post hoc tampering is hard if you publish a stream of git commit hashes publicly.
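A minimal sketch of the commit-hash idea, assuming Python and a plain git repository (the function name and file paths are hypothetical placeholders, not any particular acquisition system):

    import hashlib
    import subprocess
    from pathlib import Path

    def record_acquisition(data_path: str) -> str:
        """Hash a freshly acquired data file, commit it, and return the commit hash."""
        digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
        subprocess.run(["git", "add", data_path], check=True)
        subprocess.run(
            ["git", "commit", "-m", f"acquire {data_path} sha256={digest}"],
            check=True,
        )
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        # Publishing this hash somewhere outside your control (a public feed,
        # a timestamping service) is what makes post hoc tampering visible.
        return commit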




Wouldn't it be hard to achieve deep consistency between experiments, in so many labs around the world, with such different conditions/cultures/etc., and when the experimenters aren't experts in consistency, but in their science?

Wouldn't it be better to use something like a cloud biology model - where you define experiments via code, CROs compete on consistency (and efficiency and automation), and since they probably do a much larger volume of experiments than the regular lab, they would have stronger incentives to develop better processes and technologies?


I work at a cloud bio lab. We run all of our experiments on automation, and all protocols must be defined in code. The latter is both the power and the difficulty -- when your protocol is defined in code it is explicit. However, writing code is both new and sometimes difficult for the scientists we currently work with (molecular biology, drug discovery). I believe what we are doing is the right model, but it comes with the overhead of transitioning assays to code, so there is that against it. This is mostly just a matter of time though. Another nice thing about code is that you can't tweak it once it's running: you can define your execution and analysis up front to guard against playing with results down the road.

Now that being said, there still needs to be a significant change to how research is funded and viewed by the public, because pure tech solutions can't solve everything. Our tech can't decide what you pick to research. It can't dish out grants to the truly important research. So it will take many angles to really solve any portion of this problem.
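For anyone who hasn't seen a protocol expressed as code, here is a minimal, hypothetical sketch (the Protocol/Step classes and step names are illustrative only, not any particular cloud lab's API). The point is that every step and every quantity that matters is written down and fixed before the run starts:

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        action: str
        params: dict

    @dataclass
    class Protocol:
        name: str
        steps: list = field(default_factory=list)

        def add(self, action: str, **params) -> "Protocol":
            self.steps.append(Step(action, params))
            return self

    # A toy assay: a transfer, a mix, and an incubation, with nothing left
    # to lab folklore.
    assay = (
        Protocol("dose_response_v1")
        .add("transfer", source="stock_A1", dest="plate1_A1", volume_ul=100)
        .add("mix", well="plate1_A1", repetitions=5, volume_ul=50)
        .add("incubate", target="plate1", temp_c=37, duration_min=30)
    )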


I do agree, it seems like the right model, and it will have a large impact.

Between automating labor, economies of scale in purchasing, and access to more efficient technology (like acoustic liquid handling), etc. - isn't it just a matter of time before cloud biology becomes quite cost effective, and combined with other benefits - it would be the only way that makes sense to do research, so funding will naturally go there?

Also - do you see a way to add the extreme versatility of the biology lab into a cloud service?


> it would be the only way that makes sense to do research

There will certainly be more than just one way, although I hope cloud labs are the front runner. Also, cloud and automated are two separate concepts. We do both, but there's no reason that you can't just do one or the other. The automation is critical for reproducibility for many reasons. But I think the cloud aspect is mostly helpful from a business perspective -- it makes it easier on everyone to get up and running on our system. But there are many in-lab automation solutions that are helping fight the reproducibility crisis. And on the flip side, there are cloud labs that aren't automated.

> do you see a way to add the extreme versatility of the biology lab into a cloud service

We let you run any assay that can be executed on the set of devices that we have in our automated lab. So in that sense, yes, it's very flexible. Also, there's no need to run your entire workflow in the cloud. You can do some at home, some in the cloud. Some people even string together multiple cloud services into a workflow. See https://www.youtube.com/watch?v=bIQ-fi3KoDg&t=1682s

That being said, biology labs can be crazy places. Part of what we do is put constraints on what can be encoded in each protocol to reduce the number of hidden variables. Every parameter that counts must be encoded in the protocol, because once you hit "go" on the protocol, it could possibly run on any number of different devices each time it runs. The only constant is that the exact instructions specified in the protocol will be run on the correct device set.
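As a toy illustration of that constraint (the step names and required fields below are made up, not our actual schema): a protocol submitted as plain data can be rejected up front if any parameter that matters is missing, so nothing is left to whichever device happens to execute it.

    # REQUIRED lists the parameters that must be spelled out for each step
    # type before a protocol is accepted for execution.
    REQUIRED = {
        "transfer": {"source", "dest", "volume_ul"},
        "incubate": {"target", "temp_c", "duration_min"},
    }

    def validate(steps):
        for step in steps:
            missing = REQUIRED.get(step["action"], set()) - step.keys()
            if missing:
                raise ValueError(f"step '{step['action']}' is missing: {sorted(missing)}")

    validate([
        {"action": "transfer", "source": "stock_A1", "dest": "plate1_A1", "volume_ul": 100},
        {"action": "incubate", "target": "plate1", "temp_c": 37, "duration_min": 30},
    ])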


1. Yes, but the idea would be that if you provide a way to communicate the variables that actually matter for consistency then you can increase the robustness of a finding. If you have one lab that can _always_ produce a result, but no one else can, then clearly we do not really understand what is going on and one might not even be willing to call the result scientific.

2. Maybe not better, but certainly more result-oriented. Core facilities do exist right now for things like viral vectors and microscopy (often because you do need levels of technical expertise that are simply not affordable in single labs). If there were a way to communicate how to do experiments more formally, then the core facilities could expand to cover a much wider array of experiment types. You still have to worry about robustness, but if you have multiple 'core' facilities that can execute any experiment then that issue goes away as well. The hope of course is that individual labs as they exist today (perhaps with an additional computational tint) would be able to actually replicate each other's results, because we will probably end up needing nearly as many 'core' facilities as we have labs right now, simply because the diversity of phenomena that we need to study in biology is so high.


There are already approaches in this direction, e.g. providing a standardized experimental hardware-software interface with Antha [1]. The complexity of the problems in question (biological domain, biophysical, biochemical) is daunting - we do not understand many things, "there is plenty of room at the bottom".

[1] https://www.antha-lang.org/


How about recording videos of the lab? People trying to reproduce the experiment can just sift through the video. That may be tedious but it's far better than nothing.

Just a little metadata would help: Experiment A, Phase N, Day X


Video and photographic evidence can play a big role when we have the extra bandwidth to process such a dataset. Right now we barely have time to do the experiments, much less 'watch tape' to see how we did (maybe if scientists were paid like professional sports players...). In an ideal world we would be collecting as much data as we possibly could about the whole state of the universe surrounding the 'controlled' experiment. That said, video and photographs are very bad at communicating important parameters in an efficient way. Think about how hard it is to get information out of a youtube video if you need something like a part number. Photos do better, but if you need to copy and paste out of a photo, it will take a bit more heavy lifting to translate that into some actionable format (e.g. ASCII).


I didn't mean record it to use the data in your analysis, but record it to preserve the methods for others. If they have trouble getting part of the experiment to work, they can pull up the video and see how you did it (at least to a degree; I'm not expecting 360 video). You can't possibly record in text everything a video could capture.

Thanks for sharing your knowledge and experience in this discussion, by the way. It's what makes HN great.


Ah, yes, things like JOVE [0] are definitely useful, but they don't seem to scale to the sheer number of protocols that need to be documented (e.g. a single JOVE publication is exceedingly expensive). I have also heard from people who have tried to record video of themselves doing a protocol that it is very hard to make the videos understandable for someone else. That said, if the 'viewer' is highly motivated, videos of any quality could be invaluable. Sometimes it is just better to buy the plane tickets and go directly to the lab of the person who can teach you (if they are still around).

0. https://www.jove.com/


> if the 'viewer' is highly motivated videos of any quality could be invaluable

That's what I meant. Just stick some cameras in the ceiling (or wherever is best) and capture what you can. It seems cheap and better than nothing, but I know nothing about biological research.


True, biology is a horror show full of surprises. Many experimental instructions are as reliable as astrological forecasts. I guess that is the price of complexity and the human factor.

I would love to hear more about your work, and the strategies you propose to improve the reproducibility of scientific experiments. My email can be found in my user description.

Cheers!


Have you looked at the Common Workflow Language (CWL)? See below.

A friend of mine has been experimenting with wrapping it all up in Docker containers! :-)

> a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry.

http://www.commonwl.org
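For a concrete feel, here is a minimal sketch along the lines of the CWL user guide's echo example, run with the reference cwltool runner (file names are placeholders; real tools declare outputs and often a DockerRequirement so the software environment travels with the workflow):

    import subprocess
    from pathlib import Path

    import yaml  # PyYAML

    # A minimal CommandLineTool description: run `echo` with one string input.
    tool = {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {"message": {"type": "string", "inputBinding": {"position": 1}}},
        "outputs": [],
    }
    job = {"message": "Hello, reproducible world"}

    Path("echo.cwl").write_text(yaml.safe_dump(tool))
    Path("echo-job.yml").write_text(yaml.safe_dump(job))

    # cwltool is the reference runner; other engines (Toil, Arvados, ...) accept
    # the same description unchanged, which is the portability point above.
    subprocess.run(["cwltool", "echo.cwl", "echo-job.yml"], check=True)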



