The impact of Docker containers on the performance of genomic pipelines (peerj.com)
58 points by michaelhoffman on Nov 11, 2015 | 10 comments



This paper misses the bigger picture that genomics is a Big Data problem. Setting up pipelines to put together perl, bash, python, and C++ programs is not where the field will be in a few years' time.


I think you underestimate the diversity of genome research activities, technologies and methods out there :) It's such an incredibly fragmented field; sure, many ad-hoc pipelines eventually become productized beyond a pile of scripts and a dozen or so users, and there are definitely plenty of applications which demand HPC and "big data" techniques - but that describes a tiny fraction of all the research projects out there.

In any case, many parts of the field simply don't have the software engineering discipline to pull off proper "big data" workflows. Advances in commodity hardware, stronger programming tools for ad-hoc work, and "cloudification" toolchains will probably delay the maturation of a lot of work that used to require proper engineering effort.

Not to mention there's plenty of fertile ground in problems that by now can be answered with merely "annoyingly non-small" rather than "big data" techniques.


(Minor) co-author on the paper here, just wanted to second your experience in the field.

The big win I've found with Nextflow is that once you've written a workflow, you have a lot of flexibility in the execution environment: Have all the tools already installed on your workstation or large compute instance? Use the local executor to saturate the box with concurrently running jobs. Don't have or want all those tools installed? Use the local executor with Docker images. Have access to a traditional compute cluster (e.g. LSF, SGE, Torque, etc.)? Use the cluster executor with Docker images.
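To make that concrete, here's a rough sketch of what the switch can look like in nextflow.config profiles (the container image name and SGE queue below are just placeholders):

    // nextflow.config -- sketch only; image name and queue are placeholders
    profiles {
        standard {
            process.executor = 'local'            // tools already installed on the box
        }
        docker {
            process.executor = 'local'            // same machine, but tools come from an image
            process.container = 'example/rnaseq-tools:1.0'
            docker.enabled = true
        }
        cluster {
            process.executor = 'sge'              // submit each task to the scheduler
            process.queue = 'long'
            process.container = 'example/rnaseq-tools:1.0'
            docker.enabled = true
        }
    }

The workflow script itself stays unchanged; you just launch it with nextflow run main.nf -profile standard, -profile docker or -profile cluster.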

A couple other resources worth checking out:

Toil workflow engine https://github.com/BD2KGenomics/toil

Common Workflow Language (CWL) specification https://github.com/common-workflow-language/common-workflow-...


That sounds fantastic. I no longer work in bioinformatics but regularly keep in touch with some of my old colleagues. Definitely going to talk to them about this in person.


I actually found this to be a very timely paper, as the set-up of an RNA-seq workflow has recently been added to my to-do list. One thing I am not keen on is the systems administration part of the work, and I like the idea of being able to spin up machines easily using Docker.

I'm not sure about your comment about Big Data. If it turns out every hospital will have a next-gen sequencer, it seems to me the pipelines are really important, and having someone else work out the logistics sounds good to me. I'm curious as to what you consider the tougher big data problem. From where I sit, I consider this a complex data problem, and the issues of data size are simply solved by better/faster computing.


We're currently trying out https://arvados.org. It runs on Docker; you define a pipeline with JSON plus some Python soup that calls an API, which takes care of provisioning worker nodes. I hope the future will look something like that. And that somebody else will pay the cloud bill...


Thanks for this, it looks like it might tie up a few things I've been thinking about building myself (the tracking of inputs & outputs specifically).


Large-scale genomics initiatives certainly can be "Big Data", but with the cost of read generation continuing to decrease, the pipelines described in the paper are going to come into the hands of smaller groups and be applied to smaller studies. The paper's RNA-Seq pipeline is anything but big - an experiment with 20x the number of samples can be run on commodity hardware overnight - but it is exactly the type of study a small lab might want to run, and for that there's really no need for a large infrastructure setup.

There will be (are, in fact) one-click, cloud-hosted solutions for analysis, but given how quickly tools have evolved in this space, there will always be groups wanting to run on their own hardware so as to experiment with the latest developments.


Agreed.

Hope you don't mind a plug here for ADAM, a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark and Parquet.

https://github.com/bigdatagenomics/adam


This seems to be fixing the wrong problem. Packaging software is not hard, but it does need to be learnt and tutorials are scarce.
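For what it's worth, whether it's a distro package or a container image, the recipe is usually only a few lines. A rough Docker sketch, where the tool name, version and download URL are made up:

    # Sketch: package a hypothetical command-line tool; name, version and URL are placeholders
    FROM debian:8
    RUN apt-get update && apt-get install -y build-essential wget && \
        wget -q https://example.org/sometool-1.0.tar.gz && \
        tar xzf sometool-1.0.tar.gz && \
        make -C sometool-1.0 && \
        cp sometool-1.0/sometool /usr/local/bin/
    ENTRYPOINT ["sometool"]

Build it once with docker build -t sometool:1.0 . and every machine in the pipeline pulls the same image.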



