
I recently read that there are over 100 workflow engines [1], yet it is still often difficult to make the case that Make is insufficient.

[1] https://vatlab.github.io/blog/post/sos-workflow-engine/




It's easy to make the case that make is insufficient. There are three very distinct domains being conflated here:

1. The day-to-day experimentation and iteration by a computational researcher.
2. The repeated execution of a workflow on different data sets submitted by different people, such as in a clinical testing lab.
3. The ongoing processing of a stream of data by a deployed system, such as ongoing data processing for a platform like Facebook.

For (1), there is a crucial insight that is often missing: the unit of work for such people is not the program, but the execution. If you have a Makefile or a shell script or even a nicely source controlled program, you end up running small variations of it, with different parameters, and different input files. Very quickly you end up with hundreds of files, and no way of tracking what comes from which execution under what conditions. make doesn't help you with this. Workflow engines don't help you with this. I wrote a system some years ago when I was still in computational science to handle this situation (https://github.com/madhadron/bein), but I haven't updated it to Python 3, and I would like to use Python's reflection capabilities to capture the source code as well. It should probably be integrated with Jupyter at this point, too, but Jupyter was in its infancy when I did that.
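To make that concrete, here's a minimal sketch of execution tracking, with a hypothetical run() wrapper that is my own illustration, not bein's actual API: every invocation gets its own directory with a manifest recording the command, the parameters, and a hash of each input file.

    import hashlib, json, subprocess, time
    from pathlib import Path

    def run(cmd, inputs, params, root="runs"):
        # One directory per execution, named by timestamp.
        run_dir = Path(root) / time.strftime("%Y%m%d-%H%M%S")
        run_dir.mkdir(parents=True)
        manifest = {
            "cmd": cmd,
            "params": params,
            # Hash inputs so you can tell later if a file was edited.
            "inputs": {p: hashlib.sha256(Path(p).read_bytes()).hexdigest()
                       for p in inputs},
        }
        result = subprocess.run(cmd, capture_output=True, text=True)
        manifest["returncode"] = result.returncode
        (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
        (run_dir / "stdout.txt").write_text(result.stdout)
        return run_dir

Anything written into that run directory is then traceable to exactly one execution under exactly one set of conditions, which is the thing the pile of hundreds of loose files can't give you.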

For (2), there are systems like KNIME and Galaxy, and, crucially, they integrate with a LIMS (Laboratory Information Management System), which is the really important part. The workflow is the same every time; it's the provenance, tracking, and access control of every step of the work that matter in that setting.
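As an illustration of what per-step tracking means (the field names here are my own invention, not KNIME's or Galaxy's API), the record a LIMS keeps for each step looks roughly like:

    import time
    from dataclasses import dataclass, field

    @dataclass
    class StepRecord:
        sample_id: str    # which specimen or submission this step touched
        step_name: str    # e.g. "alignment" or "variant calling"
        operator: str     # who ran it; access control hangs off this
        inputs: list      # upstream artifacts consumed
        outputs: list     # artifacts produced
        started_at: float = field(default_factory=time.time)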

For (3), what you really want is a company-wide DAG where individuals can add their own nodes and which handles schema matching as nodes are upgraded, invalidation of downstream data when upstream data is invalidated, backfills when you add a new node or when an upstream node is invalidated, and all the other upkeep tasks required at scale. I have yet to see a system that does this seriously, but I also haven't been paying attention recently.
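For a flavor of the invalidation part, here is a toy sketch (standard library only; schema matching and backfill scheduling, the genuinely hard parts, are left out):

    from collections import defaultdict

    class Dag:
        def __init__(self):
            self.downstream = defaultdict(set)  # node -> nodes derived from it
            self.valid = {}                     # node -> is its data current?

        def add_node(self, name, upstreams=()):
            self.valid[name] = False            # a new node starts stale: needs backfill
            for u in upstreams:
                self.downstream[u].add(name)

        def mark_computed(self, name):
            self.valid[name] = True

        def invalidate(self, name):
            # Mark a node stale and propagate to everything derived from it.
            self.valid[name] = False
            for child in self.downstream[name]:
                if self.valid.get(child, False):  # stop at already-stale nodes
                    self.invalidate(child)

    dag = Dag()
    dag.add_node("raw_events")
    dag.add_node("sessions", upstreams=["raw_events"])
    dag.mark_computed("raw_events"); dag.mark_computed("sessions")
    dag.invalidate("raw_events")    # "sessions" goes stale with it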

For none of these is chaining together functions with error handling and reporting the limiting factor; it's just the first problem a programmer sees when looking at one of these domains.


Ok, but none of those situations can be addressed with zero budget. If you've got those problems to solve then you usually have a budget to build/buy an appropriate tool.

At the other end of the spectrum you have every small team with some data analysis steps producing their own workflow engine when Make would be just fine.

I agree that the streaming case in particular is a poor fit, but consider that, paired with an appropriate FUSE file system, Make can address most use cases.


> what you really want is a company-wide DAG where individuals can add their own nodes and which handles schema matching as nodes are upgraded, invalidation of downstream data when upstream data is invalidated

I've never seen this work in practice, and doubt it can work, due to the complexities involved.


It really helped in our case. We have a team of 10+ researchers who also ship code in production. They were repeatedly running into a problem where they recomputed the same data at runtime, or reinvented the wheel because they didn't know somebody had already computed that datum. I ended up writing a small single-process (for now) workflow engine running a "company-wide DAG" of reusable data processing nodes (all derived from user-submitted input + models). Now it is much easier for individuals to contribute and much easier to optimize pipelines separately. I might open-source it sometime soon.


It's what is done de facto by large enough groups anyway. They just have to kludge tooling together for it.


It's not so much that Make is insufficient; there are a huge number of reasons to use something over Make. However, it's true that the gap between Make and anything you could implement with a reasonable amount of work doesn't usually justify giving up something simple that everyone understands (like Make). The next step up that's "worth making" involves hundreds of features that go deep into territory most people don't realize exists when they start reinventing this particular wheel.


Thanks for the link! That lightweight SoS notebook seems like the sweet spot between agility and tidiness.




