<quote>This looks fun and certainly simple - but I would guess that for many, the actual training of the model is not the show-stopper before "automated, data-driven decisions and data-driven applications are going to change the world."</quote>
Totally agree. Whenever I train a machine learning model (for a ranker or a classifier), I spend most of the time building the workflow to generate the datasets and to extract and compute the features. I haven't yet found a good open source product that takes care of that; the last time I worked on something ML-related I relied on Makefiles and a few Python scripts to distribute the computation across a small cluster. I needed a more powerful tool for that, so in my spare time I've tried to build something similar to what I have in mind. I came up with a prototype here: https://bitbucket.org/duilio/streamr . The code was mostly written on the first day; then I did a few commits to try out how it could work in a distributed environment. It is at a very early stage and needs a massive refactoring - it is just a proof of concept. I'd like my workflows to look like https://bitbucket.org/duilio/streamr/src/26937b99e083/tests/... . The tool should take care of distributing the workflow nodes and caching the results, so that you can slightly change your script and avoid recomputing all the data. I hadn't used Celery before; maybe much of what I did for this prototype could have been avoided (e.g. the storage system could have been implemented as a Celery cache).
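To make the caching idea concrete, here is a minimal sketch (this is not the actual streamr API - the `node` decorator, cache layout, and toy steps are all hypothetical): workflow steps are plain functions whose results are memoized on disk, keyed by a hash of their inputs, so rerunning the script skips any step whose inputs haven't changed.

    # Hypothetical sketch of a cached workflow; not the streamr API.
    # Each step's output is pickled to disk, keyed by the step name and
    # a hash of its arguments, so unchanged steps are not recomputed.
    import hashlib
    import os
    import pickle

    CACHE_DIR = ".workflow_cache"

    def node(func):
        """Wrap a workflow step so its result is memoized on disk."""
        def wrapper(*args):
            os.makedirs(CACHE_DIR, exist_ok=True)
            key = hashlib.sha1(
                (func.__name__ + repr(args)).encode("utf-8")
            ).hexdigest()
            path = os.path.join(CACHE_DIR, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper

    @node
    def load_dataset(source):
        # Stand-in for reading raw data from disk or a database.
        return [{"text": "example %d from %s" % (i, source)} for i in range(1000)]

    @node
    def extract_features(rows):
        # Stand-in for feature computation, usually the expensive step.
        return [{"length": len(r["text"])} for r in rows]

    @node
    def train_model(features):
        # Stand-in for fitting a ranker or classifier.
        return {"mean_length": sum(f["length"] for f in features) / len(features)}

    if __name__ == "__main__":
        rows = load_dataset("train.tsv")
        feats = extract_features(rows)
        model = train_model(feats)
        print(model)

A real tool would also have to hash the step's code (so editing a function invalidates its cache) and ship each node to a worker, e.g. as a Celery task, but that's roughly the shape I'm after.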
Cool project on your side of things. I'm spending a bit of time myself trying to put together a compositional tool chain for machine learning tasks. Are there any major design choices you've thought through for your own project that you'd care to expound upon?