<quote>This looks fun and certainly simple - but I would guess that for many, the actual training of the model is not the show-stopper before "automated, data-driven decisions and data-driven applications are going to change the world."</quote>
Totally agree. Whenever I train a machine learning model (for a ranker or a classifier), I spend most of the time building the workflow to generate the datasets and to extract and compute the features. I haven't yet found a good open source product that takes care of that; the last time I worked on something ML-related I relied on Makefiles and a few Python scripts to distribute the computation across a small cluster. I needed a more powerful tool for that, so in my spare time I've tried to build something similar to what I have in mind. I came up with a prototype here: https://bitbucket.org/duilio/streamr . The code was mostly written on the first day; then I did a few commits to try out how it could work in a distributed environment. It is at a very early stage and needs a massive refactoring - it is just a proof of concept. I'd like my workflows to look like https://bitbucket.org/duilio/streamr/src/26937b99e083/tests/... . The tool should take care of distributing the workflow nodes and caching the results, so that you can slightly change your script and avoid recomputing all the data. I hadn't used Celery before; maybe much of what I did for this prototype could have been avoided (e.g. the storage system could have been implemented as a Celery cache).
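To make the caching idea concrete, here is a minimal sketch (this is not the actual streamr API - the `node` decorator, cache layout, and toy steps are all hypothetical): workflow steps are plain functions whose results are memoized on disk, keyed by a hash of their inputs, so rerunning the script skips any step whose inputs haven't changed.

    # Hypothetical sketch of a cached workflow; not the streamr API.
    # Each step's output is pickled to disk, keyed by the step name and
    # a hash of its arguments, so unchanged steps are not recomputed.
    import hashlib
    import os
    import pickle

    CACHE_DIR = ".workflow_cache"

    def node(func):
        """Wrap a workflow step so its result is memoized on disk."""
        def wrapper(*args):
            os.makedirs(CACHE_DIR, exist_ok=True)
            key = hashlib.sha1(
                (func.__name__ + repr(args)).encode("utf-8")
            ).hexdigest()
            path = os.path.join(CACHE_DIR, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper

    @node
    def load_dataset(source):
        # Stand-in for reading raw data from disk or a database.
        return [{"text": "example %d from %s" % (i, source)} for i in range(1000)]

    @node
    def extract_features(rows):
        # Stand-in for feature computation, usually the expensive step.
        return [{"length": len(r["text"])} for r in rows]

    @node
    def train_model(features):
        # Stand-in for fitting a ranker or classifier.
        return {"mean_length": sum(f["length"] for f in features) / len(features)}

    if __name__ == "__main__":
        rows = load_dataset("train.tsv")
        feats = extract_features(rows)
        model = train_model(feats)
        print(model)

A real tool would also have to hash the step's code (so editing a function invalidates its cache) and ship each node to a worker, e.g. as a Celery task, but that's roughly the shape I'm after.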
Cool project on your side of things. I'm spending a bit of time myself trying to put together a compositional tool chain for machine learning tasks. Are there any major design choices you've thought through for your own project that you'd care to expound upon?