The full machine learning pipeline goes far beyond the scope of sklearn. Think about scheduled data reading or scheduled model updates. I think there is a need for frameworks like this to help with running ML.
Going far beyond the scope of sklearn pipelines isn't necessarily a good thing. I don't really see what there is to be gained by making scheduling part of the remit of pipelines.
I'd rather assemble a set of tightly scoped "UNIX philosophy" libraries and tools than use an all-encompassing framework and be straitjacketed by its imposed structure.
At work I have what are essentially cron jobs running scripts which invoke sklearn pipelines. I've never even thought to make the scheduler aware of what they were running and I'm not sure why I would.
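For concreteness, the scripts look roughly like this (the paths, columns, and model choice below are made up, just to show the shape; the scheduler only sees a script and an exit code):

```python
#!/usr/bin/env python3
# Invoked by the scheduler (e.g. a cron entry); knows nothing about the scheduler.
# Paths and column names are illustrative, not real.
import sys

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def main() -> int:
    df = pd.read_csv("/data/incoming/latest.csv")   # hypothetical data drop
    X, y = df.drop(columns=["label"]), df["label"]

    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    pipeline.fit(X, y)

    joblib.dump(pipeline, "/models/latest.joblib")  # hypothetical artifact path
    return 0


if __name__ == "__main__":
    sys.exit(main())
```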
> "I don't really see what there is to be gained by making scheduling part of the remit of pipelines."
I think it depends on the use case. Sometimes the components of the pipeline aren't running on the same machine and don't know where or how to access the data and artifacts generated by previous steps, so scheduling and orchestration become an important part of the pipeline itself.
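A rough sketch of the hand-off problem I mean, in plain Python rather than any real framework's API (all URIs below are hypothetical): each step could run on a different machine, so something has to tell each step where its upstream artifacts live.

```python
# Sketch of the artifact hand-off only; the URIs and step names are made up.

def extract_features(raw_uri: str) -> str:
    # A real step would read raw data from shared storage and write features
    # back; here we only show the location hand-off.
    features_uri = "s3://example-bucket/features/latest.parquet"
    print(f"reading {raw_uri}, writing {features_uri}")
    return features_uri


def train_model(features_uri: str) -> str:
    # A real step would load the features and persist a fitted model.
    model_uri = "s3://example-bucket/models/latest.joblib"
    print(f"reading {features_uri}, writing {model_uri}")
    return model_uri


def run_pipeline() -> None:
    # Without orchestration each script hard-codes or rediscovers these paths;
    # with it, outputs are threaded explicitly from one step to the next.
    features_uri = extract_features("s3://example-bucket/raw/latest.parquet")
    train_model(features_uri)


if __name__ == "__main__":
    run_pipeline()
```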
> "I'd rather assemble together a set of tightly scoped "UNIX philosophy" libraries and tools rather than try and use an all encompassing framework and be straitjacketed by its imposed structure."
I think the idea behind building such frameworks is to save people from repeating the same work of building that tooling internally by "assembling a set of tightly scoped 'UNIX philosophy' libraries". In general these frameworks use existing libraries and tools and expose an easy way to leverage them, instead of everyone spending that time over and over.
> At work I have what are essentially cron jobs running scripts which invoke sklearn pipelines.
If cron works for you, that’s great, and you should continue to use it. However, I would be interested to know how many data sources you have, how you handle failures in pipe segments, and your general throughput.
In more complicated flows, ones that require different data sets to be combined, or lots of data flows that depend on each other, moving to a DAG with event triggering is a much better setup in my experience. Data is generated faster, errors are handled more gracefully, and recovery is much faster since data is only recalculated when needed.
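Roughly what I mean by that, as a toy sketch in plain Python rather than any specific orchestrator (the task names and the staleness check are made up for illustration):

```python
# Toy DAG runner: each task declares its dependencies and only re-runs when an
# upstream output has changed since its last run. Real orchestrators add
# retries, alerting, backfills, and so on.

class Task:
    def __init__(self, name, fn, deps=None):
        self.name = name
        self.fn = fn
        self.deps = deps or []
        self.version = 0   # bumped each time this task actually runs
        self.seen = {}     # dep name -> dep version at our last run

    def stale(self):
        if self.version == 0:
            return True    # never run yet
        return any(self.seen.get(d.name) != d.version for d in self.deps)

    def run(self):
        for dep in self.deps:      # make sure upstream tasks are current first
            dep.run()
        if self.stale():
            print(f"running  {self.name}")
            self.fn()
            self.version += 1
            self.seen = {d.name: d.version for d in self.deps}
        else:
            print(f"skipping {self.name} (inputs unchanged)")


# Hypothetical flow: two sources combined into one data set, then a model.
source_a = Task("source_a", lambda: None)
source_b = Task("source_b", lambda: None)
combined = Task("combined", lambda: None, deps=[source_a, source_b])
train    = Task("train",    lambda: None, deps=[combined])

train.run()   # first call runs every task
train.run()   # second call skips everything: no upstream data changed
```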
I don't actually use cron directly. What I do use is capable of scheduling and error detection (nonzero exit codes). Even if it weren't, the script it invokes could do both with < 4 lines of code.
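Something along these lines, for example (the script name is hypothetical; the scheduler is whatever you already have that notices a nonzero exit):

```python
# Hypothetical wrapper: run the real pipeline script and surface failures via
# the exit code, which the scheduler already notices and alerts on.
import subprocess
import sys

result = subprocess.run([sys.executable, "run_pipeline.py"])  # hypothetical script
if result.returncode != 0:
    print(f"pipeline failed with exit code {result.returncode}", file=sys.stderr)
sys.exit(result.returncode)
```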
I think it would actually be professionally negligent to introduce coupling at this point.