I feel so frustrated by the emergence of these things and the constant attempt to brand them or stylize them towards data science.
These are just engineering projects. They are not any other kind of thing.
For some engineering projects, you need to support dashboard-like, interactive interfaces that depend on data assets or other assets (like a database connection, a config file, a static representation of a statistical model, whatever). Sometimes you need a rapid feedback system to investigate properties of the engineering project and deduce implications for productively modifying it. These are universal requirements that span tons of domains, and have very little to do with anything that differentiates data science from any other type of engineering.
At the level of an engineering project, you should use tools that have been developed by highly skilled systems engineers, such as Make or Bazel, or devops tools for containers, deployment, and task orchestration, like Luigi, Kubernetes tooling, and many others.
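To make that concrete: Luigi is just Python, so an offline pipeline is a couple of ordinary task classes. Here is a minimal sketch, assuming Luigi is installed (the file names and the transform itself are made up for illustration):

```python
import luigi

class ExtractData(luigi.Task):
    """Stand-in for a real extraction job: writes a raw data asset."""

    def output(self):
        return luigi.LocalTarget("raw.csv")  # hypothetical path

    def run(self):
        with self.output().open("w") as f:
            f.write("id,value\n1,10\n2,20\n")

class TransformData(luigi.Task):
    """Depends on ExtractData; Luigi skips tasks whose output already exists."""

    def requires(self):
        return ExtractData()

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.strip().lower() + "\n")

if __name__ == "__main__":
    # Local scheduler for a sketch; a central scheduler works the same way.
    luigi.build([TransformData()], local_scheduler=True)
```

The same dependency graph could just as easily be a Makefile; nothing here is specific to data science.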
For a web service component, you should use web service tooling: existing load balancers, nginx, queue systems, key-value stores, and frameworks like Flask.
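As a sketch of that: a model behind a plain Flask endpoint, with nginx or any load balancer in front, needs nothing ML-specific. The model file and request shape below are hypothetical, and it assumes a pickled object with a scikit-learn-style predict():

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the static model artifact once at startup
# ("model.pkl" is a hypothetical file produced by an offline job).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Assumes a scikit-learn-style predict(); adapt to your model's interface.
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=8000)
```

From nginx's point of view this is just another upstream; queueing, caching, and scaling work the same as for any other service.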
For continuous integration or testing, use tools that already exist, like Jenkins or Travis, testing frameworks, load testing tools, profilers, etc.
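For instance, plain pytest covers model code the same way it covers everything else; the preprocessing function under test here is hypothetical:

```python
import pytest

def scale_features(rows, factor=1.0):
    """Hypothetical preprocessing step shared by training and serving code."""
    return [[value * factor for value in row] for row in rows]

def test_scale_features_preserves_shape():
    rows = [[1.0, 2.0], [3.0, 4.0]]
    scaled = scale_features(rows, factor=0.5)
    assert len(scaled) == len(rows)
    assert all(len(r) == len(rows[0]) for r in scaled)

def test_scale_features_applies_factor():
    assert scale_features([[2.0]], factor=0.5)[0][0] == pytest.approx(1.0)
```

Any CI server that can run a shell command can run this; there is nothing to buy or bundle.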
Stop trying to stick a handful of these things into a bundle with abstractions that limit the applicability to only “data science” projects, and then brand them to fabricate some idea that they are somehow better suited for data science work than decades worth of tools that apply to any kind of engineering project, whether focused on data science or not.
I might be wrong, and I also highly value and use the tools that you mentioned. However, I have to say that I see the emergence of these tools as a positive thing, for two reasons.
The first is that more and more people are now using, or trying to use, machine learning models in production, and they discover that the workflows and tools they are accustomed to are not suited to delivering machine learning models in a fast, repeatable, and simple way.
The second reason is that I honestly think a machine learning pipeline or CI/CD system is a bit different from one used for pure software engineering, partly because machine learning involves not only code but additional layers of complexity: data, artifacts, configuration, resources... All of these layers can impact the reproducibility of a "successful build". Hence, a lot of engineering is required both to ensure that teams can achieve reproducible and reliable results and to increase their productivity.
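To illustrate what I mean by extra layers, here is a rough sketch (the paths and config layout are invented) of why a build fingerprint for ML has to cover more than the code:

```python
import hashlib
import json

def build_fingerprint(code_version, data_paths, config):
    """Hash code version, data assets, and config together: a 'successful
    build' is reproducible only if all three layers are unchanged."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    for path in sorted(data_paths):
        with open(path, "rb") as f:
            h.update(f.read())
    h.update(json.dumps(config, sort_keys=True).encode())
    return h.hexdigest()

# Stand-in data asset so the sketch runs end to end.
with open("train.csv", "w") as f:
    f.write("id,label\n1,0\n")

# The same code at a different data or config state yields a different ID.
print(build_fingerprint(
    code_version="abc123",  # e.g. a git commit hash
    data_paths=["train.csv"],
    config={"learning_rate": 0.01, "epochs": 5},
))
```

In pure software engineering, the commit hash alone usually identifies the build; here it does not.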
I am a long-time practitioner of putting machine learning tools into production, improving ML models over time, doing maintenance on deployed ML models, and researching new ways to solve problems with ML models.
All I can say is that, based on my experience, I dramatically disagree with what you wrote.
I’ve always found pre-existing generalist engineering tooling to work more efficiently and cover all the features I need in a more reliable and comprehensive way than any of the latest and greatest ML-specific workflow tools of the past ~10 years.
I’ve also worked on many production systems that do not involve any aspect of statistical modeling, yet still rely on large data sets or data assets, offline jobs that perform data transformations and preprocessing, extensively configurable parameters, and so on.
I’ve never encountered or heard of any ML system that is in any way different in kind from most other general types of production engineering systems.
But I have seen plenty of ML projects that get bogged down with enormous tech debt stemming from adopting fool’s-gold ML-specific deployment / pipeline / data-access tools, running into problems that time-honored general system tools would have solved out of the box, and then needing to hack together extra layers of tooling on top of the ML-specific stuff.
I was going to make a less general comment along these lines: I have put models into production with GitLab CI, Make, and Slurm, and it keeps us honest and on task. There’s no mucking about with fairy-dust data science toolchains, and no excuse not to find a solution when problems arise, because we’re using well-tested methodology on well-tested software.
But I don't want to learn 50 different tools that are all best-in-class. My use case is social media analysis of specific communities with fairly limited resources, so every hour that I spend on tooling is time not spent observing my subjects.
You're not wrong about the value of all the different tools you mention, but I think you're overlooking the integration and maintenance costs that a specialty tool can reduce, at the expense of some flexibility. I think that's the same reason many people prefer an IDE.
Learning the time-tested tools almost always involves spending less time setting up, reading tutorials, and so on. The time sink of betting the farm on the latest and greatest data science frameworks is often gigantic and gets worse over time.