I feel so frustrated by the emergence of these things and the constant attempt to brand them or stylize them towards data science.
These are just engineering projects. They are not any other kind of thing.
For some engineering projects, you need to support dashboard-like, interactive interfaces that depend on data assets or other assets (like a database connection, a config file, a static representation of a statistical model, whatever). Sometimes you need a rapid feedback system to investigate properties of the engineering project and deduce implications for productively modifying it. These are universal requirements that span tons of domains, and have very little to do with anything that differentiates data science from any other type of engineering.
At the level of an engineering project, you should use tools that have been developed by highly skilled systems engineers, such as Make or Bazel, or devops tools for containers, deployment, and task orchestration, like Luigi, Kubernetes tooling, and many others.
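To make that concrete: Luigi is just Python, so an offline pipeline is a couple of ordinary task classes. Here is a minimal sketch, assuming Luigi is installed (the file names and the transform itself are made up for illustration):

```python
import luigi

class ExtractData(luigi.Task):
    """Stand-in for a real extraction job: writes a raw data asset."""

    def output(self):
        return luigi.LocalTarget("raw.csv")  # hypothetical path

    def run(self):
        with self.output().open("w") as f:
            f.write("id,value\n1,10\n2,20\n")

class TransformData(luigi.Task):
    """Depends on ExtractData; Luigi skips tasks whose output already exists."""

    def requires(self):
        return ExtractData()

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.strip().lower() + "\n")

if __name__ == "__main__":
    # Local scheduler for a sketch; a central scheduler works the same way.
    luigi.build([TransformData()], local_scheduler=True)
```

The same dependency graph could just as easily be a Makefile; nothing here is specific to data science.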
For a web service component, you should use web service tooling: existing load balancers, nginx, queue systems, key-value stores, and frameworks like Flask.
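As a sketch of that: a model behind a plain Flask endpoint, with nginx or any load balancer in front, needs nothing ML-specific. The model file and request shape below are hypothetical, and it assumes a pickled object with a scikit-learn-style predict():

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the static model artifact once at startup
# ("model.pkl" is a hypothetical file produced by an offline job).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Assumes a scikit-learn-style predict(); adapt to your model's interface.
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=8000)
```

From nginx's point of view this is just another upstream; queueing, caching, and scaling work the same as for any other service.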
For continuous integration or testing, use tools that already exist, like Jenkins or Travis, testing frameworks, load testing tools, profilers, etc.
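For instance, plain pytest covers model code the same way it covers everything else; the preprocessing function under test here is hypothetical:

```python
import pytest

def scale_features(rows, factor=1.0):
    """Hypothetical preprocessing step shared by training and serving code."""
    return [[value * factor for value in row] for row in rows]

def test_scale_features_preserves_shape():
    rows = [[1.0, 2.0], [3.0, 4.0]]
    scaled = scale_features(rows, factor=0.5)
    assert len(scaled) == len(rows)
    assert all(len(r) == len(rows[0]) for r in scaled)

def test_scale_features_applies_factor():
    assert scale_features([[2.0]], factor=0.5)[0][0] == pytest.approx(1.0)
```

Any CI server that can run a shell command can run this; there is nothing to buy or bundle.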
Stop trying to stick a handful of these things into a bundle with abstractions that limit the applicability to only “data science” projects, and then brand them to fabricate some idea that they are somehow better suited for data science work than decades worth of tools that apply to any kind of engineering project, whether focused on data science or not.
I might be wrong, and I also highly value and use the tools that you mentioned. However, I have to say that I see the emergence of these tools as a positive thing, for two reasons.
The first is that more and more people are now using, or trying to use, machine learning models in production, and they discover that the workflows and tools they are accustomed to are not suited to delivering machine learning models in a fast, repeatable, and simple way.
The second reason is that I honestly think a machine learning pipeline or CI/CD system is a bit different from one used for pure software engineering, partly because machine learning involves not only code but additional layers of complexity: data, artifacts, configuration, resources... All of these layers can impact the reproducibility of a "successful build". Hence, a lot of engineering is required both to ensure that teams can achieve reproducible and reliable results and to increase their productivity.
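To illustrate what I mean by extra layers, here is a rough sketch (the paths and config layout are invented) of why a build fingerprint for ML has to cover more than the code:

```python
import hashlib
import json

def build_fingerprint(code_version, data_paths, config):
    """Hash code version, data assets, and config together: a 'successful
    build' is reproducible only if all three layers are unchanged."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    for path in sorted(data_paths):
        with open(path, "rb") as f:
            h.update(f.read())
    h.update(json.dumps(config, sort_keys=True).encode())
    return h.hexdigest()

# Stand-in data asset so the sketch runs end to end.
with open("train.csv", "w") as f:
    f.write("id,label\n1,0\n")

# The same code at a different data or config state yields a different ID.
print(build_fingerprint(
    code_version="abc123",  # e.g. a git commit hash
    data_paths=["train.csv"],
    config={"learning_rate": 0.01, "epochs": 5},
))
```

In pure software engineering, the commit hash alone usually identifies the build; here it does not.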
I am a long-time practitioner of putting machine learning tools into production, improving ML models over time, doing maintenance on deployed ML models, and researching new ways to solve problems with ML models.
All I can say is that, based on my experience, I dramatically disagree with what you wrote.
I’ve always found pre-existing generalist engineering tooling to work more efficiently and cover all the features I need in a more reliable and comprehensive way than any of the latest and greatest ML-specific workflow tools of the past ~10 years.
I’ve also worked on many production systems that do not involve any aspect of statistical modeling, yet still rely on large data sets or data assets, offline jobs that perform data transformations and preprocessing, extensively configurable parameters, and so on.
I’ve never encountered or heard of any ML system that is in any way different in kind from most other general types of production engineering systems.
But I have seen plenty of ML projects that get bogged down with enormous tech debt stemming from adopting fool’s-gold ML-specific deployment / pipeline / data-access tools, running into problems that time-honored general system tools would have solved out of the box, and then needing to hack together extra layers of tooling on top of the ML-specific stuff.
I was going to make a less general comment along these lines: I have put models into production with GitLab CI, Make, and Slurm, and it keeps us honest and on task. There’s no mucking about with fairy-dust data science toolchains, and no excuse not to find a solution when problems arise, because we’re using well-tested methodology on well-tested software.
But I don't want to learn 50 different tools that are all best-in-class. My use case is social media analysis of specific communities with fairly limited resources, so every hour that I spend on tooling is time not spent observing my subjects.
You're not wrong about the value of all the different tools you mention, but I think you're overlooking the integration and maintenance costs that a specialty tool can reduce, at the expense of some flexibility. I think that's the same reason many people prefer an IDE.
Learning the time-tested tools almost always involves spending less time setting up, reading tutorials, and so on. The time sink of betting the farm on the latest and greatest data science frameworks is often gigantic and gets worse over time.