Netflix's Metaflow: Reproducible machine learning pipelines (cortex.dev)
254 points by ChefboyOG on Dec 21, 2020 | 101 comments



If you are curious about how Netflix uses Metaflow to power behind-the-scenes machine learning, take a look at this recent blog article https://netflixtechblog.com/supporting-content-decision-make...

Also I'm happy to answer any questions (I lead the Metaflow team at Netflix).


Hey, been meaning to reach out.

There's a bit in the Metaflow docs that talks about choosing resources, like RAM: "as a good measure, don't request more resources than what your workflow actually needs. On the other hand, never optimize resources prematurely."

The problem is that for memory, too little means out-of-memory crashes, so the tendency I've seen is to over-provision memory, which ends up getting very expensive at scale.

This choice between "my process crashes" and "I am incentivized to make my process organizationally expensive" isn't ideal. Do you have any ways you deal with this at Netflix, or have you seen ways other Metaflow users deal with it?

I have some ideas on how this could be made better (some combination of being able to catch OOM situations deterministically, memory profiling, and sizing RAM by input size for repeating batch jobs), based in part on some tooling I've been working on for memory profiling: https://pythonspeed.com/fil, so I'd love to talk about it if you're interested.


I'd love to hear more about what you have in mind! Feel free to drop by our chat at https://gitter.im/metaflow_org/community

While it is true that auto-sizing resources is hard and the easiest approach is to oversize @resources, the situation isn't as bad as it sounds:

1) In Metaflow, @resources requests are specific to a function/step, so typically you only use the resources for a short while (see the sketch at the end of this comment). It would be expensive to keep big boxes idling 24/7, but that's not necessary.

2) You can use spot instances to lower costs, sometimes dramatically.

3) It is pretty easy to see the actual resource consumption on any monitoring system, e.g. CloudWatch, so you can adjust manually if needed.

4) A core value proposition of Metaflow is to make both prototyping and production easy. While optimizing resource consumption may be important for large-scale production workloads, it is rarely the first concern when prototyping.

In practice at Netflix, we start with overprovisioning and then focus on optimizing only the workflows that mature to serious production and end up being too expensive if left unoptimized. It turns out that this is a small % of all workflows.
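
To make point 1 above concrete, here is a minimal sketch of per-step @resources requests (the step names and sizes are illustrative, not a recommendation); the requests only take effect when the flow runs remotely, e.g. with --with batch, and are ignored for local runs:

    from metaflow import FlowSpec, step, resources

    class TrainFlow(FlowSpec):

        @step
        def start(self):
            # Lightweight bookkeeping: the default resources are fine here.
            self.next(self.train)

        # Only this step asks for a big box; memory is in megabytes.
        @resources(memory=32000, cpu=8)
        @step
        def train(self):
            # ...heavy training work happens here...
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        TrainFlow()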


One way I deal with this problem in an alternative workflow manager (Nextflow), is by calculating the memory requirement for the ~95th percentile of a job, and submitting with a rule "If this crashes from going OOM, re-submit with memory*N" (up to some max number of retries/RAM). This lets most jobs sail through with a relatively low amount of RAM, and the bigger jobs end up taking a bit more time and resources.

The better your estimator function, of course, the tighter constraints you can use.
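
The escalation rule itself is simple; here is a rough Python sketch of the logic, assuming your scheduler exposes some way to submit a job with a memory limit and to detect an OOM kill (the `submit` callable and its result fields are hypothetical placeholders, not a Nextflow or Metaflow API):

    def run_with_memory_escalation(submit, job, base_mem_mb, factor=2.0, max_retries=3):
        """Submit `job` with a ~95th-percentile memory estimate; if the run is
        OOM-killed, resubmit with the memory limit multiplied by `factor`."""
        mem = base_mem_mb
        for _ in range(max_retries + 1):
            result = submit(job, memory_mb=mem)   # hypothetical scheduler call
            if not result.get("oom_killed", False):  # assumed flag in the result dict
                return result
            mem = int(mem * factor)               # escalate and retry
        raise RuntimeError(f"{job!r} still OOM after {max_retries} retries at {mem} MB")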


In many cases the memory usage is linear with input data, so you can come up with a function that predicts memory usage, add some padding, and then you don't need retries. E.g. this example here: https://pythonspeed.com/articles/estimating-memory-usage/
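
A minimal sketch of such an estimator, assuming you have (input size, peak memory) measurements from past runs (the numbers below are made up):

    import numpy as np

    # Observed (input size in GB, peak memory in GB) pairs from past runs -- made-up numbers.
    sizes = np.array([1.0, 2.0, 4.0, 8.0])
    peaks = np.array([2.1, 3.9, 7.8, 15.5])

    slope, intercept = np.polyfit(sizes, peaks, deg=1)

    def predicted_memory_gb(input_size_gb, padding=1.2):
        """Linear prediction plus 20% padding to absorb run-to-run variation."""
        return padding * (slope * input_size_gb + intercept)

    print(predicted_memory_gb(6.0))  # request roughly this much RAM for a 6 GB input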


The Spark community (LinkedIn) developed Dr Elephant to profile jobs and provide suggestions for reducing memory/cpu consumption. Metaflow would need something similar:

https://github.com/linkedin/dr-elephant


I've seen that, yeah. I've already implemented a memory profiler for Python batch jobs (https://pythonspeed.com/fil), but starting to think about how to integrate it into specific pipeline frameworks.


It would be great if the infra layer could provide some help with automated resource scaling, especially for RAM. The ML solver/tooling layer has also been making progress on this front: for example, Dask for limited-RAM pandas, h2o.ai has limited-RAM solvers, xgboost has an external-memory version, and pytorch/tensorflow models are mostly trained with SGD and only need to load data batch by batch. It's nice that Metaflow can integrate with any Python code and thus benefit from all of the efforts made on the solver/tooling layer.
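
As an example of that last point, here is a minimal Dask sketch of the larger-than-RAM pandas pattern, usable from inside any Python step (the file path and column names are illustrative):

    import dask.dataframe as dd

    # Lazily reads a dataset that may not fit in RAM; Dask processes it partition by partition.
    df = dd.read_csv("s3://my-bucket/events-*.csv")  # illustrative path

    # Familiar pandas-style operations, evaluated out-of-core on .compute().
    total_minutes = df.groupby("user_id")["watch_minutes"].sum().compute()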


What is Metaflow's explicit support for transfer learning tasks? In other words, how do I know which models to use or not use? I am surmising from the techblog post that there is a stable set of content-intrinsic features, and that it can be separated from perhaps more dynamic feature sets that characterize audiences, presentation treatment, and viewing (as conditioned on all the other stuff). But it sounds like there is a stable set of features for prediction tasks as well, which is to say that for a task like predicting an audience for movie X in region Y, you'll need some set of features, and we have some set of trained models (and recommended analytic components) available that match some or all of those features for this task. Is that a "thing", or is the workflow support simpler than that, and should that be a "thing"?


Good question! What you are asking is pretty much the core question for a certain set of ML tasks at Netflix.

Metaflow is rather unopinionated about those types of questions, since they are subject to active research and experimentation. Metaflow aims to make it easy to conduct the research and experiments but it is up to the data scientist to choose the right modeling approach, features etc.

In some cases, individual teams have built a thin layer of tooling on top of Metaflow to support specific problems they care about. I could imagine such a layer for specific instances of transfer learning, for instance.

In general, we are actively thinking about if and how Metaflow could support feature sharing. It is a tough nut to crack.


Hi! How do you handle floating point determinism? Can some ML be reproduced on any architecture? Can you build the code with another compiler version? Can you use newer SIMD instructions?

Or you're forever tied to the initial hardware+compiler version?


Reproducibility is a spectrum. A good starting point is to snapshot the exact version of the code that produced a model. Even better, you should snapshot the full dependency graph, including transitive dependencies, of all (compiled) libraries, which Metaflow does with @conda. Together with data snapshots, this gives a pretty good foundation for reproducibility.

Depending on the libraries you use, the exact results may or may not be reproducible on other architectures. If cross-platform reproducibility is important to you, you should choose your libraries accordingly. Metaflow provides the tools for choosing the level of reproducibility that your application requires.
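
For example, here is a minimal sketch of dependency snapshotting with @conda/@conda_base (the library versions are illustrative); running the flow with --environment=conda makes Metaflow resolve and record these environments so re-executions see the same versions:

    from metaflow import FlowSpec, step, conda, conda_base

    # Pin the interpreter and shared libraries for every step in the flow...
    @conda_base(python="3.8.5", libraries={"numpy": "1.19.2"})
    class ReproFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.train)

        # ...and pin step-specific libraries where needed.
        @conda(libraries={"scikit-learn": "0.23.2"})
        @step
        def train(self):
            import sklearn  # resolved from the snapshotted environment
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        ReproFlow()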


Is there any library in the metaflow ecosystem that offers floating point determinism? I would like to read more, can you link somewhere?

Also, what tools does Metaflow offer to control the level of reproducibility?


Hi Ville, thanks for coming on here to answer questions. I see that Metaflow has been made compatible with R now. Are there any plans to do the same with Julia?


No plans but it should be possible technically. We'd need help from the Julia community :)


Are there any built-in dashboards for such actions as querying or analysing model versions and the metadata around them?


Coming soon!


Edit: this is a somewhat OT rant.

Netflix’s recommender system is hands down the worst I have ever seen. Every single thing I watch, it suggests The Queen’s Gambit and two other random Netflix productions.

Even if I watch the first of a trilogy (LotR, for example). How can they be so terrible at this?

The categories in the main browsing view are also hysterically arbitrary. It kind of looks like a topic model with back-constructed titles for the topics.

Finally, they replaced the ratings with “% matching”. I guess so they can recommend their subpar productions even if they get low ratings.


>Finally, they replaced the ratings with “% matching”. I guess so they can recommend their subpar productions even if they get low ratings.

That's always what ratings were. People didn't understand that (as you can see), so they changed it to make it more transparent.

https://www.businessinsider.com/why-netflix-replaced-its-5-s...

>Netflix’s star ratings were personalized, and had been from the start. That means when you saw a movie on Netflix rated 4 stars, that didn’t mean the average of all ratings was 4 stars. Instead, it meant that Netflix thought you’d rate the movie 4 stars, based on your habits (and other people's ratings). But many people didn’t get that.


It's not just that (obviously, since it's not even possible for me to rate a movie with more granularity than thumbs up/down).

What happened is that the notion changed from "predict a scalar rating" to "predict a binary satisfaction."

As parent poster noted, the effect of this is to push "3 star" and "4 star" acceptable shows to the user, instead of "5 star" great (in the user's view) shows.

Also, in Netflix's defense, users are horribly inconsistent in their expressed ratings (they'll rate a movie based on their personal mood at the time, and they'll binge shows they claim aren't 5 stars while ignoring their 5 star movies)


This was a sad turning point for me. We used to have a single streaming platform with an awesome library, a granular review system, and user reviews. You could easily take a quick look to read other users' thoughts on a film. Now I have to look up Metacritic / reviews myself, and the NF recommendation is based on whether I've said something is 'palatable enough to watch, isn't terrible, but I'd never watch it again' (thumbs up). I've taken to only thumbs-upping stuff that I particularly like, to see if that's any better. It all seems to be the same from an uninformed end-user perspective.

I remember they stated the reviews would still be available in some form for export.

It's no longer the Netflix of old imo.


I've wondered if the lack of this stuff is due to business contracts or internal product goals. Not having e.g. IMDB makes sense since it is owned by a competitor (Amazon; whether that's a USA Trust thing, who knows)


>Also, in Netflix's defense, users are horribly inconsistent in their expressed ratings (they'll rate a movie based on their personal mood at the time, and they'll binge shows they claim aren't 5 stars while ignoring their 5 star movies)

This is exactly why they switched from 5 stars to up/down/blank.


Not just the recommender, but the design is horrible for me. I can't rest my mouse anywhere without it auto-starting or enlarging something; it's all too twitchy. When you're watching an episode, there's no navigation link from the play screen to the main page of the series. As if they don't want us to navigate the site, but instead be led along their happy path.


Good news is that you can disable autoplay of previews [0]. Bad news is that it takes weeks to propagate this setting change to all clients. I had to wait about two weeks for my Apple TV client to stop auto previews. I suspect the queue service is powered by snail mail.

[0] https://help.netflix.com/en/node/2102


That must be because of an excessive amount of caching on the client-side


You would hope that logging out and back in would refresh these settings.


It does not.


I think so. This way, you don't easily arrive at the conclusion that the items you want to look for don't exist here.


I long ago ceased to believe that the Netflix recommender system serves any other purpose than to fulfill the company's internal obligations to push favored content, depending on what it cost. Sadly, the same is now true for Amazon Prime, which is an even hotter mess.


I turned on "Super Wings" for my kid to watch on Prime Video, which at first glance seemed to be a fairly decent Paw Patrol knock-off, but then as I listened to the episodes in the background, I realized that the entire show is basically an advertisement for Amazon Prime in disguise. Seriously, look it up: the entire premise of the show is people ordering packages and the "Super Wings" delivering the packages to the consumer as quickly as possible...


Seems like you got what you expected... Super Wings is the corporate side of the Government-Corporate complex.


That's exactly what's happening. Source: you can imagine.


Maybe it's different in other countries (I am in the UK), but I feel as if there isn't enough content for a recommender system to even be useful. I feel like after browsing through the catalogue a few times, I have a rough idea of most things I would ever possibly be interested in. There's just not that much there. Either that, or the recommender system is working too well and I never see anything beyond what Netflix wants me to.


I have to agree. As far as I recall the Netflix search engine has never found the exact title I've been searching for. Even the recommendations it then shows me are so far off piste that they're not really even close to what I'm looking for.

At least with Amazon Prime there's a high chance they'll at least find the title one searches for and if I am really motivated I can pay a few bucks to watch it.

Netflix just draws a blank.

That's not to say I haven't watched some entertaining things on Netflix, but they seem much better suited to TV series than movies, and I almost feel like I found decent things to watch in spite of their recommendation system, not because of it.

An ML tool from Netflix? It feels like the last thing I'm likely to use.


>Netflix’s recommender system is hands down the worst

Until you log into prime video.

It can't manage to give me a “continue watching the last thing” button. That's literally the most likely thing I want to watch. It also routinely suggests starting with S02 even though I've not watched S01.

Never mind machine learning, some common sense would be greatly appreciated.


Netflix actually organized a major machine learning competition more than a decade ago[1], with thousands of the best researchers trying to beat an internal benchmark by a few percent for a $1M prize. I'm wondering where all that "learning" went.

My guess is that whatever rankings they currently produce maximize some internal revenue target, and that target's user base is not me or you.

[1] https://en.wikipedia.org/wiki/Netflix_Prize


Yup. It just ends up irritating users though if the suggestions break basic common sense.

If I watched S01E04 yesterday I want to watch S01E05 today. The interface should be suggesting that PLUS whatever else the ML comes up with in addition, not instead of.

Prime is full of minor irritations like that, which make me wonder whether Amazon engineers dogfood enough.


Their recommender system likely has multiple inputs that are under specific constraints and weightings. How well their recommendation algorithms work will probably never be known by the general public as we are force-fed a steady diet of whitelisted staff picks and promotional items.


So what's the point of making tools like Metaflow? Justifying salaries for their engineers I suppose?


I see this complaint about poor recommendations very often.

But the recommendations seem to work perfectly for me - I wonder whether that is the case for the silent majority?

The match % is usually spot on for me, and I've never seen it recommend any titles I've given a thumbs-down to.


For many of the complainers, Netflix is showing them something they would actually enjoy. How to communicate that is challenging. I remember Pandora’s generated stations having a similar issue years ago, where it would catch similarities between on-trend bands and bands that had fallen out of favor. It was right, but it could sometimes be hard to submit to the insight of the algorithm. Spotify is one that I think does this well as they understand how not to give a recommendation that insults or offends, even if it’s technically an accurate response.


Do you have a counter example of a streaming service that does recommendations better?


Back when I used Netflix primarily for DVDs, the recommender system worked pretty well for me.

Much later, when they switched to simple thumbs up/down, the recommender system was entirely useless to me. (Not merely because of the dumbed-down rating system; the recommendations were genuinely bad.)

For the time in between, I'm not sure whether the degradation was gradual, sporadic, or whether there was any degradation at all.


It's pretty clear that their recommendation system and broader UX is designed, at least in part, to obfuscate how much content they have and how good it is. Back when it was DVDs and they had basically everything it was more about finding the next best thing for you. Now it's finding the next best thing they have and, preferably, something they own the rights to.


If you add a DVD subscription you can still get the original Netflix recommendations back, including sorting by top predicted rating (which to me is eerily amazing). I used to keep the DVD subscription mainly for the recommender, with the delivered Blu-rays considered an extra bonus for very rare movies you couldn't stream even if you wanted to pay.


It’s not a home run for me, but I think Spotify recommendations are quite good. They clearly use some form of content based recommendation (extracting features from the music itself) blended with other methods. It seems to make an honest attempt at serendipity (songs/artists you may like but would otherwise be unlikely to discover).

I still think recommender engines should always enable some form of user tuning. If it doesn’t, then the recommender is a tool for services to control your behavior rather than the other way around.


Unless I’m missing this functionality somewhere (entirely possible), this lack of user-tuning has ruined Spotify recs for me, since I listen to entirely different playlists when working or meditating. I don’t check out recommendations to get the latest binaural beats or nature sounds, you know?

Although even before I started listening to Spotify while working etc, it seemed to have run out of things to recommend. My weekly discover playlist would be half things I’d already liked. So...who knows. But I miss the discovery functionality quite a bit.


I personally have a similar problem with the Spotify generated playlists, but I have persnickety preferences in electronic music so it often whiffs. But on the spectrum of recommendation engines it is on the side of honest effort (whereas Netflix is not). And for most users and musical palates I think it works great. Just yesterday my mom complimented me on my Christmas music DJ skills, but it was just the generated continuation of her own playlist.

Nothing really beats the recommendations of a human curator with exquisite taste, and just listening to new music nonstop and plucking out the gems as you go.


I like Spotify's recommendations and have found lots of great artists from it. But now I feel I'm in a kind of Spotify-created rut of listening to static groups of artists. I have mostly stopped listening to my daily mixes because of this.

It's also easy to theorize about "big media" controlling what I listen to, so I still feel the need to do my own exploring, even when I'm getting recommended good fresh stuff.


The basic problem is understanding WHY I liked something. If I watch Tintin because it’s a cozy throwback to my childhood, that does not mean I would like every single Studio Ghibli movie in my recommendations.

Similarly, if I play Blacklist in the background as basically noise, I don’t want to see a bunch of related shows. I guess I could give it a thumbs down but I only do that for actually terrible movies.

Also Spotify and Apple music seem to have okay recommendations


Music is different from movies and television though. There are beats and rhythms that are easy to identify, lyrics easy to analyze and artists that are roughly categorized.


I feel like that’s not true, music is hard to analyze. Movies have synopses, and are generally categorized by their cast alone. I could figure out LotR belongs together from these datapoints, easily. Not to mention the fact that people watch trilogies in sequence like 99% of the time.


Recommendations maybe not, but I love that on HBO I can sort by IMDB score to find great movies/series that I haven't watched yet. I just watched Boardwalk Empire as an example, and loved it.

Netflix doesn't show IMDB scores, so I always have to check them independently... it sucks.


Not a streaming service, but a lot of movie and TV databases tend to have decent recommendations for movies and shows similar to the one you're viewing. Although in this case, one could argue that the processing is offloaded to users who provide the recommendations.


I read this as a condemnation of recommender systems in general.

That or a ceding of recommendations to marketing. That feels too cynical, but more accurate.


Spotify's Discover Weekly has a pipeline straight into my brain; it knows not only what I like, but what I will like in the future.


The MovieLens recommender system works pretty well.


Spotify. Discover Weekly is very good, particularly considering they have only one chance to recommend with 30-40 songs, and quite a large universe to match on.

Not a streaming service but Google's Discover news is also very good (probably the best recommendations I have come across).


Netflix's biggest problem is not the recommendation engine, but their lack of content. If they allowed you to filter out the junk you don't want to see and what you've already seen, there would be hardly anything left. They've been losing rights to stream left and right. That's why there's such a rush to produce their own content in other countries. Some of that is halfway decent, but there's a lot of formulaic, repetitive stuff.

And while saying that, I appreciate their high level technical staff. These decisions are made by bean counters.


> Every single thing I watch, it suggests The Queen’s Gambit and two other random Netflix productions.

I always assumed those three were human selections, not part of the recommendations.

Like the paid ads on top of your search results.


> recommender system is hands down the worst I have ever seen.

In my limited experience, they are all dismal and best ignored entirely.

When it was easier to see something plausibly like a real user review, that helped I guess.


I don't think it's the worst, it's just more that you've made the assumption that Netflix wants a pure set of recommendations. Marketing and business drivers will always trump algo results, so what you're seeing are mostly artificial boosts given to globally 'hot' properties.


Can we please take these off-topic rants somewhere else? Lately it's hard to find any insightful comments in the discussion section of engineering or technical articles, because the whole conversation gets derailed by unrelated off-topic rants. There are plenty of social media sites, forums, and even HN submissions where everyone can rant. Please keep the technical submissions clean.


Fully agree with this. Yuck, Netflix recos are ultra bad.


Netflix often promotes new content by overzealously recommending it; the constant suggestion you see for The Queen’s Gambit is probably an ad.


I really think that if they gave up all the complicated algorithms and went with a simple algorithm out of the 90s we'd be much happier with the recommendations.


Do you have such an algorithm?

Netflix doesn't make money by showing people stuff they don't like.


They actually might... think of it like gym memberships. Sure you need a hook, but as The Mandalorian showed, that can be a single exclusive TV show that doesn’t require much broadband cost. Maybe not the best solution long term, but since when has that been the shareholders’ goal?


Not for a lack of trying though...


Setting up a decent, comprehensive, self-hosted (!) ML environment is still extremely, frustratingly difficult.

What I really want is a single solution, or a set of pluggable, integrated components that offer:

* training data and model storage (on top of a blob store like S3, minio, ...)

* interactive dev environments (Notebooks, dev containers, ...)

* training (with history, comparisons, parameters, ...) with experiments for parameter tuning

* serving/deploying for production

* a permission system so researchers and developers can only access what they are supposed to

* software heritage, probably via Docker images or Nix packages, combined with source code references

* (cherry on top: some kind of integrated labeling system and UI)

Right now you have to cobble this together from different tools that are all pretty suboptimal.

The big players can set up sophisticated systems, but I'm curious to hear how other startups are currently solving this.


Hi @the_duke,

Disclaimer: I am one of the authors of an open-source solution (https://github.com/polyaxon/polyaxon) that specializes in the experimentation and automation phases of the data-science lifecycle.

Our tool provides exactly the kind of abstraction you mentioned:

* Training, data operations, and interactive workspaces (https://polyaxon.com/docs/experimentation/)

* A scalable history and comparison table (https://polyaxon.com/docs/management/runs-dashboard/comparis...)

* Currently, pipelines and concurrency management are in the commercial version (https://polyaxon.com/docs/automation/), but several companies use Polyaxon with other tools like Kubeflow (https://medium.com/mercari-engineering/continuous-delivery-a...), or it can be used with Metaflow for the pipelines part.

I would really like to hear your thoughts and feedback.


> but several companies use Polyaxon with other tools like Kubeflow or it can be used with MetaFlow for the pipelines part.

Isn't that what the parent you are replying to is talking about with "Right now you have to cobble this together from different tools that are all pretty suboptimal."


Our tool provides several solutions, however, we do not force users to use all of these abstractions. It's very important for us that our product is interoperable with the rest of the ecosystem.

If a company is already using a pipelining tool, a visualization tool, or a data management tool, Polyaxon will work and integrate with those tools seamlessly.

That being said, and I fully understand where the OP is coming from, there are several companies not interested in managing several solutions and all the complexity that comes with the infrastructure, deployment, maintenance, upgrades, user facing clients, authn/authz, permissions... Polyaxon provides the right abstractions for covering the experimentation and the automation phase.


I think the solution to this is a bunch of pluggable tools that integrate well. "AI Platforms" do everything, but they do each thing not very well and force you into a particular way of working. (There is a reason we don't use "software platforms" any longer.)

But unfortunately, as you say, most of the pluggable tools are not very good and/or not mature enough.

Here's our attempt at model storage, experiment tracking, and software heritage: https://replicate.ai/

For interactive dev environments, Colab, Deepnote, and Streamlit are all great.

For deploying to production, Cortex mentioned in the post is great.

All are a work in progress, but I think we'll soon have a really powerful ecosystem of tools.


Same - I'm a SWE embedded in a small (but growing) ML team. We have all of the same problems.

It seems that the "all-in" platforms are too "rigid", and all of the point solutions for the things you mentioned aren't proven enough.


I think that by definition this is a tradeoff. Most times you talk to data scientists, they want a fully automated end-to-end solution that doesn't require them to change anything about their current workflow, and that would support any future modifications to their workflow as well.

That is magical thinking. I prefer best-of-breed solutions that integrate nicely with other best-of-breed solutions, any day. That way, if a tool doesn't suit you tomorrow, you can relatively easily swap it out for something better.


You should check out Hopsworks (disclaimer: I work on it). It does all of the above, including a Feature Store, notebooks as jobs, Airflow for ML pipelines, model serving (TensorFlow Serving, Flask, and soon KFServing), experiments, a project-based multi-tenancy model that supports sensitive data on a shared cluster, and a UI. It does not have a labelling system, but you can pip install libraries. You don't need to learn Docker: each project has a conda environment, and we compile Docker images for projects transparently, so jobs are tied to Docker images but you don't need to write a Dockerfile (this is a huge win for data scientists). You can run Python jobs (connect to a k8s cluster) or Spark/Flink jobs (on Hopsworks itself).

Open-source:

* https://github.com/logicalclocks/hopsworks

Managed platform on AWS/Azure (with elastic compute/storage, integration with managed K8s, LDAP/AD):

* https://hopsworks.ai


I'd love to hear what you think about https://dagshub.com/. We're building it with community collaboration in mind. It doesn't cover all the bases you mention, but we do:

* data and model storage

* experiment tracking

* pipeline management

* access control

* data, model, code, pipeline versioning

We're also strictly based on Git and other Open Source formats and tools so connecting with other tools you use like Colab for IDEs or Jenkins/Kubeflow for training is super straightforward (we have examples for some)


Others have mentioned some cool projects in this space, but you mentioned self hosted specifically so I’ll share what we’re working on since it might match what you’re looking for.

As a new project, we are still figuring out some of the major topics you described.

In short, we built a data science pipeline tool that should fit well with existing workflows in machine learning and data science. We chose to embrace and integrate open source projects to create a simple and seamless experience with best in breed solutions for various tasks.

We are particularly happy with our deep integration of JupyterLab, building on the excellent Jupyter Enterprise Gateway project from IBM (CODAIT) for connecting kernels directly to your pipelines. For scheduling we build on top of Celery combined with containerization primitives. For stable and well-defined dependency management we built a small environment abstraction on top of Docker. It works really well in our experience!

Feel free to check out the project on https://github.com/orchest/orchest

Self hosting should be as easy as running about two lines of code.


Hi! Metaflow ships with a CloudFormation template for AWS that automates the set-up of a blob store (S3), compute environment (Batch), metadata tracking service (RDS), orchestrator (Step Functions), notebooks (SageMaker), and all the necessary IAM permissions to ensure data integrity. Using Metaflow, you can then write your workflows in Python/R and Metaflow will take care of managing your ML dev/prod lifecycle.

https://github.com/Netflix/metaflow-tools/tree/master/aws/cl...


There are lots of great platforms and tools in this space that are trying to solve these problems. They all have their tradeoffs and of course the list of needs/goals above is pretty diverse.

You are very likely going to be using a handful of tools that cover the full gamut of needs. This is heavily discussed in the blog post below about an MLOps Canonical Stack, and many of the tools suggested in this thread are included.

https://towardsdatascience.com/rise-of-the-canonical-stack-i...


Just wanted to let you know I favorited this comment for the responses it garnered. Great question, great efforts at providing answers.

This could be its own “Ask HN” thread.


We at Valohai (https://valohai.com/) check all the mentioned feature boxes and serve a lot of startups:

* data stores: automatic download/upload to/from AWS S3, Azure Blob, Google Cloud Storage, OpenStack Swift or stores that implements S3-like interface

* interactive environments: we do have notebook hosting with automatic orchestration

* training: history, comparisons, parameters, hyperparameter tuning with Optuna, Hyperopt or custom optimizer (https://github.com/valohai/optimo); additionally visualizations about training progress and hardware resource monitoring

* serving for production: our deployments allow you to build, push, manage and monitor HTTP/S based services on Kubernetes clusters (https://docs.valohai.com/core-concepts/deployments/) but you can just as easily download your model and deploy it yourself as your use-case requires

* a permission system: we have organization management with teams and such, but your mileage may vary depending how fine grained control you need

* software heritage: all runs are containerized and how they were run is recorded, so everything is reproducible if the base image and data exist at the original source; we also keep track of data heritage (which files X were used to produce which files Y: https://valohai.com/patch-notes/2019-09-03/)

* labeling system/UI: full web UI, command line client and a REST API (https://docs.valohai.com/valohai-api/) but no labeling tools though

Essentially your whole machine learning pipeline under one roof; from data preprocessing and training to deployment and monitoring. Also, we are technically agnostic, you can just as easily run Python/Julia/C++ or Unity engine to generate synthetic datasets (https://www.youtube.com/watch?v=QxMuWuk_W10)

not self-service or free though; our technical support team handles all the setup and maintenance

let me know if you have questions about Valohai or MLOps (https://valohai.com/mlops/) in general, I've seen quite a lot of projects and pipelines as I work at Valohai as an ML engineer helping our customers to setup end-to-end ML pipelines


If you are feeling overwhelmed with yet another machine learning pipeline automation framework, you should check out Kedro (https://github.com/quantumblacklabs/kedro).

Kedro has the simplest, leanest, functional-programming-inspired pipeline definition, readily exports to Airflow and other formats, and comes with an integrated visualisation framework which is stunning and effective.
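
For a flavor of that style, here is a minimal sketch of a Kedro pipeline built from plain functions (the function and dataset names are illustrative):

    from kedro.pipeline import Pipeline, node

    def split_data(raw_data):
        # Toy split: first half for training, second half for testing.
        mid = len(raw_data) // 2
        return raw_data[:mid], raw_data[mid:]

    def train_model(train):
        # Stand-in for real model fitting.
        return {"n_training_rows": len(train)}

    # Each node maps named catalog datasets onto a plain Python function.
    pipeline = Pipeline(
        [
            node(split_data, inputs="raw_data", outputs=["train", "test"]),
            node(train_model, inputs="train", outputs="model"),
        ]
    )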


I really like the combination of these two tools.

I've played with Cortex before, and it is easy to use, but I still question whether automating Kubernetes deployments through an easy code interface, without much Kubernetes know-how, is safe.

In my experience, even when you have a tool automating a lot of kubernetes for you, you will still run into trouble that will be best handled if you are familiar with kubernetes. I'm not sure what debugging utilities cortex has, but I think the ultimate solution to this problem will be a tool that truly allows users to not think about the fact their deployments are running on kubernetes at all.

I'm also interested in the similarities of Cortex and Seldon-core. Of course, seldon-core does not automate infra provisioning, but based on my previous point, I think many teams are better off being more hands on with this infra.

Lastly, there is a third tool missing from the mix - monitoring. I think cortex offers some tools in this area, but I wish they would make a part two showing how the monitoring functionality they offer can integrate into a retraining pipeline within metaflow. This post shows you how to get started, but it doesn't show you how to maintain applications long term.


How does this compare against TensorFlow Extended (TFX)? https://www.tensorflow.org/tfx


Metaflow was built to assist in both developing ML models and deploying/managing them in production. AFAIK, TFX is focused on the deployment story of ML pipelines.

https://docs.metaflow.org/introduction/what-is-metaflow#shou...


It's focused on building ML pipelines (similar to what Cortex aims to be). In addition, it also conveniently supports integration with orchestrators like Airflow, Kubeflow, Beam, etc. The book "Building Machine Learning Pipelines: Automating Model Life Cycles with TensorFlow" (https://www.amazon.com/dp/1492053198/ref=cm_sw_em_r_mt_dp_Ig...) goes into great detail.

I was curious to see what advantage Metaflow offered over TFX.


Has anyone done a comparison of ML pipelines from a devops-centric perspective?

For example, Metaflow doesn't support Kubernetes today - https://github.com/Netflix/metaflow/issues/16

so ultimately the scale-up story in most of these management tools is iffy.

I previously asked about Kubeflow here - https://news.ycombinator.com/item?id=24808090 . Seems people think it's pretty "horrendous". It seems most of these tools assume a very specialised devops team who will work around the ML tool... rather than the ML tool making this easy.


This would be super useful.

Based on this thread, the comparison should include

* metaflow (model training on AWS Batch)

* polyaxon (model training on kubernetes)

* pachyderm (experimentation)

* hopsworks (model training/serving and more, mostly on kubernetes)

* cortex (model serving on kubernetes)

* seldon-core (model serving and monitoring on kubernetes)

and likely more that I missed.

I can see why it would be so hard to put together this comparison.

Even with all these tools, there is still a lot of manual work for data scientists, or for the DevOps engineers the data scientists pass their models off to.

It also seems there is yet to be a fully open-source DevOps stack. Most companies still build custom software to glue together manual processes (like integrations between different tools for training, deploying, monitoring, etc.). This could be one factor in why comparisons of these tools and stack discussions have not been more popular - they can't share them yet.


As of today, Metaflow supports AWS Batch for scaling out and up. While supporting K8S would be convenient from an operations point of view (assuming you have a K8S cluster already), it doesn't make the scalability story any better.


Sure - I'm not talking about scalability from a science perspective, but really from a practical-applicability perspective.

Most people are already running k8s of some kind. I'm beginning to see k8s increasingly as an invariant. Plus, not all of us are on AWS.

Secondly, AWS Batch is only applicable to Metaflow for ML training.


I am trying to figure out Kubeflow. Surprisingly, I found this one easier to write. I haven't run it or used it yet.


Take a look at metaflow.org/sandbox if you want to test drive Metaflow.


It takes me into a verification and waiting flow. Useless.


Give it a few minutes :)


This looks similar to Google's Mediapipe [0]

[0] https://github.com/google/mediapipe


Is there a comparison with TFX, KubeFlow pipelines and AirFlow?


What are the main advantages compared to Airflow? We use Airflow to orchestrate ML jobs/tasks, and I found it to be more flexible compared to the other tools we tested.


Metaflow is largely complementary to a job scheduler like Airflow. Technically you could export Metaflow workflows to Airflow, although the specific integration doesn't exist yet. For more details, see this blog article https://netflixtechblog.com/unbundling-data-science-workflow...



