Launch HN: UpTrain (YC W23) – Open-source performance monitoring for ML models
138 points by vvipgupta on March 8, 2023 | hide | past | favorite | 31 comments
Hello, we are Shikha, Sourabh, and Vipul - co-founders at UpTrain, an open-source ML observability toolkit. UpTrain helps you monitor the performance of your machine learning applications, alerts you when they go wrong, and helps you improve them by narrowing down on data points to retrain on, all in the same loop.

Our website is at: https://uptrain.ai/ and our Github is here: https://github.com/uptrain-ai/uptrain

ML models tend to perform poorly when presented with new, previously unseen cases, and their performance also deteriorates over time as real-world environments evolve, which can degrade business metrics. In fact, one of our customers (a social media platform with 150 million MAU) was tired of discovering model issues via customer complaints (and increased churn) and wanted an observability solution to identify them proactively.

UpTrain monitors the difference between the dataset the model was trained on and the real-world data it encounters during production (the wild!). This "difference" can be captured by custom statistical measures designed by ML practitioners for their use case. That customization matters because, in most cases, there's no "ground truth" against which to check whether a model's output is correct. Instead, you need statistical measures to surface drift or performance degradation, and those require domain expertise and differ from case to case. For example, for a text summarization model you might monitor drift in the input text's sentiment, while for a human pose estimation model you might add integrity checks on the predicted body length.
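To make this concrete, here is a toy sketch of the kind of custom statistical drift check described above (not UpTrain's actual API): compare a production statistic, such as the mean input-text sentiment score, against the training reference using a z-test on the mean.

```python
# Toy drift check: flag when a production statistic (e.g. mean input
# sentiment) drifts far from the training distribution. The statistic
# and threshold are illustrative, not UpTrain's API.
import statistics

def drift_alert(train_values, prod_values, z_threshold=3.0):
    """True if the production mean lies beyond z_threshold standard
    errors of the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    prod_mu = statistics.mean(prod_values)
    z = abs(prod_mu - mu) / (sigma / len(prod_values) ** 0.5)
    return z > z_threshold

train = [0.1, 0.2, 0.15, 0.05, 0.12, 0.18, 0.09, 0.11]
assert not drift_alert(train, [0.1, 0.15, 0.12, 0.14])  # similar data
assert drift_alert(train, [0.9, 0.95, 0.85, 0.92])      # shifted data
```

In practice the measure would be domain-specific (sentiment drift, predicted body length, etc.), which is exactly why customizability is the core of the design.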

Additionally, we monitor for edge cases defined as rule-based smart signals on the model input. Whenever UpTrain sees a distribution shift or an increased frequency of edge cases, it raises an alert while identifying the subset of data that experienced these issues. Finally, it retrains the model on that data, improving its performance in the wild.
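A rule-based edge-case signal of the kind described above might look like the following sketch (the rules and field names are hypothetical, chosen only to illustrate the pattern of collecting flagged inputs for retraining):

```python
# Hypothetical rule-based "edge case" signal on model inputs.
def is_edge_case(inputs):
    """Flag inputs outside the conditions seen in training."""
    rules = [
        inputs["text_len"] > 512,            # unusually long input
        inputs["lang"] not in {"en", "es"},  # language unseen in training
    ]
    return any(rules)

logged, edge_cases = [], []
for row in [{"text_len": 40, "lang": "en"},
            {"text_len": 900, "lang": "fr"}]:
    (edge_cases if is_edge_case(row) else logged).append(row)

assert len(edge_cases) == 1  # the long French input is set aside for retraining
```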

Before UpTrain, we explored many observability tools at previous companies (Bytedance, Meta, and Bosch), but always got stuck figuring out what issues our models were facing in production. We used to go through user reviews, find patterns around model failures and manually retrain our models. This was time-consuming and opaque. Customizing our monitoring metrics and having a solution built specifically for ML models was a big need that wasn’t fulfilled.

Additionally, many ML models operate on user-sensitive data, and we didn't want to send users' private data to third parties. From a privacy perspective, relying on third-party hosted solutions just felt wrong, which motivated us to create an open-source, self-hosted alternative.

We are building UpTrain to make model monitoring effortless. With a single-line integration, our toolkit lets you detect dips in model performance via real-time dashboards, sends you Slack alerts, helps you pinpoint poor-performing cohorts, and more. UpTrain is built specifically for ML use cases, providing tools to monitor data distribution shifts, identify production data points with low representation in the training data, and visualize and detect drift in embeddings. For more about our key features, see https://docs.uptrain.ai/docs/key-features

Our tool is available as a Python package that can be installed on top of your deployment infrastructure (AWS, GCP, Azure). Since ML models operate on user-sensitive data, and sharing it with external servers is often a barrier to using third-party tools, we focus on deploying to your own cloud.

We’ve launched this repo under an Apache 2.0 license to make it easy for individual developers to integrate it into their production app. For monetization, we plan to build enterprise-level integrations that will include managed service and support. In the next few months, we plan to add more advanced observability measures for large language models and generative AI, as well as make UpTrain easier to integrate with other tools like Weights and Biases, Databricks, Kubernetes, and Airflow.

We would love for you to try out our GitHub repo and give your feedback, and we look forward to all of your comments!



Do you plan to add data management too? Those are among the biggest features offered by your competitors like Weights & Biases. Having a place to dump and load a few hundred gigabytes of data is very important because many on-demand cloud compute services don't offer persistence. Most ML training at scale isn't done in Colab notebooks beyond initial prototyping because it's too expensive. Dealing with a cluster of servers and running Jupyter on them is already annoying enough, so having data management abstracted away makes life a lot easier.

https://wandb.ai/site/artifacts

Make sure to talk to your users while building this. Some platforms didn't, for example

https://docs.grid.ai/features/datastores

Grid/Lightning's data management is half-baked. They only allow mounting one set of data per instance, which is close to useless for any training beyond the most simplistic of applications, because most data aren't nicely cleaned. You often have to bring together disparate sets of data for multi-modal applications.


Thanks for the question! Our initial focus is on finding the most relevant data points from those hundreds of gigabytes to retrain the model on. Our current data management strategy is pretty primitive: either local files, or we connect back to your data warehouse for persistence.

Soon, we plan to add data management features too, primarily on the production side, so that data scientists can safely and securely version the data their AI application encountered in production and use it to refine their model (if allowed).


Thanks for the suggestion and links. Completely agree: ML production data management can be painful, and to support model refinement for users who operate at scale, an abstraction at the data layer would be a useful feature.


Congrats on the launch, looks like a cool product! Just scanned the docs, so not super sure if my specific use case is supported.

I previously worked on a content recommendation system for academic users. We often wanted to go back and look through specific user sessions to see if the recommendations made sense in the context of their activity. So, ground truth data was kind of available, but only at a later time.

Is this kind of post-hoc analysis in your product scope? Looking at the code examples, it seems like you have to provide ground-truth data at inference time?


Thanks! So, providing ground truth is optional (and can be added at a later time). For each logged input, the tool returns an identifier which you can use to attach the ground truth (or any other information your custom monitors need). Once provided, the tool runs the relevant checks on those cases and alerts if any issues are found.
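The identifier-based pattern described above can be sketched roughly like this (class and method names are illustrative, not UpTrain's actual interface):

```python
# Illustrative pattern: each logged prediction gets an id, and the
# ground-truth label can be attached later under the same id.
import uuid

class PredictionLog:
    def __init__(self):
        self.store = {}

    def log_prediction(self, inputs, prediction):
        ident = str(uuid.uuid4())
        self.store[ident] = {"inputs": inputs, "pred": prediction, "gt": None}
        return ident  # caller keeps this to attach ground truth later

    def attach_ground_truth(self, ident, gt):
        self.store[ident]["gt"] = gt

    def accuracy(self):
        # Only score cases whose ground truth has arrived.
        scored = [r for r in self.store.values() if r["gt"] is not None]
        return sum(r["pred"] == r["gt"] for r in scored) / len(scored)

log = PredictionLog()
i1 = log.log_prediction({"doc": 1}, "cat")
i2 = log.log_prediction({"doc": 2}, "dog")
log.attach_ground_truth(i1, "cat")  # labels arrive post hoc
log.attach_ground_truth(i2, "cat")
assert log.accuracy() == 0.5
```

This is exactly the shape that supports post-hoc session analysis: log at inference time, score once labels become available.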


Interesting thought. We ran into the same issue working on a recommendation engine and previously tried to build a solution around it. Curious to know what's driving your interest in post-hoc analysis?


Also, happy to share our learnings from working with a social media customer. For them, the key motivation was to understand where the model was failing and hence how to improve it. They started with offline experiments focused on improving AUC, but that curve saturates pretty quickly: an incremental model improvement may not move offline metrics sizeably, yet it can impact retention of a certain user group and hence overall revenue.

They are using us to get insights from online experimentation. They have defined custom measures to monitor the distribution of model outputs and can detect differences in the model's performance much more effectively than by relying on changes in business metrics like retention or revenue. This helps them find poor-performing cohorts and roll out model improvements in weeks, not months. Would also love to hear what issues you ran into!


Oh, to be specific, the platform was oriented towards test practice and our objective was to recommend questions in a sequence.

One good strategy that correlated with session length for us was asking questions that were neither too difficult nor too easy, based on what we knew of the user's level at that instant. The post-hoc analysis was really meant to dig into multiple user sessions, see whether the current method was working, and evaluate counterfactual strategies.

I imagine the UpTrain product could help us segment user cohorts, find out which ones aren't performing super well, etc. Would love to hear what you ended up building too!


I have seen a lot of observability solutions, and they don't seem to work for deep learning models. Can you explain why your approach will work, say from a language model perspective?


Yes, general observability tools don't work well for ML applications, as they lack support for ML-specific needs such as attaching ground-truth labels, detecting data drift, measuring model bias, etc.

Beyond that, the requirements for deep learning models are even more nuanced. For language models, say, we provide two key features for effective monitoring:

1. We represent the text by an embedding (e.g. from BERT), which is much more informative from a statistical-distribution perspective for finding edge cases, low-density regions, etc. We then use the Earth Mover's Distance to quantify data drift in the multi-dimensional space.

2. We allow user-defined smart signals to be written on top of your model's inputs/outputs. You can classify a prediction as wrong if it doesn't follow grammar rules, contains certain keywords, or is followed by a particular user-behaviour pattern (if a user is not satisfied with ChatGPT's response, you'd expect them to ask the same question again in different ways). All these customisations serve as a good proxy for observing the model's performance and finding avenues to improve it.
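A minimal sketch of both ideas, with toy stand-ins: (1) a one-dimensional Earth Mover's Distance between embedding projections as a drift score (for equal-sized sorted samples, 1-D EMD reduces to the mean absolute difference between order statistics), and (2) a keyword-based smart signal on model output. The keyword list and thresholds are illustrative only.

```python
# (1) 1-D Earth Mover's Distance between two equal-sized samples,
# e.g. projections of text embeddings onto one dimension.
def emd_1d(a, b):
    a, b = sorted(a), sorted(b)
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

train_proj = [0.1, 0.2, 0.3, 0.4]
prod_proj = [1.1, 1.2, 1.3, 1.4]
assert emd_1d(train_proj, train_proj) == 0.0
assert abs(emd_1d(train_proj, prod_proj) - 1.0) < 1e-9  # shifted by 1.0

# (2) A user-defined smart signal: flag likely-bad generations
# by keyword occurrence. Keywords are hypothetical examples.
BAD_KEYWORDS = {"as an ai language model", "i cannot"}

def smart_signal(prediction: str) -> bool:
    text = prediction.lower()
    return any(k in text for k in BAD_KEYWORDS)

assert smart_signal("As an AI language model, I cannot answer.")
assert not smart_signal("The capital of France is Paris.")
```

Real deployments would compute EMD in the full embedding space rather than a 1-D projection, but the drift-scoring idea is the same.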


General question for the MLOps community: I usually see tools like this launch all the time. I'm eager to experiment, but I find my use cases never fit, and I end up just building a simple internal tool that does the job. Usually because the dataset or model is too unique.

Is it just me?


Generally, MLOps helps reduce engineering headaches. During our user interviews and customer calls, we realized very early that customization is key for ML model monitoring, since all models are different. So we built the framework to lessen the engineering headache while preserving customizability (think PyTorch). Would love to know your thoughts on this.


Can you describe your use case?

We also faced the same problem with other tools, and hence are building UpTrain with customisation at its core. It would be interesting to see if your use case fits.


How do you compare to a platform like Arize? https://arize.com/


Being open source is a key differentiator: UpTrain can be easily customized for any specific use case.

With UpTrain, you can define custom measures to monitor, add custom algorithms for model stability or drift detection, and fill in any integration gaps when using us in production.


Additionally, refinement is a key focus of ours. Figuring out the best data points to retrain the model on has twin benefits:

1) It provides automated issue resolution and saves data scientists the effort of debugging and fixing their models.

2) It reduces false positives in alerting: we send alerts only when we see a dip in model performance or when retraining can lead to improved model accuracy.


Awesome! Big fan of OS -- Arize is powerful yet expensive, so I think there's a big market there. Alerting is super tough to get right, and false positives are often worse than no alerting at all. In ML it's even harder because "data looks weird" is like 90% of the bugs.

Anyway, congrats! Excited to see where you go with this.


Thanks! Also, wondering how you heard about Arize? Have you dealt with the pain of ML model monitoring in the past?


Yeah, so we used it and built some custom solutions at Stitch Fix. Reach out to my co-founder Stefan (also in YC '23) -- he'll have some insight for you.


Thanks! Reaching out to Stefan


How does your product identify specific feature combinations for deep learning models - let's say instance segmentation?


There are two ways:

1. We use model-inferred embeddings. For the instance segmentation task, say, we use deep learning networks to transform the input image into a dense embedding representation, on top of which we run clustering and density estimation to determine whether a given embedding/image/feature combination is an outlier (or belongs to a low-density region).

2. We allow users to define custom signals to identify edge cases specific to their use case. A very simple example could be to calculate brightness or hue properties of the input image and check whether it is an outlier relative to the training distribution.
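A toy version of check (2), assuming a nested list stands in for pixel data (real deployments would use numpy arrays and a richer reference distribution):

```python
# Custom signal sketch: mean brightness of an input image, flagged as
# an edge case if it falls outside the range seen in training.
def mean_brightness(image):
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

# Brightness range observed over (tiny, illustrative) training images.
train_brightness = [mean_brightness(img) for img in [
    [[100, 110], [120, 130]],
    [[90, 105], [115, 125]],
]]
lo, hi = min(train_brightness), max(train_brightness)

night_image = [[5, 8], [3, 6]]  # far darker than anything in training
assert not lo <= mean_brightness(night_image) <= hi  # flagged as outlier
```

The embedding-based check (1) follows the same shape, with cluster distances or density estimates replacing the brightness statistic.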


Just curious, what kind of use cases do you have in mind?


Congrats on the launch, this looks great, forwarding this to my ML engineer friends!



Thanks for letting us know. We did some docs restructuring before the launch, and missed fixing this link. It is now available here: https://docs.uptrain.ai/docs/uptrain-examples/quickstart-tut...


Very cool product and best of all you made it open source to remove the barrier of installing it in someone’s own environment.


Thanks a lot! Yes, we are big believers of open-source :)


Excited to see more people building in this space. From what we've seen with customers, it's critical to be able to compare what you're seeing in production to what you trained on (rather than a historical period). That's almost the textbook definition of drift. Do you have a sense of how to approach that?

At Comet.com (disclaimer: I'm the CEO/co-founder) we provide experiment tracking and artifact management, so we have the training distributions for comparison. I'm always curious what it looks like for a monitoring-only solution.


Completely agree! We have also seen our users more concerned with comparing the real-world distribution against the training data than against the previous month's data (we found the latter more useful for PMs and for setting alerts).

We currently let users specify their training data in the config used to initialise the UpTrain framework (as a JSON file for now; we plan to support PyTorch/TF data loaders). In the background, the tool does the binning and clustering to convert continuous variables into discrete buckets, then calculates a divergence between the bucket distributions to quantify drift.
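The binning-plus-divergence step can be sketched roughly as follows (bin edges, data, and the choice of KL divergence are all illustrative; the real tool may use different bucketing and divergence measures):

```python
# Bucket a continuous variable using training-derived bin edges, then
# score drift with KL divergence between the bucket histograms.
import math

def histogram(values, edges):
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            in_last = (i == len(counts) - 1 and v == edges[-1])
            if edges[i] <= v < edges[i + 1] or in_last:
                counts[i] += 1
                break
    eps = 1e-6  # smooth away zero probabilities
    total = len(values) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # derived from training data
train = histogram([0.1, 0.3, 0.6, 0.9, 0.4, 0.7], edges)
prod = histogram([0.9, 0.95, 0.85, 0.8, 0.99, 0.77], edges)  # shifted high

# Identical distributions score ~0; the shifted one scores higher.
assert kl_divergence(prod, train) > kl_divergence(train, train)
```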


Thanks for the very relevant comment :) We let users attach their training data from CSV/JSON (and are working to support loading from cloud storage providers and data lakes). We have illustrated this in some of our examples, such as human orientation classification: https://github.com/uptrain-ai/uptrain/blob/main/examples/hum...



