Launch HN: UpTrain (YC W23) – Open-source performance monitoring for ML models
138 points by vvipgupta on March 8, 2023 | hide | past | favorite | 31 comments
Hello, we are Shikha, Sourabh, and Vipul - co-founders at UpTrain, an open-source ML observability toolkit. UpTrain helps you monitor the performance of your machine learning applications, alerts you when they go wrong, and helps you improve them by narrowing down on data points to retrain on, all in the same loop.

Our website is at: https://uptrain.ai/ and our Github is here: https://github.com/uptrain-ai/uptrain

ML models tend to perform poorly when presented with new, previously unseen cases, and their performance also deteriorates over time as real-world environments evolve, which can degrade business metrics. In fact, one of our customers (a social media platform with 150 million MAU) was tired of discovering model issues via customer complaints (and increased churn) and wanted an observability solution to identify them proactively.

UpTrain monitors the difference between the dataset the model was trained on and the real-world data it encounters during production (the wild!). This "difference" can be captured by custom statistical measures designed by ML practitioners for their use case. That customization matters because, in most cases, there's no "ground truth" against which to check whether a model's output is correct. Instead, you need statistical measures to surface drift or performance degradation, and those require domain expertise and differ from case to case. For example, for a text summarization model you might monitor drift in the input text's sentiment, while for a human pose estimation model you might add integrity checks on the predicted body length.
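To make this concrete, here is a toy sketch of the kind of custom statistical drift check described above (not UpTrain's actual API): compare a production statistic, such as the mean input-text sentiment score, against the training reference using a z-test on the mean.

```python
# Toy drift check: flag when a production statistic (e.g. mean input
# sentiment) drifts far from the training distribution. The statistic
# and threshold are illustrative, not UpTrain's API.
import statistics

def drift_alert(train_values, prod_values, z_threshold=3.0):
    """True if the production mean lies beyond z_threshold standard
    errors of the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    prod_mu = statistics.mean(prod_values)
    z = abs(prod_mu - mu) / (sigma / len(prod_values) ** 0.5)
    return z > z_threshold

train = [0.1, 0.2, 0.15, 0.05, 0.12, 0.18, 0.09, 0.11]
assert not drift_alert(train, [0.1, 0.15, 0.12, 0.14])  # similar data
assert drift_alert(train, [0.9, 0.95, 0.85, 0.92])      # shifted data
```

In practice the measure would be domain-specific (sentiment drift, predicted body length, etc.), which is exactly why customizability is the core of the design.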

Additionally, we monitor for edge cases defined as rule-based smart signals on the model input. Whenever UpTrain sees a distribution shift or an increased frequency of edge cases, it raises an alert while identifying the subset of data that experienced these issues. Finally, it retrains the model on that data, improving its performance in the wild.
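A rule-based edge-case signal of the kind described above might look like the following sketch (the rules and field names are hypothetical, chosen only to illustrate the pattern of collecting flagged inputs for retraining):

```python
# Hypothetical rule-based "edge case" signal on model inputs.
def is_edge_case(inputs):
    """Flag inputs outside the conditions seen in training."""
    rules = [
        inputs["text_len"] > 512,            # unusually long input
        inputs["lang"] not in {"en", "es"},  # language unseen in training
    ]
    return any(rules)

logged, edge_cases = [], []
for row in [{"text_len": 40, "lang": "en"},
            {"text_len": 900, "lang": "fr"}]:
    (edge_cases if is_edge_case(row) else logged).append(row)

assert len(edge_cases) == 1  # the long French input is set aside for retraining
```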

Before UpTrain, we explored many observability tools at previous companies (Bytedance, Meta, and Bosch), but always got stuck figuring out what issues our models were facing in production. We used to go through user reviews, find patterns around model failures and manually retrain our models. This was time-consuming and opaque. Customizing our monitoring metrics and having a solution built specifically for ML models was a big need that wasn’t fulfilled.

Additionally, many ML models operate on user-sensitive data, and we didn't want to send users' private data to third parties. From a privacy perspective, relying on third-party hosted solutions just felt wrong, which motivated us to create an open-source, self-hosted alternative.

We are building UpTrain to make model monitoring effortless. With a single-line integration, our toolkit lets you detect dips in model performance via real-time dashboards, sends you Slack alerts, helps you pinpoint poor-performing cohorts, and more. UpTrain is built specifically for ML use cases, providing tools to monitor data distribution shifts, identify production data points with low representation in the training data, and visualize and detect drift in embeddings. For more about our key features, see https://docs.uptrain.ai/docs/key-features

Our tool is available as a Python package that can be installed on top of your deployment infrastructure (AWS, GCP, Azure). Since ML models operate on user-sensitive data, and sharing it with external servers is often a barrier to using third-party tools, we focus on deploying to your own cloud.

We’ve launched this repo under an Apache 2.0 license to make it easy for individual developers to integrate it into their production app. For monetization, we plan to build enterprise-level integrations that will include managed service and support. In the next few months, we plan to add more advanced observability measures for large language models and generative AI, as well as make UpTrain easier to integrate with other tools like Weights and Biases, Databricks, Kubernetes, and Airflow.

We would love for you to try out our GitHub repo and give your feedback, and we look forward to all of your comments!



Do you plan to add data management too? Those are among the biggest features offered by your competitors like Weights & Biases. Having a place to dump and load a few hundred gigabytes of data is very important because many on-demand cloud compute services don't offer persistence. Most ML training at scale isn't done in Colab notebooks beyond initial prototyping because it's too expensive. Dealing with a cluster of servers and running Jupyter on them is already annoying enough, so having data management abstracted away makes life a lot easier.

https://wandb.ai/site/artifacts

Make sure to talk to your users while building this. Some platforms didn't, for example

https://docs.grid.ai/features/datastores

Grid/Lightning's data management is half-baked. They only allow mounting one set of data per instance, which is close to useless for any training beyond the most simplistic of applications, because most data aren't nicely cleaned. You often have to bring together disparate sets of data for multi-modal applications.


Thanks for the question! Our initial focus is on finding the most relevant data points from those hundreds of gigabytes to retrain the model on. Our current data management strategy is pretty primitive: either local files, or we connect back to your data warehouse for persistence.

Soon, we plan to add data management features too, primarily on the production side, so that data scientists can safely and securely version the data their AI application encountered in production and use it to refine their model (if allowed).


Thanks for the suggestion and links. Completely agree: ML production data management can be painful, and to support model refinement for users who operate at scale, an abstraction at the data layer would be a useful feature.


Congrats on the launch, looks like a cool product! Just scanned the docs, so not super sure if my specific use case is supported.

I previously worked on a content recommendation system for academic users. We often wanted to go back and look through specific user sessions to see if the recommendations made sense in the context of their activity. So, ground truth data was kind of available, but only at a later time.

Is this kind of post-hoc analysis in your product scope? Looking at the code examples, it seems like you have to provide ground-truth data at inference time?


Thanks! So, providing ground truth is optional (and can be added at a later time). For each logged input, the tool returns an identifier which you can use to attach the ground truth (or any other information your custom monitors need). Once provided, the tool runs the relevant checks on those cases and alerts if any issues are found.
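The identifier-based pattern described above can be sketched roughly like this (class and method names are illustrative, not UpTrain's actual interface):

```python
# Illustrative pattern: each logged prediction gets an id, and the
# ground-truth label can be attached later under the same id.
import uuid

class PredictionLog:
    def __init__(self):
        self.store = {}

    def log_prediction(self, inputs, prediction):
        ident = str(uuid.uuid4())
        self.store[ident] = {"inputs": inputs, "pred": prediction, "gt": None}
        return ident  # caller keeps this to attach ground truth later

    def attach_ground_truth(self, ident, gt):
        self.store[ident]["gt"] = gt

    def accuracy(self):
        # Only score cases whose ground truth has arrived.
        scored = [r for r in self.store.values() if r["gt"] is not None]
        return sum(r["pred"] == r["gt"] for r in scored) / len(scored)

log = PredictionLog()
i1 = log.log_prediction({"doc": 1}, "cat")
i2 = log.log_prediction({"doc": 2}, "dog")
log.attach_ground_truth(i1, "cat")  # labels arrive post hoc
log.attach_ground_truth(i2, "cat")
assert log.accuracy() == 0.5
```

This is exactly the shape that supports post-hoc session analysis: log at inference time, score once labels become available.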


Interesting thought. We ran into the same issue working on a recommendation engine and previously tried to build a solution around it. Curious to know what's driving your interest in post-hoc analysis?


Also, happy to share our learnings from working with a social media customer. For them, the key motivation was to understand where the model was failing and hence how to improve it. They started with offline experiments focused on improving AUC, but that curve saturates pretty quickly: an incremental model improvement may not move offline metrics sizeably, yet it can impact retention of a certain user group and hence overall revenue.

They are using us to get insights from online experimentation. They have defined custom measures to monitor the distribution of model outputs and can detect differences in the model's performance much more effectively than by relying on changes in business metrics like retention or revenue. This helps them find poor-performing cohorts and roll out model improvements in weeks, not months. Would also love to hear what issues you ran into!


Oh, to be specific, the platform was oriented towards test practice and our objective was to recommend questions in a sequence.

One good strategy that correlated with session length for us was asking questions that were neither too difficult nor too easy, based on what we knew of the user's level at that instant. The post-hoc analysis was really meant to dig into multiple user sessions, see whether the current method was working, and evaluate counterfactual strategies.

I imagine the UpTrain product could help us segment user cohorts, find out which ones aren't performing super well, etc. Would love to hear what you ended up building too!


I have seen a lot of observability solutions, and they don't seem to work for deep learning models. Can you explain why your approach will work, say from a language model perspective?


Yes, general observability tools don't work well for ML applications, as they lack support for ML-specific needs such as attaching ground-truth labels, detecting data drift, measuring model bias, etc.

Beyond that, the requirements for deep learning models are even more nuanced. For language models, say, we provide two key features for effective monitoring:

1. We represent the text by an embedding (e.g. from BERT), which is much more informative from a statistical-distribution perspective for finding edge cases, low-density regions, etc. We then use the Earth Mover's Distance to quantify data drift in the multi-dimensional space.

2. We allow user-defined smart signals to be written on top of your model's inputs/outputs. You can classify a prediction as wrong if it doesn't follow grammar rules, contains certain keywords, or is followed by a particular user-behaviour pattern (if a user is not satisfied with ChatGPT's response, you'd expect them to ask the same question again in different ways). All these customisations serve as a good proxy for observing the model's performance and finding avenues to improve it.
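A minimal sketch of both ideas, with toy stand-ins: (1) a one-dimensional Earth Mover's Distance between embedding projections as a drift score (for equal-sized sorted samples, 1-D EMD reduces to the mean absolute difference between order statistics), and (2) a keyword-based smart signal on model output. The keyword list and thresholds are illustrative only.

```python
# (1) 1-D Earth Mover's Distance between two equal-sized samples,
# e.g. projections of text embeddings onto one dimension.
def emd_1d(a, b):
    a, b = sorted(a), sorted(b)
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

train_proj = [0.1, 0.2, 0.3, 0.4]
prod_proj = [1.1, 1.2, 1.3, 1.4]
assert emd_1d(train_proj, train_proj) == 0.0
assert abs(emd_1d(train_proj, prod_proj) - 1.0) < 1e-9  # shifted by 1.0

# (2) A user-defined smart signal: flag likely-bad generations
# by keyword occurrence. Keywords are hypothetical examples.
BAD_KEYWORDS = {"as an ai language model", "i cannot"}

def smart_signal(prediction: str) -> bool:
    text = prediction.lower()
    return any(k in text for k in BAD_KEYWORDS)

assert smart_signal("As an AI language model, I cannot answer.")
assert not smart_signal("The capital of France is Paris.")
```

Real deployments would compute EMD in the full embedding space rather than a 1-D projection, but the drift-scoring idea is the same.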


General question for the MLOps community: I usually see tools like this launch all the time. I'm eager to experiment, but I find my use cases never fit, and I end up just building a simple internal tool that does the job. Usually because the dataset or model is too unique.

Is it just me?


Generally, MLOps helps reduce engineering headaches. During our user interviews and customer calls, we realized very early that customization is key for ML model monitoring, since all models are different. So we built the framework to lessen the engineering headache while preserving customizability (think PyTorch). Would love to know your thoughts on this.


Can you describe your use case?

We also faced the same problem with other tools, and hence are building UpTrain with customisation at its core. It would be interesting to see if your use case fits.


How do you compare to a platform like Arize? https://arize.com/


Being open source is a key differentiator: UpTrain can be easily customized for any specific use case.

With UpTrain, you can define custom measures to monitor, add custom algorithms for model stability or drift detection, and fill in any integration gaps when using us in production.


Additionally, refinement is a key focus of ours. Figuring out the best data points to retrain the model on has twin benefits:

1) It provides automated issue resolution and saves data scientists the effort of debugging and fixing their models.

2) It reduces false positives in alerting: we send alerts only when we see a dip in model performance or when retraining can lead to improved model accuracy.


Awesome! Big fan of OS -- Arize is powerful yet expensive, so I think there's a big market there. Alerting is super tough to get right, and false positives are often worse than no alerting at all. In ML it's even harder because "data looks weird" is like 90% of the bugs.

Anyway, congrats! Excited to see where you go with this.


Thanks! Also, wondering how you heard about Arize? Have you dealt with the pain of ML model monitoring in the past?


Yeah, so we used it and built some custom solutions at Stitch Fix. Reach out to my co-founder Stefan (also in YC '23) -- he'll have some insight for you.


Thanks! Reaching out to Stefan


How does your product identify specific feature combinations for deep learning models - let's say instance segmentation?


There are two ways:

1. We use model-inferred embeddings. For the instance segmentation task, say, we use deep learning networks to transform the input image into a dense embedding representation, on top of which we run clustering and density estimation to determine whether a given embedding/image/feature combination is an outlier (or belongs to a low-density region).

2. We allow users to define custom signals to identify edge cases specific to their use case. A very simple example could be to calculate brightness or hue properties of the input image and check whether it is an outlier relative to the training distribution.
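A toy version of check (2), assuming a nested list stands in for pixel data (real deployments would use numpy arrays and a richer reference distribution):

```python
# Custom signal sketch: mean brightness of an input image, flagged as
# an edge case if it falls outside the range seen in training.
def mean_brightness(image):
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

# Brightness range observed over (tiny, illustrative) training images.
train_brightness = [mean_brightness(img) for img in [
    [[100, 110], [120, 130]],
    [[90, 105], [115, 125]],
]]
lo, hi = min(train_brightness), max(train_brightness)

night_image = [[5, 8], [3, 6]]  # far darker than anything in training
assert not lo <= mean_brightness(night_image) <= hi  # flagged as outlier
```

The embedding-based check (1) follows the same shape, with cluster distances or density estimates replacing the brightness statistic.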


Just curious, what kind of use cases do you have in mind?


Congrats on the launch, this looks great, forwarding this to my ML engineer friends!



Thanks for letting us know. We did some docs restructuring before the launch, and missed fixing this link. It is now available here: https://docs.uptrain.ai/docs/uptrain-examples/quickstart-tut...


Very cool product and best of all you made it open source to remove the barrier of installing it in someone’s own environment.


Thanks a lot! Yes, we are big believers of open-source :)


Excited to see more people building in this space. From what we've seen with customers, it's critical to be able to compare what you're seeing in production to what you trained on (rather than a historical period). That's almost the textbook definition of drift. Do you have a sense of how to approach that?

At Comet.com (disclaimer: I'm the CEO/co-founder) we provide experiment tracking and artifact management, so we have the training distributions for comparison. I'm always curious what it looks like for a monitoring-only solution.


Completely agree! We have also seen our users more concerned with comparing the real-world distribution against the training data than against the previous month's data (we found the latter more useful for PMs and for setting alerts).

We currently let users specify their training data in the config used to initialise the UpTrain framework (as a JSON file for now; we plan to support PyTorch/TF data loaders). In the background, the tool does the binning and clustering to convert continuous variables into discrete buckets, then calculates a divergence between the bucket distributions to quantify drift.
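The binning-plus-divergence step can be sketched roughly as follows (bin edges, data, and the choice of KL divergence are all illustrative; the real tool may use different bucketing and divergence measures):

```python
# Bucket a continuous variable using training-derived bin edges, then
# score drift with KL divergence between the bucket histograms.
import math

def histogram(values, edges):
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            in_last = (i == len(counts) - 1 and v == edges[-1])
            if edges[i] <= v < edges[i + 1] or in_last:
                counts[i] += 1
                break
    eps = 1e-6  # smooth away zero probabilities
    total = len(values) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # derived from training data
train = histogram([0.1, 0.3, 0.6, 0.9, 0.4, 0.7], edges)
prod = histogram([0.9, 0.95, 0.85, 0.8, 0.99, 0.77], edges)  # shifted high

# Identical distributions score ~0; the shifted one scores higher.
assert kl_divergence(prod, train) > kl_divergence(train, train)
```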


Thanks for the very relevant comment :) We let users attach their training data from CSV/JSON (and are working to support loading from cloud storage providers and data lakes). We have illustrated this in some of our examples, such as human orientation classification: https://github.com/uptrain-ai/uptrain/blob/main/examples/hum...



