Launch HN: Lightly (YC S21): Label only the data which improves your ML model

citilife · on Aug 9, 2021

Having built a model to identify sensitive data having a solid data labeling solution would be awesome. I can attest this is a real problem. Here's the library we built:

https://github.com/capitalone/DataProfiler

In this space, Prodigy really dominates:

https://prodi.gy/

We actually built our own internal system which integrates and can export the labels (does predictive labeling, etc). Of course, we only focused on text data at the moment.

All that being said, this is going to become a crowded space and highly competitive. Plus, once the data is labeled companies often drop their labelers. I would recommend ensuring some consistent use. Potentially, hosting their models off-prem or something to lock companies in.

isusmelj · on Aug 9, 2021

The library looks great!

Prodi.gy is great but focuses heavily on NLP and speeding up the labeling process. Our goal is really to help you reduce what you label before you use any labeling tool.

We are working with labeling tools as well as providers to streamline the workflow.

tracyhenry · on Aug 10, 2021

Doesn't Prodi.gy also claim to do active learning, which essentially reduce the instances to label too?

Haven't used Prodi.gy so don't know how its active learning algo works. May you share the difference?

isusmelj · on Aug 10, 2021

Most active learning frameworks just use model predictions to find the images where the model has the lowest confidence and ignore the image diversity aspect. E.g. the model struggles with bicycles at night. The problem with this approach is that you might end up adding many new images to your labeling pipeline that are very similar to each other.

However, with Lightly you can additionally make sure you only select images that are visually different from each other. And you always get visual feedback on the selected data in our web platform. The additional control and feedback mechanism allow you to have a more focused workflow.

dpaleka · on Aug 9, 2021

How do models trained with Lightly compare with other approaches wrt adversarial robustness?

Can using Lightly introduce additional bias in the model, since only a select few of inputs are being labeled? This may be a concern for publicity purposes.

By the way, I thought ETH spinoff requirements were incompatible with YC requirements - nice to see it can be made to work.

isusmelj · on Aug 9, 2021

Thanks for the interest and great questions. Responses are below:

>How do models trained with Lightly compare with other approaches wrt adversarial robustness?

We have no benchmark available. Both approaches can be combined. You can use Lightly to pick a diverse subset, label it and then during training/ evaluating the model check for adversarial robustness and re-iterate.

>Can using Lightly introduce additional bias in the model, since only a select few of inputs are being labeled? This may be a concern for publicity purposes.

If we remove bias we automatically introduce bias. BUT we want the introduced bias to be controlled and known.

Bias typically comes from the way we collect data. For example, more data is being collected during the day than during night for autonomous driving. We also have more data collected during sunny weather than rain or snow. We also have more data from cities like San Fransisco than cities like New Mexico. Most of our datasets are biased.

> By the way, I thought ETH spinoff requirements were incompatible with YC requirements - nice to see it can be made to work.

From what we know we are the first ETH spin-off who is part of the YC program. we hope they don't abandon us.

zby · on Aug 10, 2021

An obvious trick to speed up supervised learning is to label and import into the training set only the images for which the model makes wrong predictions. So for most of the images the human only needs to approve the automatic predictions - and from time to time he needs to label them.

Are there any libraries to facilitate a workflow like this?

philippmwirth · on Aug 10, 2021

We are currently working to support exactly this workflow with Lightly. The biggest challenge is to quickly and reliably find the images with wrong predictions. To tackle this, Lightly can leverage the strong representations from contrastive learning.

For example: A simple workflow for a classification task would be to train a self-supervised model on the whole dataset and find samples with a different annotation than their nearest neighbors. These can be identified quickly either in a colored scatter plot or by simply measuring disagreement.

zby · on Aug 11, 2021

Any plans for a Tensorflow support?

isusmelj · on Aug 11, 2021

All the active learning features and interaction with the platform already works with different frameworks such as Keras, Tensorflow or Jax. The part for training self-supervised learning is currently only available for PyTorch. We don't have focused yet on bringing it to Tensorflow. But it's definitely something we should look at!

RicoElectrico · on Aug 9, 2021

How does it differentiate from modAL? I think at a glance they try to achieve roughly the same thing: give human labelers only the datapoints most relevant to the problem at hand.

https://github.com/modAL-python/modAL

MalteEbner · on Aug 9, 2021

modAL indeed has a similar goal of choosing the best subset of data to be labeled. However it has some notable differences:

It is built on scikit-learn which is also evident from the suggested workflow. Lightly on the other hand was specifically built for deep learning applications supporting active learning for classification but also object detection and semantic segmentation.

modAL provides uncertainty-based active learning. However, it has been shown that uncertainty-based AL fails at batch-wise AL for vision datasets and CNNs, see https://arxiv.org/abs/1708.00489. Furthermore it only works with an initially trained model and thus labeled dataset. Lightly offers self-supervised learning to learn high dimensional embeddings through its open-source package https://github.com/lightly-ai/lightly. They can be used through our API to choose a diverse subset. Optionally, this sampling can be combined with uncertainty-based AL.

RicoElectrico · on Aug 9, 2021

Thanks for the reply. In my case I have (hobby) problems that fall well into scikit-learn capabilities.

henning · on Aug 9, 2021

Since I don't do machine learning I don't know whether your product is good or not, but I automatically appreciate you calling this "machine learning" and not "AI".

isusmelj · on Aug 9, 2021

Thanks, we also try to avoid using AI as much as possible :)

NumberCruncher · on Aug 10, 2021

In the video you posted you say "I don't want to work with blurry images". Is that not an image a human could work with (drive), maybe by reducing the driving speed to 50-33% and having more time to inspect his surroundings?

isusmelj · on Aug 10, 2021

Yes, depending on the kind of data you want to work with you might want to explicitly get the blurry ones. That depends on the task of the ML model you train and requires domain expertise.

dest · on Aug 10, 2021

If the dataset mostly contains edge cases, model performance on the dataset is going to be poor, but it's not an issue I think.

But how could the real-world accuracy be computed? Is a separate dataset needed for that purpose?

isusmelj · on Aug 10, 2021

When learning ML at university one assumes that the data you have well represents the environment. We do the famous train/validation/test split and train our model.

However, in practice we see that it is very hard to collect a good dataset. There is a great twitter thread from Abubakar(CEO Gradio) about this topic: https://twitter.com/abidlabs/status/1423067498862219267

dest · on Aug 10, 2021

Thanks for your answer. I'm seeing a tweet, but not a thread. Is it expected?

isusmelj · on Aug 10, 2021

Yes, sorry. I meant he started a good conversation with his tweet.

wizwit999 · on Aug 10, 2021

I'll admit I'm not an expert in this area, but can't this introduce a pernicious sort of bias on the downstream model, since the input data is being curated by your technology?

isusmelj · on Aug 10, 2021

Whether you introduce or reduce bias using Lightly depends on how you use the software.

What we ideally want, is that the data we collect for training, validating, and testing our model represents the environment the trained model will operate in. However, collecting this data can be super difficult. E.g. for autonomous driving systems that would mean collecting data from every corner of the world, at every time of the day, during all kinds of weather conditions, in all 4 seasons...

However, this is a really difficult task. You will very likely end up having more data collected from one city (e.g. because your fleet is much bigger in city A than city B)

rmtescapist · on Aug 10, 2021

How does this compare to existing tools like Scale AI?

isusmelj · on Aug 10, 2021

Scale is one of the biggest companies in the data labeling space. They recently introduced Scale Nucleus which goes in a similar direction as Lightly. However, whereas Nucleus works well with already labeled datasets we designed Lightly from the beginning to focus on unlabeled data. With the combination of self-supervised learning and embedding visualization of Lightly, you can easily work with datasets where you don’t have model predictions or labels at hand. Note, that Lightly is also available on-prem which is a must for some of our customers since usually there is 100x more unlabeled data than labeled data.