Ask HN: How important is labeling for ML algorithms?
2 points by canaus on Jan 29, 2024 | 5 comments
Trying to understand the depth and necessity of labeling for an algorithm: specifically, having people or other algorithms label data and check those labels.

Do the data sets need to be labeled extensively every time an algorithm needs to be trained? How often does a new data set need to be labeled? Only when the data is novel?



Absolutely important. No labels = No model.

You have to label data once for every problem you want to solve; after that, you can apply any number of different classifiers to the same problem while reusing the same labels.
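
To make that concrete, here's a minimal sketch of reusing one labeled set across several classifiers; the data is synthetic and the models are arbitrary stand-ins, not anyone's real setup:

    # Sketch: label once, try many classifiers on the same labels.
    # Synthetic data stands in for a real labeled set.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, random_state=0)  # y: the labels you made once
    for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
        scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
        print(type(clf).__name__, round(scores.mean(), 3))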

My RSS reader YOShInOn does content-based recommendations and needs about 500-1,000 labels to make decent recommendations. With 8,000 I get an area under the curve (AUC) of around 78%; TikTok gets around 84% with vastly more investment as well as a lot more data (because it serves more users). One reason I haven't moved to productize it is that I think most people would expect to get results with 5-10 labels.
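
For anyone curious what that looks like mechanically, here's a minimal sketch of a content-based recommender framed as binary text classification and scored with AUC; the articles and labels are invented, and this is not YOShInOn's actual code:

    # Sketch: content-based recommendation as binary text classification,
    # scored with the AUC metric quoted above. Toy data, hypothetical labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    texts = [
        "new transformer architecture beats benchmarks",
        "celebrity spotted at restaurant",
        "rust compiler internals explained",
        "horoscope for the week ahead",
        "postgres query planner deep dive",
        "reality tv recap episode twelve",
        "building a homelab kubernetes cluster",
        "royal family gossip roundup",
        "profiling python with perf",
        "fashion trends for spring",
        "writing a toy operating system",
        "lottery numbers and lucky charms",
    ]
    labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = "I'd read this"

    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=4, stratify=labels, random_state=0)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))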

You can probably make about 2,000 judgements a day (I've done bursts of a lot more than that on images, and it makes my visual system malfunction, though I might be especially vulnerable to that), and my impression is that for most text classification tasks you don't really get better performance once you go past 10,000 labels or so. Everybody thinks you have to raise $20M from VCs and then hire a bunch of people in a third-world country to label things, but the reality is that people spend more time talking about a product they want to make than it takes to make enough labels yourself to train a decent model.
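
You can watch that plateau appear by training on progressively larger slices of a labeled pool; a sketch on synthetic data (where a real text task flattens out will differ):

    # Sketch: AUC vs. number of labels, to illustrate diminishing returns.
    # Synthetic data with some label noise so the curve flattens realistically.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=30000, n_informative=10,
                               flip_y=0.1, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=5000, random_state=0)
    for n in (500, 1000, 2000, 5000, 10000, 20000):
        clf = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        print(f"{n:>6} labels -> AUC {auc:.3f}")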

But there is a garbage-in, garbage-out problem. A recommender can't get 99% accuracy the way some other systems can, because people's preferences are fickle; on the other hand, a recommender with 78% AUC is pretty satisfying to use. If you need high accuracy, you need to do some ontology work so that your categories are very well defined. If the people doing the labelling can't agree 99% of the time, there is no way the model is going to get there.
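
One way to sanity-check that before training anything is to measure agreement between two labelers directly; a sketch using raw agreement and Cohen's kappa, with invented annotator labels:

    # Sketch: raw agreement and Cohen's kappa between two annotators.
    # If humans only agree ~80% of the time, don't expect 99% from a model.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
    raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
    print("raw agreement:", raw)
    print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))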


> but the reality is that people spend more time talking about a product they want to make than it takes to make enough labels yourself

And what might the validity of such self-made labels be, if I may ask?


Depends on what you are labeling. How accurately could you look at pictures of cars and tell marked police cars from regular cars? When I do them in a hurry, though, I fat-finger maybe 1-5% of them, so it takes some cleanup.
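
That cleanup can be semi-automated: get out-of-fold predictions via cross-validation and re-check the examples where the model confidently contradicts the stored label (the confident-learning idea behind tools like cleanlab). A rough sketch with deliberately injected noise:

    # Sketch: flag probable label errors via out-of-fold predictions.
    # We inject ~3% label noise, then look for confident disagreements.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, random_state=0)
    noisy = y.copy()
    flipped = rng.choice(len(y), size=60, replace=False)  # ~3% "fat-fingered"
    noisy[flipped] = 1 - noisy[flipped]

    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, noisy,
                              cv=5, method="predict_proba")[:, 1]
    # Confidently contradicted labels are the ones worth re-checking by hand.
    suspect = np.where(((noisy == 1) & (proba < 0.1)) |
                       ((noisy == 0) & (proba > 0.9)))[0]
    print("flagged:", len(suspect),
          "| truly flipped among them:", int(np.isin(suspect, flipped).sum()))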


Fair enough, and I suppose that's why, for instance, CAPTCHAs and such make us do silly free labor.

But generally, hilarious labeling errors are already widespread in benchmarks:

https://news.ycombinator.com/item?id=26628778

Companies also seem rather carefree with their labeling, even in contexts where accuracy is paramount:

https://news.ycombinator.com/item?id=38455338

And things get interesting when you start labeling alleged political bias, for instance:

https://news.ycombinator.com/item?id=35982799

Thus, in general, I'd appreciate research on the "accuracy" (i.e., inter-annotator agreement, etc.) of labels used in the wild.


How much more useful would it be to get multiple people to label data?



