Ask HN: How important is labeling for ML algorithms?
2 points by canaus on Jan 29, 2024 | 5 comments
Trying to understand the depth and necessity of labeling for an algorithm: specifically, having people or other algorithms label data and check those labels.

Do the data sets need to be labeled extensively every time an algorithm needs to be trained? How often does a new data set need to be labeled? Only when the data is novel?



Absolutely important. No labels = No model.

You have to label data once for every problem you want to solve; after that, you can apply any number of different classifiers to the same problem while reusing the same labels.
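
To make that concrete, here's a minimal sketch of reusing one labeled set across several classifiers; the data is synthetic and the models are arbitrary stand-ins, not anyone's real setup:

    # Sketch: label once, try many classifiers on the same labels.
    # Synthetic data stands in for a real labeled set.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, random_state=0)  # y: the labels you made once
    for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
        scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
        print(type(clf).__name__, round(scores.mean(), 3))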

My RSS reader YOShInOn does content-based recommendations and needs about 500-1,000 labels to make decent recommendations. With 8,000 I get an area under the curve (AUC) of around 78%; TikTok gets around 84% with vastly more investment as well as a lot more data (because it serves more users). One reason I haven't moved to productize it is that I think most people would expect to get results with 5-10 labels.
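
For anyone curious what that looks like mechanically, here's a minimal sketch of a content-based recommender framed as binary text classification and scored with AUC; the articles and labels are invented, and this is not YOShInOn's actual code:

    # Sketch: content-based recommendation as binary text classification,
    # scored with the AUC metric quoted above. Toy data, hypothetical labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    texts = [
        "new transformer architecture beats benchmarks",
        "celebrity spotted at restaurant",
        "rust compiler internals explained",
        "horoscope for the week ahead",
        "postgres query planner deep dive",
        "reality tv recap episode twelve",
        "building a homelab kubernetes cluster",
        "royal family gossip roundup",
        "profiling python with perf",
        "fashion trends for spring",
        "writing a toy operating system",
        "lottery numbers and lucky charms",
    ]
    labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = "I'd read this"

    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=4, stratify=labels, random_state=0)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))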

You can probably make about 2,000 judgements a day (I've done bursts of a lot more than that on images, and it makes my visual system malfunction, though I might be especially vulnerable to that), and my impression is that for most text classification tasks you don't really get better performance once you go past 10,000 labels or so. Everybody thinks you have to raise $20M from VCs and then hire a bunch of people in a third-world country to label things, but the reality is that people spend more time talking about a product they want to make than it takes to make enough labels yourself to train a decent model.
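
You can watch that plateau appear by training on progressively larger slices of a labeled pool; a sketch on synthetic data (where a real text task flattens out will differ):

    # Sketch: AUC vs. number of labels, to illustrate diminishing returns.
    # Synthetic data with some label noise so the curve flattens realistically.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=30000, n_informative=10,
                               flip_y=0.1, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=5000, random_state=0)
    for n in (500, 1000, 2000, 5000, 10000, 20000):
        clf = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        print(f"{n:>6} labels -> AUC {auc:.3f}")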

But there is a garbage-in, garbage-out problem. A recommender can't get 99% accuracy the way some other systems can, because people's preferences are fickle; on the other hand, a recommender with 78% AUC is pretty satisfying to use. If you need high accuracy, you need to do some ontology work so that your categories are very well defined. If the people doing the labelling can't agree 99% of the time, there is no way the model is going to get there.
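
One way to sanity-check that before training anything is to measure agreement between two labelers directly; a sketch using raw agreement and Cohen's kappa, with invented annotator labels:

    # Sketch: raw agreement and Cohen's kappa between two annotators.
    # If humans only agree ~80% of the time, don't expect 99% from a model.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
    raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
    print("raw agreement:", raw)
    print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))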


> but the reality is that people spend more time talking about a product they want to make than it takes to make enough labels yourself

And what might the validity of such self-made labels be, if I may ask?


Depends on what you are labeling. How accurately could you look at pictures of cars and tell marked police cars from regular cars? When I do them in a hurry, though, I fat-finger maybe 1-5% of them, so it takes some cleanup.
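
That cleanup can be semi-automated: get out-of-fold predictions via cross-validation and re-check the examples where the model confidently contradicts the stored label (the confident-learning idea behind tools like cleanlab). A rough sketch with deliberately injected noise:

    # Sketch: flag probable label errors via out-of-fold predictions.
    # We inject ~3% label noise, then look for confident disagreements.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, random_state=0)
    noisy = y.copy()
    flipped = rng.choice(len(y), size=60, replace=False)  # ~3% "fat-fingered"
    noisy[flipped] = 1 - noisy[flipped]

    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, noisy,
                              cv=5, method="predict_proba")[:, 1]
    # Confidently contradicted labels are the ones worth re-checking by hand.
    suspect = np.where(((noisy == 1) & (proba < 0.1)) |
                       ((noisy == 0) & (proba > 0.9)))[0]
    print("flagged:", len(suspect),
          "| truly flipped among them:", int(np.isin(suspect, flipped).sum()))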


Fair enough, and I suppose that's why, for instance, CAPTCHAs and such make us do silly free labor.

But generally, hilarious labeling errors are already widespread in benchmarks:

https://news.ycombinator.com/item?id=26628778

Companies also seem rather carefree with their labeling, even in contexts where accuracy is paramount:

https://news.ycombinator.com/item?id=38455338

And things get interesting when you start labeling alleged political bias, for instance:

https://news.ycombinator.com/item?id=35982799

Thus, in general, I'd appreciate research on the "accuracy" (i.e., inter-annotator agreement, etc.) of labels used in the wild.


How much more useful would it be to get multiple people to label data?



