I'm trying to understand how much labeling an algorithm really needs, specifically the case where people (or other algorithms) label data and check each other's labels.
Do data sets need to be labeled extensively every time an algorithm is trained? How often does a new data set need to be labeled? Only when the data is novel?
You only have to label the data once for each problem you want to solve. You can then apply any number of different classifiers to that same problem while reusing the same labels.
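To make that concrete, here's a rough sketch (assuming a scikit-learn-style setup; the texts and labels are toy placeholders, not anything from my actual system) of reusing one labeled set across several different classifiers:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Label once: each document gets a judgement, then every model reuses it.
texts = [
    "deep dive on search relevance", "new results in graph databases",
    "celebrity breakup rumors", "horoscope for the week",
    "profiling a slow SQL query", "red carpet fashion recap",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = interesting to me, 0 = not

for clf in (LogisticRegression(max_iter=1000), LinearSVC(), MultinomialNB()):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)                       # same labels, different model each time
    print(type(clf).__name__, pipe.score(texts, labels))
```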
My RSS reader YOShInOn does content-based recommendations and needs about 500-1000 labels to make decent recommendations. With 8,000 labels I get an area under the curve (AUC) of around 78%; TikTok gets around 84% with vastly more investment as well as a lot more data (because it serves more users). One reason I haven't moved to productize it is that I think most people would expect to get results with just 5-10 labels.
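For reference, the AUC figure is ROC AUC over like/dislike predictions. Here's a hedged sketch of how that number gets computed, with synthetic feature vectors standing in for article embeddings (this is not YOShInOn's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stand-in for ~8000 labeled articles: feature vectors plus thumbs-up/down labels.
# flip_y adds label noise, a crude stand-in for fickle preferences.
X, y = make_classification(n_samples=8000, n_features=256, n_informative=40,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # ranking score per article
print("ROC AUC:", roc_auc_score(y_test, scores))  # the ~0.78 figure is this number
```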
You can probably make about 2,000 judgements a day (I've done bursts of a lot more than that on images, and it makes my visual system malfunction, though I might be especially vulnerable to that), and my impression is that for most text classification tasks you don't really get better performance once you go past 10,000 labels or so. Everybody thinks you have to raise $20M from VCs and then hire a bunch of people in a third world country to label things, but the reality is that people spend more time talking about a product they want to make than it would take to label enough data yourself to train a model.
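Labeling things yourself doesn't take much tooling either. A minimal sketch of a terminal labeling loop, assuming a hypothetical items.txt with one document per line (at ~2,000 judgements a day, 10,000 labels is roughly a week of this):

```python
import csv

# Read unlabeled documents, ask for a judgement, append to a CSV of labels.
with open("items.txt") as f, open("labels.csv", "a", newline="") as out:
    writer = csv.writer(out)
    for i, line in enumerate(f):
        text = line.strip()
        if not text:
            continue
        print(f"\n[{i}] {text[:300]}")
        answer = input("interesting? [y/n/q] ").strip().lower()
        if answer == "q":
            break
        writer.writerow([i, text, 1 if answer == "y" else 0])
```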
But there is a garbage-in, garbage-out problem. A recommender can't get 99% accuracy the way some tasks can, because people's preferences are fickle; on the other hand, a recommender with 78% AUC is pretty satisfying to use. If you need high accuracy you need to do some ontology work so you know your categories are very well defined. If the people doing the labelling can't agree 99% of the time, there is no way the model is going to get there.
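One cheap sanity check is to have two people label the same sample and measure how often they agree before blaming the model. A sketch, assuming scikit-learn and made-up label arrays:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

labeler_a = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
labeler_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

raw_agreement = (labeler_a == labeler_b).mean()   # fraction of items labeled the same
kappa = cohen_kappa_score(labeler_a, labeler_b)   # agreement corrected for chance
print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
# If humans only agree 80% of the time, a model trained on those labels
# has no realistic path to 99% accuracy on the same task.
```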