But generally, hilarious labeling errors are already widespread in benchmarks:
https://news.ycombinator.com/item?id=26628778
Companies also seem rather carefree with their labeling, even in contexts where accuracy is paramount:
https://news.ycombinator.com/item?id=38455338
And things get interesting when you start labeling alleged political bias, for instance:
https://news.ycombinator.com/item?id=35982799
Thus, in general, I'd appreciate research on the "accuracy" of labels used in the wild (e.g., labelers' inter-group agreement).
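
To make the agreement idea concrete: a common way to quantify it for two labelers is a chance-corrected agreement statistic like Cohen's kappa. Here's a rough, self-contained sketch (the labels and labeler data are made up, just for illustration):

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two labelers (Cohen's kappa)."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of items both labelers labeled identically.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected chance agreement, from each labeler's marginal label frequencies.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() & freq_b.keys())
        return (p_o - p_e) / (1 - p_e)

    # Two hypothetical labelers annotating the same ten items.
    labeler_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
    labeler_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "spam", "ham", "ham"]
    print(f"kappa = {cohen_kappa(labeler_1, labeler_2):.2f}")  # 0.58: only "moderate" agreement

The point being: raw percent agreement (80% here) can look reassuring while the chance-corrected number is much less impressive, which is exactly the kind of gap I'd like to see reported for labels used in the wild.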