You could also ask many humans and average their responses to obtain a gold label, then see how well any particular human agrees with that average. If there is a lot of variance in the human answers, then it's possible for a machine to achieve better than (individual) human performance, even on a human-labelled data set.
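As a rough illustration of that idea, here is a minimal sketch using made-up binary labels from three hypothetical annotators. The data, the `model_predictions` list, and the helper names `gold_labels` and `agreement` are all assumptions for the example, not taken from any real data set.

```python
# Toy, made-up data: each row is one item, each column one hypothetical annotator.
human_labels = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
]

# Hypothetical machine predictions for the same six items.
model_predictions = [1, 0, 1, 0, 1, 0]


def gold_labels(rows):
    """Average each item's votes and threshold at 0.5 (binary labels, odd number of annotators)."""
    return [1 if sum(votes) / len(votes) > 0.5 else 0 for votes in rows]


def agreement(predictions, gold):
    """Fraction of items on which the predictions match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


gold = gold_labels(human_labels)

# Score each annotator (one column of human_labels) against the averaged label.
num_annotators = len(human_labels[0])
human_scores = [
    agreement([row[i] for row in human_labels], gold)
    for i in range(num_annotators)
]
model_score = agreement(model_predictions, gold)

print("gold labels:       ", gold)
print("human agreement:   ", [round(s, 2) for s in human_scores])
print("machine agreement: ", round(model_score, 2))
```

In this toy case every annotator disagrees with the averaged label on at least one item, so a machine that matches the average on all items scores higher than any single human. Note that each annotator's own vote feeds into the gold label here, which makes the individual human scores slightly optimistic; leaving the scored annotator out of the average would be a stricter comparison.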