It's not obvious to me what it means to have "better than human-level performance" since most of the time the ground-truth itself is defined by humans :)
One example I can think of: a computer could read a house number or a street sign from 100 feet away, where a human with good vision might only be able to make out the same text at 20 feet.
You could also ask many humans and average their responses to obtain a golden label, then see how well any particular human agrees with that average. If there is a lot of variance in the human answers, then it's possible for a machine to have better than (individual) human performance, even on a human-labelled data set.
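Here's a minimal sketch of that idea, under assumed conditions: 5 simulated annotators with different error rates label binary items, the golden label is their majority vote, and a model that tracks the consensus closely can out-agree the noisier individual annotators. The annotator count, error rates, and item count are all made up for illustration.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical setup: 5 annotators label 1000 binary items, each flipping
# the hidden true label at a different rate. (Real datasets won't expose
# the true label; it's only used here to generate the noisy annotations.)
n_items = 1000
true_labels = [random.randint(0, 1) for _ in range(n_items)]
error_rates = [0.10, 0.15, 0.20, 0.25, 0.30]

def noisy_copy(labels, err):
    # Flip each label independently with probability err.
    return [1 - y if random.random() < err else y for y in labels]

annotations = [noisy_copy(true_labels, e) for e in error_rates]

# Golden label = majority vote across the 5 annotators (odd count, no ties).
golden = [Counter(col).most_common(1)[0][0] for col in zip(*annotations)]

def agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

for i, ann in enumerate(annotations):
    print(f"annotator {i}: agreement with golden = {agreement(ann, golden):.3f}")

# A model that matches the golden label ~90% of the time beats the noisier
# individual annotators, even though every label ultimately came from humans.
model = noisy_copy(golden, 0.10)
print(f"model: agreement with golden = {agreement(model, golden):.3f}")
```

The point of the sketch is just that "human performance" splits into two numbers: individual agreement with the consensus (which varies per annotator) and the consensus itself, and a model only needs to beat the former.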
In production you may use a labelling process that's more economized than what you'd use to establish a ground truth in research: the production team tolerates some error rate and doesn't include as much redundancy. In that sense it's easy to imagine software beating human performance, because human performance isn't a single max value but a function of budget.