The comparisons are pretty tricky to do right, especially with systems that were trained under the assumption that they would operate as "a second check". For what it's worth, that language was popularized by the first such system cleared for market by the FDA, in the mid-to-late 90s. It had, among other things, a neural network stage.
Even at that time, such systems were better than some radiologists at most tasks, and better than most radiologists at some tasks - but breadth was the problem, as was generalization across a lot of different imaging setups.
I think this is more a data problem than an algorithmic one. With something as narrow as screening mammography CAD (very different from diagnostic), it's quite plausible that it could become a more effective "first pass" tool than humans on average, but getting there would require unprecedented data sharing and access (that first system was trained on a few thousand 2D sets, nowhere near enough to capture sample variability).