That doesn't matter if they're doing paired comparisons, as they are. Two comparisons of a few hundred images yield far more reliable results than a few hundred comparisons of two images, because there are measurements both within and between images.
An analogous situation: if someone runs a blood pressure clinical trial, whose results will you believe more - a trial which measures one person's blood pressure on and off a drug several hundred times over a year or two, or a trial which measures several hundred people's blood pressure at the beginning and end of the trial?
Obviously the latter, because we know that there are big differences between people, which must be measured if we want to make reliable predictions about the effect of the drug in the rest of the population. Additional blood pressure measurements of the same person reduce variability only a little and help only a little (because most of the sampling error was removed by the first pair of measurements, and further measurements leave the bulk of the variance unaffected).
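
To make that concrete, here is a rough simulation sketch of the two trial designs. Everything in it is an assumption of mine for illustration: the function name `simulate`, the normal model, and the made-up numbers (a true average effect of -10 mmHg, a between-person SD of 8 for the drug's effect, and a per-measurement noise SD of 5). It repeats each design many times and reports how much the estimated average effect bounces around from trial to trial.

```python
import random
import statistics


def simulate(n_trials=500, n=200, true_effect=-10.0,
             sd_between=8.0, sd_noise=5.0, seed=0):
    """Return (SD of estimate for one-person design, SD for many-people design).

    All parameter values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    one_person_estimates = []
    many_people_estimates = []

    def paired_difference(person_effect):
        # One on-drug minus off-drug pair: the person's own effect
        # plus independent noise on each of the two readings.
        return person_effect + rng.gauss(0, sd_noise) - rng.gauss(0, sd_noise)

    for _ in range(n_trials):
        # Design A: a single person, measured on/off the drug n times.
        person_effect = rng.gauss(true_effect, sd_between)
        diffs = [paired_difference(person_effect) for _ in range(n)]
        one_person_estimates.append(statistics.fmean(diffs))

        # Design B: n different people, one on/off pair each.
        diffs = [paired_difference(rng.gauss(true_effect, sd_between))
                 for _ in range(n)]
        many_people_estimates.append(statistics.fmean(diffs))

    return (statistics.pstdev(one_person_estimates),
            statistics.pstdev(many_people_estimates))


sd_one_person, sd_many_people = simulate()
print(f"one person, many measurements: SD of estimate = {sd_one_person:.2f}")
print(f"many people, one pair each:    SD of estimate = {sd_many_people:.2f}")
```

Under this toy model the variance of the one-person estimate is about sd_between^2 + 2*sd_noise^2/n, so it never drops below the between-person variance no matter how many times you re-measure, while the many-people estimate has variance (sd_between^2 + 2*sd_noise^2)/n and keeps shrinking as n grows.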