The accuracy reported is top-5 accuracy: a model is considered correct on a test image if the expected label appears among its top 5 predicted labels. This does mitigate the multi-object issue quite a bit.
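For concreteness, here's a minimal sketch of how top-5 accuracy is typically computed; the names (`logits`, `labels`, `top5_accuracy`) are my own, not from the paper:

```python
import numpy as np

def top5_accuracy(logits, labels):
    """Fraction of examples whose true label is among the 5 highest-scoring predictions.

    logits: (N, num_classes) array of model scores
    labels: (N,) array of integer class indices
    """
    # Indices of the 5 largest scores per row (order among the 5 doesn't matter)
    top5 = np.argsort(logits, axis=1)[:, -5:]
    hits = np.any(top5 == labels[:, None], axis=1)
    return hits.mean()

# Tiny usage example: 3 images, 10 classes
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 10))
labels = np.array([2, 7, 4])
print(top5_accuracy(logits, labels))
```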
Most of the examples cited in the paper that their algorithm got wrong, I would also have gotten 'wrong.' I don't think I would guess 'restaurant' for any of the images with that label; I might have gotten the middle spotlight and the first letter opener right, but I'm not sure.
How do you explain that their performance is better than a human's? Is the difference in the obscure examples?
On the other hand, some of the images they got right have questionable predictions as well. For example, they labeled the geyser picture correctly, but their top labels also included "sandbar", "breakwater", and "leatherback turtle". A better scoring function, perhaps one that uses a label hierarchy to account for the very vague "restaurant" photos and the very specific dog-breed photos, might be helpful. Otherwise, it seems like we might be overfitting to the peculiarities of this dataset.
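Just to make the hierarchy idea concrete, here's a toy sketch of the kind of scoring I have in mind: give partial credit when a prediction shares an ancestor with the true label. The `parents` map and the scoring rule are made up for illustration; they aren't from the paper or from the actual WordNet hierarchy ImageNet uses.

```python
# Tiny made-up fragment of a label hierarchy (child -> parent).
parents = {
    "Eskimo dog": "working dog",
    "Siberian husky": "working dog",
    "working dog": "dog",
    "dog": "animal",
    "restaurant": "building",
    "building": None,
    "animal": None,
}

def ancestors(label):
    """Chain from a label up to the root of the toy hierarchy."""
    chain = []
    while label is not None:
        chain.append(label)
        label = parents.get(label)
    return chain

def hierarchy_score(predicted, true_label):
    """1.0 for an exact match, fractional credit for a shared ancestor, else 0."""
    if predicted == true_label:
        return 1.0
    pred_chain, true_chain = ancestors(predicted), ancestors(true_label)
    common = set(pred_chain) & set(true_chain)
    if not common:
        return 0.0
    # Credit falls off with how far the nearest shared ancestor is from the true label
    depth = min(true_chain.index(a) for a in common)
    return 1.0 / (1 + depth)

print(hierarchy_score("Siberian husky", "Eskimo dog"))  # 0.5: wrong breed, right kind of dog
print(hierarchy_score("restaurant", "Eskimo dog"))      # 0.0: no shared ancestor
```

Under a rule like this, confusing two husky-like breeds costs much less than predicting "leatherback turtle" for a geyser, which feels closer to how a human grader would score it.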