It seems this might mean that ImageNet is becoming a less useful benchmark dataset. Some of the images have labels which I would never have guessed myself, not because I don't know the difference between two types of stingray, but because I would never have said that the topic of the image was a seatbelt. It would be interesting to know how many errors are actually due to outputting a label for something which does exist in the image but isn't in the ground-truth labels, and how many are for things that are plainly not in the image.
The accuracy reported is top-5 accuracy, meaning that a model is considered correct on a test image if it includes the expected label in its top 5 predicted labels. This does mitigate the multi-object issue quite a bit.
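To make the top-5 rule concrete, here is a minimal sketch in plain Python. The function names and the scores are made up for illustration (only the label names come from this thread), and this is not the code used in the paper.

```python
def top_k_correct(scores, true_label, k=5):
    """Return True if true_label is among the k highest-scoring labels.

    scores: dict mapping label -> model score (hypothetical values).
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return true_label in top_k


def top_k_accuracy(all_scores, true_labels, k=5):
    """Fraction of images whose true label appears in the model's top k."""
    hits = sum(top_k_correct(s, y, k) for s, y in zip(all_scores, true_labels))
    return hits / len(true_labels)


# Made-up scores for a single image; not taken from the paper.
example_scores = {
    "geyser": 0.40,
    "sandbar": 0.20,
    "breakwater": 0.15,
    "leatherback turtle": 0.10,
    "volcano": 0.08,
    "seatbelt": 0.07,
}

print(top_k_correct(example_scores, "geyser"))    # in the top 5
print(top_k_correct(example_scores, "seatbelt"))  # ranked sixth, so a miss
```

Under this rule a model can put four unrelated labels in its list and still be scored as correct, as long as the expected label is the fifth.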
Most of the examples cited in the paper that their algorithm got wrong, I would also get 'wrong.' I don't think I would guess "restaurant" for any of the images with that label. I might have gotten the middle spotlight and the first letter opener right, but I'm not sure.
How do you explain that their performance is better than human? Is it in the obscure examples?
On the other hand, some of their correct images have questionable captions as well. For example, they labeled the geyser picture correctly, but their top labels also included "sandbar", "breakwater" and "leatherback turtle". A better scoring function, perhaps including hierarchies to account for the very vague "restaurant" photos and very specific dog breed photos, might be helpful. Otherwise, it seems like we might be overfitting to the peculiarities of this dataset.
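One way a hierarchy-aware score could work, as a rough sketch only: give partial credit based on how deep the lowest common ancestor of the predicted and true labels sits in a label tree. The tree below is a toy one invented for illustration, not real WordNet or ImageNet data, and the scoring rule is just one possible choice, not something from the paper.

```python
# Toy parent map (child -> parent), invented for illustration.
PARENT = {
    "beagle": "hound",
    "basset": "hound",
    "hound": "dog",
    "husky": "dog",
    "dog": "animal",
    "stingray": "animal",
    "animal": "entity",
    "restaurant": "entity",
}


def ancestors(label):
    """Return the path from label up to the root, starting with label itself."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(label):
    """Number of steps from label up to the root (the root has depth 0)."""
    return len(ancestors(label)) - 1


def hierarchical_score(predicted, truth):
    """Partial credit in [0, 1] based on the lowest common ancestor's depth."""
    truth_path = set(ancestors(truth))
    # First node on the prediction's path that is also an ancestor of the truth.
    lca = next((n for n in ancestors(predicted) if n in truth_path), None)
    if lca is None:
        return 0.0  # labels are not in the same tree
    deepest = max(depth(predicted), depth(truth))
    return 1.0 if deepest == 0 else depth(lca) / deepest


print(hierarchical_score("beagle", "beagle"))      # exact match
print(hierarchical_score("basset", "beagle"))      # wrong breed, still a hound
print(hierarchical_score("restaurant", "beagle"))  # unrelated
```

With a rule like this, confusing two dog breeds costs much less than calling a dog a restaurant, while an exact match still gets full credit.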
Interesting. I am typically annoyed when people move the goalposts like this on AI, but in my research it seems like the ImageNet library is annotated recursively through various automated sets and is not actually founded on human classification.
That is a good question. I went and read the about page at ImageNet [1], but it wasn't completely obvious what the source is. One thing to note, though, is that the context as described there is "1000 images to illustrate each word." This doesn't exactly match "words with which a person would most likely label an image." Check out the page on team sport, for example [2]. Would you ever label one of those 'team sport'? Or what about this one labeled 'fresh bean' [3]? It is not hard to find hard examples here [4].
I asked the question because it seems like their library has too many images for manual input and I never found a good answer on what the seed set was on the site.
I don't think the question should be "would you label these..." but more so "is description 'X' an accurate representation of the image." In the latter case I say yes to most of the examples I find. The former question severely limits the set of possibilities to those of whoever is doing the classifying.
I know there has been a ton of research done on this stuff, but to me the context threshold for image classification seems difficult to bound, because humans might be looking for something very specific in a photo, to the point of it being impossibly difficult to annotate.
I can think of a million annotations for it:
- 9 men
- Island
- Trees near water
- Rocks in trees
- Backwards hat
- Blue water
- Rowing
- Oar splashing
- Nice day
- Warm weather
- White boat
- Saddleback mountain
etc...
So how would you rank those in terms of what the picture is really showing? The only thing I can think of is popularity, which makes the most sense to me in terms of how we natively classify, but it doesn't cover boundary cases or super narrow specificity.
Yeah, that's a good point that "is description X an accurate representation" is a better question, though it's harder to establish ground truth for, since the truth is a large set of possible 'accurate representations.'
From the PReLU paper, I found this:
> Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well trained on the validation images to be better aware of the existence of relevant classes. When annotating the test images, the human annotator is given a special interface, where each class title is accompanied by a row of 13 example training images. The reported human performance is estimated on a random subset of 1500 test images.
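For a rough sense of how much sampling noise sits in that 5.1% figure, here is a back-of-the-envelope sketch. It assumes the 1500 images behave like independent draws and uses a normal approximation to the binomial, neither of which is stated in the quote. The function name is made up for illustration.

```python
import math


def error_rate_interval(error_rate, n_images, z=1.96):
    """Normal-approximation interval for an error rate measured on n_images.

    z=1.96 corresponds to an approximate 95% interval (an assumption here).
    """
    std_err = math.sqrt(error_rate * (1.0 - error_rate) / n_images)
    return error_rate - z * std_err, error_rate + z * std_err


low, high = error_rate_interval(0.051, 1500)
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval comes out at roughly 4% to 6%, so differences of a few tenths of a percent against the human number are within the noise of a 1500-image sample.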