Interesting. I'm typically annoyed when people move the goalposts like this on AI, but from my research it seems like the ImageNet library is annotated recursively through various automated sets rather than actually being founded on human classification.
That's a good question. I went and read the about page at ImageNet [1], but it wasn't completely obvious what the source is. One thing to note, though, is that the context as described there is "1000 images to illustrate each word." This doesn't exactly match "words which a person would most likely label an image." Check out the page on team sport, for example [2]. Would you ever label one of those 'team sport'? Or what about this one labeled 'fresh bean' [3]? It's not hard to find hard examples here [4].
I asked the question because it seems like their library has too many images for manual labeling, and I never found a good answer on the site about what the seed set was.
I don't think the question should be "would you label these..." but rather "is description 'X' an accurate representation of the image?" In the latter case I'd say yes to most of the examples I find. The former question severely limits the set of possibilities to whoever is doing the classifying.
I know there has been a ton of research done on this stuff, but to me the context threshold for image classification seems difficult to bound, because humans might be looking for something very specific in a photo, to the point of it being impossibly difficult to annotate.
I can think of a million annotations for it:
- 9 men
- Island
- Trees near water
- Rocks in trees
- Backwards hat
- Blue water
- Rowing
- Oar splashing
- Nice day
- Warm weather
- White boat
- Saddleback mountain
etc...
So how would you rank those in terms of what the picture is really showing? The only thing I can think of is popularity, which makes the most sense to me in terms of how we natively classify - but it doesn't cover boundary cases or super-narrow specificity.
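To make the popularity idea concrete, here's a tiny sketch (Python, with made-up annotator data - none of this is from the actual ImageNet pipeline) of ranking free-form labels by how many annotators propose them:

    from collections import Counter

    # Hypothetical free-form labels from three annotators looking at the same photo.
    annotations_by_annotator = [
        ["rowing", "9 men", "blue water"],
        ["rowing", "white boat", "nice day"],
        ["team sport", "rowing", "blue water"],
    ]

    # Tally every proposed label and rank by vote count.
    votes = Counter(label for labels in annotations_by_annotator for label in labels)
    print(votes.most_common())
    # [('rowing', 3), ('blue water', 2), ('9 men', 1), ...]

The accurate-but-narrow labels all end up tied at one vote and sink to the bottom, which is exactly the boundary-case problem.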
Yeah, that's a good point that "is description X an accurate representation" is the better question, though it's harder to ground-truth, since the truth is a large set of possible 'accurate representations.'
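Rough sketch of how the two framings diverge when grading the same guess (all labels here are illustrative, not from ImageNet): under the first, the truth is one canonical word; under the second, it's a whole set of descriptions judged accurate.

    canonical_label = "team sport"                       # single ground-truth word
    acceptable = {"rowing", "team sport", "white boat",  # set of accurate descriptions
                  "blue water", "9 men"}

    guess = "rowing"
    print(guess == canonical_label)   # False - wrong under the single-label framing
    print(guess in acceptable)        # True  - right under the set-of-descriptions framing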
From the PReLU paper, I found this:
> Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well trained on the validation images to be better aware of the existence of relevant classes. When annotating the test images, the human annotator is given a special interface, where each class title is accompanied by a row of 13 example training images. The reported human performance is estimated on a random subset of 1500 test images.
Is that accurate?
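For reference on what that number measures: under top-5 error a guess counts as correct if the true class is anywhere among the five highest-ranked guesses, so the error rate is the fraction of images where it isn't. A toy sketch (labels and guesses made up, not the actual evaluation code):

    def top5_error(ranked_guesses, true_labels):
        # ranked_guesses: one best-first list of class guesses per image.
        misses = sum(1 for guesses, truth in zip(ranked_guesses, true_labels)
                     if truth not in guesses[:5])
        return misses / len(true_labels)

    guesses = [["rowing", "canoe", "kayak", "paddle", "boathouse", "dock"],
               ["fresh bean", "green bean", "pea", "okra", "asparagus", "celery"]]
    truth = ["kayak", "celery"]
    print(top5_error(guesses, truth))  # 0.5 - "celery" is ranked sixth, so it misses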