Surpassing Human-Level Performance on ImageNet Classification [pdf] (arxiv.org)
53 points by _ntka on Feb 9, 2015 | 13 comments



Interesting. I've long wondered why parametric nonlinearities are not used more often. They add very little to the overall parameter count (the number of units is usually dwarfed by the number of connections), but they should increase expressiveness a lot (e.g. adaptively using soft or hard nonlinearities). Taken to its extreme, I've been toying with the idea of combining DNNs and genetic programming - using a large number of arbitrary symbolic expressions with high connectivity and many layers.
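
As a rough illustration of how cheap a parametric nonlinearity is, here is a minimal PReLU-style sketch in plain numpy; the function names are mine, and the only extra parameter is the negative-half slope (e.g. one scalar per channel):

  import numpy as np

  def prelu(x, a):
      # learnable negative-half slope `a`, broadcastable to x (e.g. one per channel)
      return np.where(x > 0, x, a * x)

  def prelu_grad_a(x, grad_out):
      # gradient of the loss w.r.t. the slope: x where x <= 0, else 0, times the upstream grad
      return np.sum(np.where(x > 0, 0.0, x) * grad_out)

A conv layer with millions of weights gains only as many new parameters as it has channels.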


Genetic programming is an inefficient search method, and will require many evaluations of the cost function to optimize anything. In the case of DCNNs, evaluating an architecture can keep a modern GPU busy for days, so genetic algorithms are pretty much out of the question.

I think an easy way to improve our models in the short term is to make more of the parameters we use learnable: the parameters of the nonlinearity are a good place to start, and another would be the parameters of the data augmentation transformations.

One could consider that learned data augmentation schemes implement a form of guided visual attention.
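
A toy sketch of the learnable-augmentation idea (my own illustration, not something from the paper): treat a brightness/contrast pair as parameters and let the optimizer update them alongside the network weights, since the transform is differentiable:

  import torch

  contrast = torch.ones(1, requires_grad=True)
  brightness = torch.zeros(1, requires_grad=True)

  def augment(x):
      # differentiable augmentation, so gradients flow into its parameters
      return contrast * x + brightness

  # hypothetical usage: just append the augmentation parameters to the optimizer
  # optimizer = torch.optim.SGD(list(model.parameters()) + [contrast, brightness], lr=0.01)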


That's an impressive result, considering how simple the algorithm really is. (The learning algorithm isn't obvious, but it's not a lot of code.)

Can this algorithm be run in reverse, to generate an image from the network? That's been done with one of the other deep neural net classifiers. "School bus" came out as a yellow blob with horizontal black lines.

An interesting question is whether there's a bias in the data set because humans composed the pictures, and humans like to take pictures of certain things. (Cats are probably over-represented.) Images taken by humans tend to have a primary subject, and that subject is usually roughly centered in the images. It might be useful to test against a data set taken from Google StreetView images, which lack such a composition bias.
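
By "in reverse" I mean something like gradient ascent on the input image to maximize a class score. A minimal sketch, assuming a pretrained torchvision model (the class index, step size, and iteration count are placeholders):

  import torch
  from torchvision import models

  model = models.resnet18(weights="DEFAULT").eval()
  target_class = 779                                # hypothetical "school bus" index
  img = torch.zeros(1, 3, 224, 224, requires_grad=True)
  optimizer = torch.optim.SGD([img], lr=1.0)

  for _ in range(200):
      optimizer.zero_grad()
      score = model(img)[0, target_class]
      (-score).backward()                           # ascend the class score
      optimizer.step()
  # `img` now roughly shows what the network "thinks" the class looks like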


> Can this algorithm be run in reverse, to generate an image from the network? That's been done with one of the other deep neural net classifiers. "School bus" came out as a yellow blob with horizontal black lines.

Do you have a link for that? It sounds interesting (nothing turned up in a quick Google search).

edit: I found something similar to what you were talking about; is this [1] what you meant?

  [1] http://www.evolvingai.org/fooling


That's the paper he's referring to. They generated candidate images (evolving them with a genetic algorithm) and kept the ones that the first NN classified as "school bus" with the highest confidence.
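
A toy version of that generate-and-select loop (the real paper evolves the images; this only shows the selection step, with made-up candidates and a placeholder class index):

  import torch
  from torchvision import models

  model = models.resnet18(weights="DEFAULT").eval()
  target_class = 779                                # hypothetical "school bus" index
  best_img, best_score = None, float("-inf")
  with torch.no_grad():
      for _ in range(100):
          candidate = torch.rand(1, 3, 224, 224)    # stand-in for an evolved image
          score = model(candidate)[0, target_class].item()
          if score > best_score:
              best_img, best_score = candidate, score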


It seems this might mean that ImageNet is becoming a less useful benchmark dataset. Some of the images have labels I would never have guessed myself, and not because I don't know the difference between two types of stingray, but because I would never have said that the topic of the image was a seatbelt. It would be interesting to know how many errors are due to outputting a label that does exist in the image but isn't in the labeled ground truth, and how many are plainly not in the image.


The accuracy reported is top-5 accuracy, meaning that a model is considered correct on a test image if it includes the expected label in its top 5 predicted labels. This does mitigate the multi-object issue quite a bit.
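
For concreteness, a minimal sketch of the top-5 rule, assuming `scores` is an (N, 1000) array of class scores and `labels` the (N,) array of true class ids (names are mine):

  import numpy as np

  def top5_accuracy(scores, labels):
      top5 = np.argsort(scores, axis=1)[:, -5:]        # indices of the 5 highest scores
      hits = (top5 == labels[:, None]).any(axis=1)     # true label anywhere in the top 5
      return hits.mean()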


Most of the examples cited in the paper that their algorithm got wrong, I would also get 'wrong.' I don't think I would guess 'restaurant' for any of the images with that label. I might have gotten the middle spotlight and the first letter opener right, but I'm not sure.

How do you explain that their performance is better than human? Is it in the obscure examples?


On the other hand, some of the images they got right have questionable labels elsewhere in the top 5. For example, they labeled the geyser picture correctly, but their top labels also included "sandbar", "breakwater" and "leatherback turtle". A better scoring function, perhaps one that uses the label hierarchy to account for the very vague "restaurant" photos and the very specific dog-breed photos, might help. Otherwise, it seems like we might be overfitting to the peculiarities of this dataset.
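
Since the ImageNet classes are WordNet synsets, one rough way to sketch a hierarchy-aware score is to give partial credit based on WordNet path similarity. The lemmas, threshold, and "first synset" choice here are my own simplifications:

  # requires the NLTK wordnet corpus: nltk.download('wordnet')
  from nltk.corpus import wordnet as wn

  def hierarchical_credit(pred_lemma, true_lemma, threshold=0.2):
      pred = wn.synsets(pred_lemma, pos=wn.NOUN)[0]
      true = wn.synsets(true_lemma, pos=wn.NOUN)[0]
      sim = pred.path_similarity(true)              # 1.0 for identical synsets
      return 1.0 if sim is not None and sim >= threshold else 0.0

  print(hierarchical_credit("geyser", "sandbar"))   # distant concepts, likely 0.0
  print(hierarchical_credit("malamute", "husky"))   # nearby dog breeds, likely 1.0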


Interesting. I'm typically annoyed when people move the goalposts on AI like this, but from my reading it seems like the ImageNet library is annotated recursively through various automated steps rather than being founded directly on human classification.

Is that accurate?


That is a good question. I went and read the about page at ImageNet [1], but it wasn't completely obvious what the source is. One thing to note, though, is that the context described there is "1000 images to illustrate each word." That doesn't exactly match "words which a person would most likely label an image." Check out the page on 'team sport', for example [2]. Would you ever label one of those 'team sport'? Or what about this one labeled 'fresh bean' [3]? It is not hard to find hard examples here [4].

  [1]: http://www.image-net.org/about-overview
  [2]: http://www.image-net.org/synset?wnid=n00887544
  [3]: http://www.image-net.org/nodes/8/07727578/5b/5bcde1a26fd46c485ca7df589e57ba2ded9cb789.thumb
  [4]: http://www.image-net.org/synset


I asked the question because it seems like their library has too many images for manual labeling, and I never found a good answer on the site about what the seed set was.

I don't think the question should be "would you label these..." but rather "is description 'X' an accurate representation of the image?" In the latter case I say yes to most of the examples I find. The former question severely limits the set of possibilities to whoever is doing the classifying.

I know there has been a ton of research done on this stuff, but to me the context threshold for image classification seems difficult to bound, because humans might be looking for something very specific in a photo, to the point where it becomes impossibly difficult to annotate.

For example take this photo: http://farm1.static.flickr.com/222/497812075_d412946eef.jpg

I can think of a million annotations for it: 9 men, island, trees near water, rocks in trees, backwards hat, blue water, rowing, oar splashing, nice day, warm weather, white boat, Saddleback mountain, etc.

So how would you rank those in terms of what the picture is really showing? The only thing I can think of is popularity, which makes the most sense to me in terms of how we natively classify - but it doesn't cover boundary cases or super narrow specificity.
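
The popularity idea is easy to sketch: collect free-form annotations from many people and keep the most frequent ones (the annotation list here is made up):

  from collections import Counter

  annotations = ["rowing", "boat", "rowing", "lake", "rowing", "boat", "trees"]
  ranked = Counter(annotations).most_common()
  print(ranked)   # [('rowing', 3), ('boat', 2), ('lake', 1), ('trees', 1)]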


Looks like they use Amazon Mechanical Turk: http://www.image-net.org/papers/imagenet_cvpr09.pdf

Yeah, that's a good point that "is description X an accurate representation" is a better question, though it is harder to establish ground truth for, since the truth is a large set of possible 'accurate representations.'

From the PReLU paper, I found this:

> Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well trained on the validation images to be better aware of the existence of relevant classes. When annotating the test images, the human annotator is given a special interface, where each class title is accompanied by a row of 13 example training images. The reported human performance is estimated on a random subset of 1500 test images.
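
As a quick back-of-envelope check (mine, not from the paper) on how precise an estimate from 1500 images can be:

  p, n = 0.051, 1500
  se = (p * (1 - p) / n) ** 0.5            # binomial standard error
  print(f"standard error ~ {se:.2%}")      # ~0.57%, i.e. roughly +/-1.1% at 95% confidence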



