Building a natural description of images

etiam · on Nov 18, 2014

Note that many of the errors are much more understandable if one considers that the convolutional net pooling destroys much of the spatial relations in the pictures.

I imagine I might make similar errors if I only got little jumbled fragments to work from. Given those conditions, the cat "laying on a couch" or the dog "jumping to catch a frisbee" hardly even seem like errors to me.

This is going to get radically better when someone works out an efficient way to keep the spatial relations.
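To make the pooling point concrete, here is a minimal sketch (plain Python, not the network from the article) of non-overlapping 2x2 max pooling. The two toy grids and the helper name `max_pool_2x2` are assumptions chosen for illustration: the same four values sit in different positions, yet the pooled output is identical, so whatever reads the pooled features cannot tell the two layouts apart.

```python
def max_pool_2x2(grid):
    """Non-overlapping 2x2 max pooling over a list-of-lists with even sides."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]

# Toy "images": the same four activations, pushed to the corners in one
# and clustered in the middle in the other.
corners = [
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [3, 0, 0, 4],
]
clustered = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]

pooled_corners = max_pool_2x2(corners)
pooled_clustered = max_pool_2x2(clustered)

# The inputs differ, but pooling maps both to the same 2x2 summary.
print(corners != clustered)                # True
print(pooled_corners == pooled_clustered)  # True
```

Real convnets stack several such layers, so the positional slack compounds with depth.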

davmre · on Nov 18, 2014

Geoff Hinton gave a talk last week at Berkeley on exactly this problem - in pixel space, object identities are all tangled up with location/pose information in a very nonlinear way; it would be nice to find a representation that actually preserves both components while disentangling them ("equivariance") instead of just throwing away all of the spatial information ("invariance", what convnets do). He's done some work on this, a lot of which is apparently unpublished, but gave a reference to one older paper covering some of the ideas: https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf
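As a rough illustration of the two terms (a toy 1-D sketch, not the method from the linked paper), the code below uses a circular cross-correlation as a stand-in for a convolutional layer and a global max as a stand-in for pooling. The signal, kernel, and function names are made up for the example. Shifting the input shifts the correlation output by the same amount (equivariance), while the global max stays the same and no longer says where anything was (invariance).

```python
def circular_correlate(signal, kernel):
    """Cross-correlate with wrap-around, so shifts never fall off the edge."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i + k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]

def shift(seq, s):
    """Rotate a list to the right by s positions."""
    s %= len(seq)
    return seq[-s:] + seq[:-s] if s else list(seq)

signal = [0, 0, 1, 3, 1, 0, 0, 0]   # toy input with one "object"
kernel = [1, 2, 1]                  # toy filter

features = circular_correlate(signal, kernel)
features_of_shifted = circular_correlate(shift(signal, 2), kernel)

# Equivariance: moving the input moves the feature map by the same amount.
equivariant = features_of_shifted == shift(features, 2)

# Invariance: the global max is unchanged, and the position is gone.
invariant = max(features_of_shifted) == max(features)

print(equivariant, invariant)  # True True
```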

iamsalman · on Nov 18, 2014

Three components:

1. Object recognition (there's a dog and a frisbee in the photo)
2. Object localization (the dog's and the frisbee's ROIs in the photo)
3. Relation estimation (based on X factors, the dog might be chasing the frisbee)

Not sure what you meant by spatial relations (localization?) but recognizing (what) and localizing (where) would be key to drawing relationships between objects.
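A toy sketch of how the three components could fit together, assuming recognition and localization have already produced labeled bounding boxes. The `Detection` class, the box coordinates, and the center-comparison rule are all hypothetical; a real system would learn relations rather than hard-code them.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str   # recognition: what the object is
    box: tuple   # localization: (x_min, y_min, x_max, y_max), y grows downward

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def relation(a, b):
    """Crude, made-up rule: describe where `a` is relative to `b`."""
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        return "left of" if dx > 0 else "right of"
    return "above" if dy > 0 else "below"

# Hypothetical detector output for the dog-and-frisbee photo.
dog = Detection("dog", (10, 60, 50, 100))
frisbee = Detection("frisbee", (70, 20, 90, 30))

sentence = f"{dog.label} is {relation(dog, frisbee)} {frisbee.label}"
print(sentence)  # dog is below frisbee
```

Getting from "below" to "jumping to catch" is the hard third step, and it needs more than box geometry.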

Really impressive work but definitely not a leap.

SammoJ · on Nov 18, 2014

Relevant: a very similar paper, input/output-wise, from our resident karpathy, with a detailed discussion in the comments: https://news.ycombinator.com/item?id=8621658

Trufa · on Nov 18, 2014

I am very surprised it got the color of the motorcycle wrong; it seems like the easiest thing to detect...

jessriedel · on Nov 18, 2014

Much more likely than Xophmeister's explanation, I think, is that the brightness of a color in an image is relative to the lighting condition. (Remember that pink is just white mixed with red.) See image B:

http://www.huevaluechroma.com/pics/3-4.jpg

This is also true to some extent with the actual hue (i.e. red versus green, rather than brightness; see image C) but less so.
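A small sketch of the "pink is white mixed with red" point, using Python's standard `colorsys` module. The 50/50 mix and the 0.5 dimming factor are arbitrary assumptions. Pure red and the mixed pink come out with the same hue and differ only in saturation, and dimming the light changes only the value channel, so separating pink from red depends on judging saturation and brightness relative to the scene lighting.

```python
import colorsys

red = (1.0, 0.0, 0.0)
white = (1.0, 1.0, 1.0)

def mix(c1, c2, t):
    """Linear blend of two RGB colors; t=0 gives c1, t=1 gives c2."""
    return tuple((1 - t) * a + t * b for a, b in zip(c1, c2))

def dim(color, factor):
    """Scale every channel, a crude stand-in for weaker illumination."""
    return tuple(factor * ch for ch in color)

pink = mix(red, white, 0.5)   # (1.0, 0.5, 0.5)
dim_pink = dim(pink, 0.5)     # (0.5, 0.25, 0.25)

hue_red, sat_red, val_red = colorsys.rgb_to_hsv(*red)
hue_pink, sat_pink, val_pink = colorsys.rgb_to_hsv(*pink)
hue_dim, sat_dim, val_dim = colorsys.rgb_to_hsv(*dim_pink)

print(hue_red, hue_pink, hue_dim)   # same hue for all three
print(sat_red, sat_pink, sat_dim)   # red is fully saturated, pink is not
print(val_pink, val_dim)            # dimming only lowers the value
```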

Xophmeister · on Nov 18, 2014

It may not have a sophisticated enough vocabulary to distinguish 'pink' when 'red' was close enough. This effect is manifest in human languages which classify colours differently: say, for example, a language may have no word for 'blue', so the sky is 'green' to its speakers; it's still perceptually different to them, of course, but the lack of fidelity means it can't be communicated better than "sky green" or "grass green".
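The vocabulary argument can be sketched as a nearest-name lookup. Everything here is an assumption for illustration: the RGB anchors, the Euclidean distance in RGB space, and the sample color are not how the captioning model names colors. With no "pink" entry, a pink sample falls back to "red"; adding the word changes the answer without changing the pixel.

```python
def nearest_name(rgb, vocabulary):
    """Return the vocabulary word whose anchor color is closest in RGB space."""
    def squared_distance(anchor):
        return sum((a - b) ** 2 for a, b in zip(rgb, anchor))
    return min(vocabulary, key=lambda name: squared_distance(vocabulary[name]))

# Hypothetical small color vocabulary with no word for pink.
small_vocab = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "black": (0, 0, 0),
}
# The same vocabulary with one extra word.
larger_vocab = dict(small_vocab, pink=(255, 192, 203))

sample = (255, 105, 180)  # a strong pink

print(nearest_name(sample, small_vocab))   # red
print(nearest_name(sample, larger_vocab))  # pink
```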

DanBC · on Nov 18, 2014

Above the image of the pink (but labeled red) motorbike there is an image of a child wearing a woolly hat. The hat is red with white fluff. That hat is labeled pink.

So, it knows "pink". And the motorbike isn't borderline pink / red -- it's not like hunting pinks -- it is definitely pink.

Having said all that I'm amazed at the results. It feels like I'm living in the future.

Trufa · on Nov 18, 2014

I understand there are a lot of subtleties, but it's still surprising to me that it can recognize a parked motorcycle from an awkward angle but can't distinguish ordinary English pink from red.

Just to be clear, I don't mean it as a criticism, it just seems to be the easier part.

ajuc · on Nov 18, 2014

Most probably there was no pink motorcycle in the training dataset, and the neural net may have failed to generalize colors between objects.

A similar error: the yellow passenger car is described as a yellow school bus, because in the training data a yellow vehicle is more often a school bus.
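One way to picture this is a toy calculation in which a prior learned from caption counts outweighs the visual evidence. All numbers below are made up for illustration; they are not from the paper or its training set, and the real model does not necessarily combine evidence this way.

```python
# Hypothetical counts of "yellow <object>" phrases in training captions.
yellow_caption_counts = {"school bus": 90, "car": 10}

# Hypothetical appearance scores for one image: the shape looks a bit
# more like a car than a bus.
appearance_score = {"school bus": 0.4, "car": 0.6}

total = sum(yellow_caption_counts.values())
prior_given_yellow = {obj: n / total for obj, n in yellow_caption_counts.items()}

# Combine evidence and prior by simple multiplication (naive Bayes style).
combined = {obj: appearance_score[obj] * prior_given_yellow[obj]
            for obj in appearance_score}

best = max(combined, key=combined.get)
print(best)  # school bus, despite the car-like appearance
```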

rspeer · on Nov 18, 2014

This is interesting. I think that natural language generation is a largely overlooked task outside of machine translation -- perhaps because most tasks that might require it can get away with the much stupider, much easier job of filling in templates like a form letter. It's cool to see Google attempting the real thing, on top of the image recognition.
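For contrast, this is roughly what the template-filling approach looks like: a minimal sketch with a hypothetical one-sentence template and slot names. It only works for scenes that fit the fixed pattern, which is why generating free-form sentences is the harder and more interesting task.

```python
# Hypothetical fixed template; real template systems have many of these.
TEMPLATE = "A {subject} is {action} a {target}."

def fill_template(slots):
    """Drop detected labels into the fixed sentence pattern."""
    return TEMPLATE.format(**slots)

caption = fill_template({
    "subject": "dog",
    "action": "jumping to catch",
    "target": "frisbee",
})
print(caption)  # A dog is jumping to catch a frisbee.
```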

That said, I don't expect particularly high accuracy from the composition of an image recognition system and natural language generation. The first actual demo of this is going to be a source of utter hilarity. I hope they're okay with that.

teddyh · on Nov 18, 2014

Duplicate of https://news.ycombinator.com/item?id=8623095

Bjoern · on Nov 18, 2014

The other way around. The link you gave is a duplicate of this post actually.

bennetthi · on Nov 18, 2014

Here is the HN post from yesterday that also points to the NYTimes article: https://news.ycombinator.com/item?id=8621658

teddyh · on Nov 19, 2014

OK – this post has a higher ID than the other one; I assumed they were assigned sequentially.