Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Building a natural description of images (googleresearch.blogspot.com)
146 points by bpierre on Nov 18, 2014 | hide | past | favorite | 15 comments


Note that many of the errors are much more understandable if one considers that the convolutional net pooling destroys much of the spatial relations in the pictures.

I imagine I might make similar errors if I only got little jumbled fragments to work from. Given those conditions, the cat "laying on a couch" or the dog "jumping to catch a frisbee" hardly even seem like errors to me.

This is going to get radically better when someone works out an efficient way to keep the spatial relations.


Geoff Hinton gave a talk last week at Berkeley on exactly this problem - in pixel space, object identities are all tangled up with location/pose information in a very nonlinear way; it would be nice to find a representation that actually preserves both components while disentangling them ("equivariance") instead of just throwing away all of the spatial information ("invariance", what convnets do). He's done some work on this, a lot of which is apparently unpublished, but gave a reference to one older paper covering some of the ideas: https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf


Three components:

1. Object recognition (there's dog and frisbee in the photo) 2. Object localization (Dog and frisbee's ROI in the photo) 3. Relation estimation (Based on X factors, the dog might be chasing the frisbee).

Not sure what you meant by spatial relations (localization?) but recognizing (what) and localizing (where) would be key to drawing relationships between objects.

Really impressive work but definitely not a leap.


Relevant, very similar paper input/output wise, from our resident karpathy with a detailed discussion in comments: https://news.ycombinator.com/item?id=8621658


I am very surprised it got the color of the motorcycle wrong, it seems like the easiest thing to detect...


Much more likely than Xophmeister's explanation, I think, is that the brightness of a color in an image is relative to the lighting condition. (Remember that pink is just white mixed with red.) See image B:

http://www.huevaluechroma.com/pics/3-4.jpg

This is also true to some extent with the actual hue (i.e. red versus green, rather than brightness; see image C) but less so.


It may not have a sophisticated enough vocabulary to distinguish 'pink' when 'red' was close enough. This effect is manifest in human languages which classify colours differently: say, for example, a language may have no word for 'blue', so the sky is 'green' to its speakers; it's still perceptually different to them, of course, but the lack of fidelity means it can't be communicated better than "sky green" or "grass green".


Above the image of the pink (but labled red) motorbike there is an image of a child wearig a wooly hat. The hat is red with white fluff. That hat is labled pink.

So, it knows "pink". And the motorbike isn't borderline pink / red -- it's not like hunting pinks -- it is definitely pink.

Having said all that I'm amazed at the results. It feels like I'm living in the future.


I understand there are a lot of subtleties, but it's still surprising for me that it can recognize a parked motorcycle from an awkward angle but can't distinguish normal english pink from red.

Just to be clear, I don't mean it as a criticism, it just seems to be the easier part.


There was no pink motorcycle in the training dataset most probably, and neural net may have failed to generalize colors beetween the objects.

Similar error - yellow passanger car is described as yellow school bus - school bus is more common in yellow color.


This is interesting. I think that natural language generation is a largely overlooked task outside of machine translation -- perhaps because most tasks that might require it can get away with the much stupider, much easier job of filling in templates like a form letter. It's cool to see Google attempting the real thing, on top of the image recognition.

That said, I don't expect particularly high accuracy from the composition of an image recognition system and natural language generation. The first actual demo of this is going to be a source of utter hilarity. I hope they're okay with that.



The other way around. The link you gave is a duplicate of this post actually.


Here is the HN post from yesterday that also points to the NYTimes article: https://news.ycombinator.com/item?id=8621658


OK – this post has a higher ID than the other one; I assumed they were assigned sequentially.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: