I found it curious that this bot is really bad at recognizing apes: chimpanzees and gorillas specifically.
I fed it a lot of the images from a Google image search for these animals, and more often than not it either didn't recognize anything or considered them bears.
I don't mean to offend, but I'm left wondering if the creators of image recognition services disincentivize their neural nets from recognizing something as an ape, gorilla or chimpanzee so as to avoid the same mistake Google made when it falsely recognized black people as gorillas [1].
Wow. 10 years ago it would have been seen as a comical blunder by a stupid AI. Something to be fixed, for sure, but not a Serious Social Issue by any means. Nowadays, it apparently warrants several follow-up articles in the mainstream media, "social" commentary, and 3,309 retweets.
From the blog post:
“The bias of the Internet reflects the bias of society,” she said.
In some cases - yes, but this one seems more like society deliberately projecting human motivations onto a primitive algorithm and jumping to conclusions about what its errors "really" mean.
-
For people who disagree, here is a scenario you might want to consider. Imagine that you built an image tagging service. Imagine that someone found a glitch in your service that they consider offensive. Imagine them tweeting about it (before or instead of contacting you directly) and getting a similar kind of reaction, complete with extensive social commentary and media coverage. A nice, big crowd of people using your company and your service as a convenient example of things-that-are-wrong-with-our-society. How would you feel in that case?
The issue here is that race relations in the U.S. are so convoluted and have so much history that it's literally impossible to tell what would be considered "racist" without a broad understanding of the culture and a comprehensive list of past racial slurs and grievances.
Witness the KFC ad (https://www.youtube.com/watch?v=ZaIhf41ctkM) which was broadly labeled as "racist", despite the fact that the 'black people love fried chicken' stereotype is (as far as I know) only a U.S. construction.
> The issue here is that race relations in the U.S. are so convoluted and have so much history that it's literally impossible to tell what would be considered "racist" without a broad understanding of the culture and a comprehensive list of past racial slurs and grievances.
I understand, but isn't the logical conclusion that one cannot make a piece of technology (like an image caption bot) that is unaware of, for instance, such complex racial relations? And if so, should technological progress really be hampered by people's sense of outrage, even in the absence of malicious intent?
The logical (to me) conclusion is that people need to unbunch their panties and stop looking for things to be offended about.
I don't think it's reasonable for anyone to expect an AI system, at our current level of technology, to have a complete understanding of every nuance of human pique to the degree where it will never do anything which could be interpreted as offensive by anyone.
Hell, that's a far higher bar than we humans can hope to meet in today's 'outrage culture'.
Yes, I think people overreacted a bit. But part of the backlash was because if these algorithms were better trained on a diverse set of faces, they might not have made that mistake. I think that's a fair criticism.
> if these algorithms were better trained on a diverse set of faces, they might not have made that mistake.
Is this assumption based on something specific? I.e. are there good reasons to believe that they weren't trained on a diverse set of people and that such training would prevent this error from happening?
People in general look very much like gorillas. They are close cousins after all. Telling humans and other apes apart can't be very easy for a poor, overworked neural network.
Also, why are people offended by gorillas or apes? Imagine being mislabeled as a bear or a giraffe; would people get as offended? What about being identified as a parrot?
I get where you're coming from, but black people being compared to apes is kind of an age old racist trope, so it is understandable that people would be unhappy about it.
Being mislabeled as a bear, on the other hand, does not generally carry the same negative connotations as calling people apes, so I suspect that this is why CaptionBot is erring on this side of the classification.
I'm aware people may want to _use_ it as a racist tool, but I fail to understand that transformation. We're all apes[1], the offender and the target/victim alike. So more than anything, it shows ignorance on the part of the victimizer/bigot.
I suppose it's like calling someone a Neanderthal, but with racial implications. Interestingly, we're learning Neanderthals were not as cognitively lacking as perhaps some supposed.
It's like saying, hey, you are hairy... Right, we're all hairy (bald or not) with very few exceptions. It's a weird thing.
[1] In the Indonesian language, people and apes share the "orang" root: orang, orangutan, orang nakal, etc.
I don't think there is as much to understand as you make it out to be. People feel insulted when called apes, because it is frequently used as an insult.
It may well be taxonomically correct to say that humans as well as apes are primates and members of the Hominidae family, but that doesn't make it any less insulting.
Just because something is technically correct doesn't mean it isn't dehumanizing and suitable as an insult. I'm sure nobody would like to be called a sack of meat either.
Still, this is a machine that's doing the labelling, and one would have to believe someone purposely programmed it to do that, rather than the engineers simply lacking the foresight to anticipate a possible misrecognition.
C'mon, there's a long history of calling black people monkeys or apes as a way to remove their humanity. It's willful ignorance to ignore that history.
Imagine this tech matures and is incorporated into police bodycams: when an officer confronts a subject with objects in their hands, it might be able to confidently estimate the probability that the object is a firearm, with better reliability than the police themselves.
While we're here, let's go the full way and set up a proveable and public way to train a robocop, and I'd trust that more than a human cop. The awkward moment when AIs have more brains than cops (at least under the US system).
Likely before we get to robocops, the "robocops" will be integrated into people who have proven risky, based on previously known behavior in addition to social signals.
So Jane, a truant with convictions for petty theft or battery, gets off with probation if she agrees to embed her own personal "robocop". Yes, invasion of privacy, etc. But the alternative for her would be time in the pen, for example. So in this case, people become their own robocops, which turn the host in to the authorities if certain conditions are met (e.g., they engage in previously restricted activities).
I think this is more likely than a roving robotic cop which looks out for misdeeds.
God only knows. I guess if he shows up at school and people say, "hey, you look just like that Microsoft kid!", I'd feel slightly guilty. Seems unlikely, and I'd have reasonable grounds for a C&D.
I uploaded a photograph of a bunch of snow piled on top of a round table[0], which looks a lot like a marshmallow to the human eye. But it came back with "I am not really confident, but I think it looks like a polar bear lying in the snow." Not terrible :)
First of all, congratulations on a) the science (built on the shoulders of giants...) and b) the accessibility/interface and service.
Wondering if you plan to open up a caption API of any sort? I could definitely use something like this. If you want training feedback, that could be added as part of the API as well; I'd be willing to provide it for some images. So if you do add a training-feedback API, please make it optional. Something like the sketch below is roughly what I have in mind.
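(Purely illustrative — the endpoint URL and JSON fields here are invented, not a real API:)

    # Hypothetical sketch of a caption API -- the endpoint and fields
    # are made up for illustration, not an actual Microsoft service.
    import requests

    def caption_image(path: str) -> str:
        """Upload an image and get back a caption."""
        with open(path, "rb") as f:
            resp = requests.post("https://example.com/v1/caption",  # hypothetical
                                 files={"image": f})
        resp.raise_for_status()
        return resp.json()["caption"]

    def send_feedback(image_id: str, stars: int) -> None:
        """The optional training-feedback call suggested above (also hypothetical)."""
        requests.post("https://example.com/v1/feedback",  # hypothetical
                      json={"image_id": image_id, "stars": stars})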
Have you considered an 'abstract' type of version? As in, rather than simply describing the image as the caption, take the information that would be used, fill it in a MAD-LIBS[1] style setup, and see how that turns out? Maybe it's just me but a surrealist CaptionBot could be pretty fun.
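As a toy illustration of what I mean (the templates and tags below are made up; a real version would pull the slots from whatever the recognizer detected):

    # Toy sketch of the surrealist MAD-LIBS idea: drop recognized tags
    # into fixed templates. All templates and slot values are invented.
    import random

    TEMPLATES = [
        "A {adjective} {subject} dreaming of {object} near {place}.",
        "The {subject} that mistook {object} for {place}.",
    ]

    def surreal_caption(slots: dict) -> str:
        """Fill a randomly chosen template with tags extracted from the image."""
        return random.choice(TEMPLATES).format(**slots)

    print(surreal_caption({"adjective": "melting", "subject": "giraffe",
                           "object": "a train ticket", "place": "the beach"}))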
Thank you very much for putting your work out there for people to have fun with (and criticise!).
CaptionBot seems to have a bit of trouble with simple two-colour outline drawings. In one case I saw it even get the colour wrong ("red and white" for a black and red image).
Is that something you would have expected?
Also, I notice it doesn't do very well with character recognition either. Is that surprising?
It would be great if you added a text box under the stars so I can (optionally) tell you why I didn't give you five stars.
For example, I uploaded a picture of my daughter as an infant, and it said, "This is a baby on a bed and he's :D", which I gave four stars because it said "he" instead of "she". But honestly, you really have no way of knowing that. :)
I worked in Image Processing and Vision for a long time. If you'd told me 2 years ago that something like this could be possible, I would have laughed you out of the room. But in the last year or so, I've been stunned beyond belief at how well these networks work.
I got the same "not really confident, but I think it's a cell phone" response. Mine was a cluster of little buildings with gardens on top; yours is a ketchup bottle. Maybe 'cell phone' is the default response when it doesn't know.
It's only a matter of time before a repeat of Microsoft's last AI experiment (Tay), when the Internet teaches CaptionBot all of the positions in the Kama Sutra.
I had to share this one, I sent a screenshot of the Yahoo homepage from a while ago (yeah, I had that hanging around...) with the main image being of Donald Trump.
The caption guess was "I am not really confident, but I think it's a television screen and he seems ."
My lab is trying to do something similar for answering questions about images. We have a significantly better system than the current system that's online, but we haven't had a chance to update it yet: http://askimage.org
It is far from perfect, but is near state-of-the-art. I'm guessing it won't hold up to HN.
It is almost as smart as a child. I uploaded a vacation photo of Notre-Dame, and the caption was "A person standing in front of a church"... which is close to my son's "mommy standing in front of that church we went to".
Yep, on the stuff it recognizes. It recognized a picture I tested taken from the Grand Tetons as "a lake with a mountain in the background", which was quite correct, but also kind of generic.
On the other hand, it described a picture of Grand Prismatic Spring in Yellowstone as "a train with smoke coming out of the water." Which also is kind of like the crazy things kids sometimes say when they see something new.
As far as I know, this was the first research to do the super cool thing of combining multiple neural nets trained on different data:
"Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image."
AND
"Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding"
Android users: do you get a lack-of-memory/resources error when you try to take a pic instead of selecting one from the gallery? It's a silly bug where the camera activity kills the browser activity that called it.
Google: we cannot move forward with 'Progressive Webapps' if you guys don't fix these silly bugs.
Take-a-picture-and-do-something is a common feature of webapps, like CaptionBot here!
A couple of days ago, I think there was a post about Google doing a lot of development and research around creating systems that understand/categorize/comment on/recognize images.
One thing I took away from reading about it is that Google has billions of images to train it with from all their different ventures.
Does Microsoft have access to anywhere near the same numbers of pictures?
The Chinese company has structured a deal with Getty to take over licensing outside of China.
However, while having access to world-class photography is great, the images that will (probably) be the most interesting for Microsoft to recognize are selfies, other "crowd"-created amateur photography, and possibly memes.
Personally, I'd think it would be cool to see if the bot could traverse the Getty collection and recognize the photographer of an image it had not seen before. Why yes, this is Leibovitz.
I wish services like this would be released without any kind of moral filter on the subjects it classifies.
I uploaded a picture of Michelangelo's David to the service to see what CaptionBot would say about it, and I got back the message "I think this may be inappropriate content so I won't show it."
It feels like there are two sides to this: either recognition is amazing, or it's really, really far off.
It seems that after it generates the caption, the result needs to be fed into some semantic pipeline, so that a caption like "a plane sitting on a book" would be rejected as not making sense and the system would try again.
After all, it really depends on the training data. If a picture of a train ticket was never seen by the NN, how could it answer correctly? However, it should try to reduce the answer to some more meaningful info: for example, instead of "two giraffes near a tree", ideally it would have said it's text and attempted OCR.
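Something like this toy re-ranking step, say (everything below is a stub I invented — the plausibility scorer and OCR hook are stand-ins, not any real service's pipeline):

    from typing import Callable

    def choose_caption(candidates: list[tuple[str, float]],
                       plausibility: Callable[[str], float],
                       run_ocr: Callable[[], str],
                       threshold: float = 0.5) -> str:
        """Pick the best caption whose combined score clears the threshold;
        otherwise assume the image is mostly text and hand off to OCR."""
        scored = [(conf * plausibility(text), text) for text, conf in candidates]
        best_score, best_text = max(scored)
        if best_score >= threshold:
            return best_text
        return "Text detected: " + run_ocr()

    # Stubbed usage:
    candidates = [("two giraffes near a tree", 0.31),
                  ("a plane sitting on a book", 0.12)]
    print(choose_caption(candidates,
                         plausibility=lambda s: 0.2,       # stand-in language model
                         run_ocr=lambda: "TRAIN TICKET"))  # stand-in OCR engine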
I gave it a photo of a Cylon [0] and it said "I am not really confident, but I think it's a close up of a motorcycle." Close but not really there; Google's reverse image search has a better detection in this case. As an aside, it'd have been really cool if it said it was a picture of a toaster.
Pretty impressive - gave it a few profile photos and it did surprisingly well, correctly identifying "A couple walking on a beach at sunset," "a man looking out a window", etc.
It struggled with wildlife photos - a pack of arctic wolves was "a sheep standing in the snow", and penguins swimming was "a bird flying over a body of water" (close but no cigar).
(I tried the spacex landing pictures too - it correctly identified "a boat in a large body of water" but ignored the ten-story rocket above said boat.)
My results ranged from impressive to awful. It recognized Pete Carroll with 96% confidence from a meme picture where he struts and chews gum. Then it thought a picture of the Super Bowl field before the game was boats on a table.
My photos did not do too well. My coral looks like a cake, my lizard looks like a bird, my boy fishing looks like a man next to a river, and a waterfall looks like a close-up of a rock.
I uploaded the sad Michael Jordan meme face and it responded "I think it's Michael Jordan wearing a suit and tie and he seems :(", sounds about right...
So I looked for a random photo on my phone and fed it a picture of a spot on my leg that I'm keeping an eye on. Close-up of a cat, apparently. Damn these hairy legs.
[1] http://blogs.wsj.com/digits/2015/07/01/google-mistakenly-tag...