Deep Learning Image Classifier (toronto.edu)
149 points by adid on July 29, 2014 | 32 comments




Not too bad! It was low probability, but it did somehow recognize Mike.


Didn't give me results at all for the three images I uploaded. Might be broken.


It looks like that's the case -- none of the example images on the front page work for me.


Same for me. Tried Safari and Chrome.


I think it's rescaling all images to fit the training size. If that is the case, then an image with very different dimensions gets distorted and confuses the classifier. Try something with a height/width ratio similar to the samples (a small sketch of the idea follows below).
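
In case it helps, here is a minimal sketch of the difference between naively squashing an image to a square and resizing plus center-cropping it, using Pillow. The 224x224 target size is just an assumption (common for ImageNet-style nets); the demo's actual input size isn't stated anywhere.

    from PIL import Image

    TARGET = 224  # assumed fixed input size; the demo's real size isn't documented here

    def naive_resize(img):
        # Squash the image to a square -- anything with an unusual
        # height/width ratio gets distorted, the failure mode described above.
        return img.resize((TARGET, TARGET), Image.BILINEAR)

    def resize_and_center_crop(img):
        # Scale the shorter side to TARGET, then crop the central square,
        # so the content keeps its aspect ratio.
        w, h = img.size
        scale = TARGET / min(w, h)
        img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
        w, h = img.size
        left, top = (w - TARGET) // 2, (h - TARGET) // 2
        return img.crop((left, top, left + TARGET, top + TARGET))

    photo = Image.open("photo.jpg")  # hypothetical input file
    resize_and_center_crop(photo).save("photo_224.jpg")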


President Obama is recognized either as ...

... a mountain-bike / all-terrain-bike (http://cdn2.spiegel.de/images/image-730849-galleryV9-vuuv.jp...)

... or a rugby ball (http://cdn2.spiegel.de/images/image-730849-breitwandaufmache...)

... or a bulletproof vest (http://cdn2.spiegel.de/images/image-730849-thumb-vuuv.jpg)

I guess the implementation leaves room for improvement :)


There are several of these image classifiers now; someone should run an accuracy/speed/price comparison between them:

AlchemyAPI: http://www.alchemyapi.com/products/demo/alchemyvision/

UToronto

Rekognition: http://rekognition.com/demo/concept

Clarifai: http://www.clarifai.com/

I'm doing it myself, but I have a conflict of interest


Well, there is the ImageNet challenge (http://www.image-net.org/challenges/LSVRC/2013/results.php). I'm not sure if Alchemy or Rekognition maps to any of those teams, though.



http://rekognition.com/demo/concept

Rekognition offers a similar API, free for all developers.

It's reliable and very fast.

Check out their demo page.


I only tried one (hard) image: pizza, sandwich, and a Bloody Mary (https://imgur.com/30OgNdd). Rekognition seems to work better than the submission.

Rekognition:

7.55% fruit; 0.92% dinner; 0.88% produce; 0.87% alcohol; 0.84% sliced

Toronto:

50% American lobster, Northern lobster; 12% plate; 7% crayfish, crawfish, crawdad; 7% Dungeness crab, Cancer magister; 4% king crab, Alaska crab; 4% butcher shop, meat market; 4% grocery store, grocery; 4% pomegranate

I find this interesting because I thought Hinton's group had state-of-the-art tech. Who are these people and how do they do it?


I think that when you're not expected to publish any papers to rationalize what you're doing, you're free to use any possible ugly hack to improve your results (a "kitchen sink" approach where you just combine the results of lots of unrelated techniques, extract words from the URL, use the URL to fetch related textual content from the website, etc.). This gives private companies a competitive advantage over research institutions: their only purpose is to "make things work", not to introduce new techniques and offer interesting insight about them.


Lots of companies and teams are exploring deep neural networks for all kinds of applications. The Rekognition API is the only one I found that provides an open API service right now. You can train a classifier using your own images, but you need to create an account and upload your images through their web application.


Tried two images:

http://kephra.de/Dampf/IMG_20140620_133839_800x600.jpg <- an e-cigarette, and the classifier thought it's a fountain pen. Well, that's not bad; I've gotten that joke/question from humans too.

http://kephra.de/pix/Snoopy/thump/IMG_20130822_135928_640x48... <- here it thought it's a speedboat ... well, my boat is fast, but it's not a speedboat, it's a sailing boat. It offered several more boat types, but never a plain sailing boat. Interesting here is that the last suggestion, at only 1%, could be considered right: "dock, dockage, docking facility".

Tried some other images from the lifestyle section of my homepage, but it looks as if the system has never seen a sewing machine before, as it gives "Low recognition confidence" and no tags.


I can see how it could get speedboat from the shape of the hull.


It seems strange that they would include, in their set of example images, a picture of the most famous mausoleum in the world without it being tagged with mausoleum or tomb or anything like that.


And it is tagged 99% mosque, while it isn't one. (The building to the left of it, not in the image, is.)


If I uploaded my own picture of the Taj Mahal and it told me it was a Mosque, I wouldn't be surprised, and I'd probably be reasonably impressed. The dome and minarets do rather give that impression, and I wouldn't really expect a computer to be able to tell the difference.

The reason I find it odd is that I would expect the first example on a demo to be carefully chosen to show off the system in the best light. It would be one that has perfect or near-perfect tagging. Maybe later on, I would show the shortcomings with a tricky image like this.


Are there actually any image feature detectors and descriptors involved (like blob, edge and texture detectors) or is this solely based on artificial neural networks?


Interestingly, it has been shown that the result of some neural networks is equivalent to classification with certain predefined filters, and those filters could be considered a feature descriptor. See this talk from CVPR: http://techtalks.tv/talks/plenary-talk-are-deep-networks-a-s....


Thanks for sharing this. I enjoy Mallat's point of view. He has some similar talks on videolectures.net for anyone who's interested.


AFAIK, it's using a deep neural network, which means the inputs are basically pixel values (possibly normalized), and all feature detection etc. is done in the layers of the network.


Yep, they try to learn an image's high-level features by training an autoencoder (that is, a transform that takes an image and tries to reproduce the same image) via an hourglass-shaped multi-layer network. Here is a very readable paper by Hinton himself that describes the approach:

http://www.cs.toronto.edu/~hinton/science.pdf
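
For anyone curious what that hourglass shape looks like in code, here is a minimal sketch of the idea in PyTorch. The layer sizes are arbitrary, and this is not the demo's actual model (see the reply below, which points out it's likely a plain conv net).

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, n_pixels=784, n_hidden=256, n_code=32):
            super().__init__()
            # Encoder: wide input squeezed down to a narrow code
            # (the waist of the hourglass).
            self.encoder = nn.Sequential(
                nn.Linear(n_pixels, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_code),
            )
            # Decoder: mirror image of the encoder, reconstructing the pixels.
            self.decoder = nn.Sequential(
                nn.Linear(n_code, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_pixels), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(64, 784)                      # stand-in batch of flattened images
    loss = nn.functional.mse_loss(model(x), x)   # the target is the input itself
    loss.backward()
    opt.step()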


I'm pretty sure there's no autoencoder involved; it just looks like a vanilla conv net.

This is the implementation: http://torontodeeplearning.github.io/convnet/
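
For comparison with the autoencoder sketch above, a vanilla conv net classifier looks roughly like this (again PyTorch, illustrative only, not the architecture from the linked convnet repo):

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 56 * 56, 1000),   # 1000 ImageNet classes
    )

    logits = net(torch.rand(1, 3, 224, 224))   # one 224x224 RGB image
    probs = torch.softmax(logits, dim=1)       # class probabilities like the demo reports
    print(probs.topk(5))                       # top five guesses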


Could it maybe be worthwhile to augment the data with simple image features? E.g., the human visual system is believed to rely on high-level/top-down as well as local/bottom-up features (although that might simply be a consequence of the need to compress things for the low nerve count in the optic nerve).


A deep net (to be specific, a deep belief network, which is a series of stacked RBMs, not stacked denoising autoencoders; there's a difference) can usually benefit from a moving-window approach (slicing an image into chunks) to simulate a convolutional net. This can help a deep net generalize better; a small sketch of the idea is at the end of this comment.

That being said, even deep learning requires some feature engineering at times (even if it's pretty good with either Hessian-free training or pretraining).

The main thing with images is making sure they are scaled consistently.

The trick with deep belief networks in particular is to make sure the RBMs have the right visible and hidden units (Hinton recommends Gaussian visible, rectified linear hidden).

Happy to answer other questions as well!
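
As promised, a small numpy sketch of the moving-window idea: slicing an image into overlapping patches that a non-convolutional deep net could be fed, roughly mimicking a conv net's local receptive fields. The 32-pixel patch size and 16-pixel stride are arbitrary choices here.

    import numpy as np

    def sliding_patches(image, size=32, stride=16):
        # Yield (row, col, patch) for overlapping size x size windows.
        h, w = image.shape[:2]
        for top in range(0, h - size + 1, stride):
            for left in range(0, w - size + 1, stride):
                yield top, left, image[top:top + size, left:left + size]

    img = np.random.rand(224, 224, 3)                 # stand-in RGB image
    patches = [p for _, _, p in sliding_patches(img)]
    print(len(patches), patches[0].shape)             # 169 patches of shape (32, 32, 3)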


I think it is a convolutional network trained only with gradient descent, since the source code link points to the convnet project.


What data was it trained on?

Also can it tell you where in the image the identified object is?


This was pre-trained on ImageNet classes. You can find more information here: http://www.image-net.org


My results (yeah, a tough image) http://imgur.com/pbH52xW


From my experience with cats, "doormat" is actually pretty accurate. Damn things always dart right under my feet.



