I think it's rescaling all images to fit the training size. If that's the case, then an image with very different dimensions gets distorted and the network gets confused. Try something with a height/width ratio similar to the samples.
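For what it's worth, a quick way to see the difference (assuming they do a naive resize, which is just my guess; the target size here is made up) is to compare a stretched resize with an aspect-preserving pad in Pillow:

    from PIL import Image

    TARGET = (256, 256)  # hypothetical training size

    img = Image.open("photo.jpg").convert("RGB")

    # Naive resize: distorts anything with a very different aspect ratio.
    stretched = img.resize(TARGET)

    # Aspect-preserving alternative: shrink, then pad onto a square canvas.
    img.thumbnail(TARGET)                      # in-place, keeps aspect ratio
    padded = Image.new("RGB", TARGET, (0, 0, 0))
    padded.paste(img, ((TARGET[0] - img.width) // 2,
                       (TARGET[1] - img.height) // 2))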
I think that when you're not expected to publish papers to rationalize what you're doing, you're free to use any ugly hack that improves your results (a "kitchen sink" approach where you just combine the results of lots of unrelated techniques: extracting words from the URL, using the URL to fetch related textual content from the website, etc.). This gives private companies a competitive advantage over research institutions: their only purpose is to "make things work", not to introduce new techniques and offer interesting insights about them.
Lots of companies and teams are exploring deep neural networks for all kinds of applications. The Rekognition API is the only one I've found that provides an open API service right now. You can train a classifier using your own images, but you need to create an account and upload your images through their web application.
http://kephra.de/pix/Snoopy/thump/IMG_20130822_135928_640x48... <- here it thought it's a speedboat ... well, my boat is fast, but it's not a speedboat, it's a sailing boat. It offered several more boat types, but not a plain sailing boat. Interestingly, the last suggestion, at only 1%, could be considered right: "dock, dockage, docking facility".
I tried some other images from the lifestyle section of my homepage, but it looks as if the system has never seen a sewing machine before, as it gives "Low recognition confidence" and no tags.
It seems strange that they would include in their set of example images a picture of the most famous mausoleum in the world without it being tagged with mausoleum or tomb or anything like that.
If I uploaded my own picture of the Taj Mahal and it told me it was a Mosque, I wouldn't be surprised, and I'd probably be reasonably impressed. The dome and minarets do rather give that impression, and I wouldn't really expect a computer to be able to tell the difference.
The reason I find it odd is that I would expect the first example on a demo to be carefully chosen to show off the system in the best light. It would be one that has perfect or near-perfect tagging. Maybe later on, I would show the shortcomings with a tricky image like this.
Are there actually any image feature detectors and descriptors involved (like blob, edge and texture detectors) or is this solely based on artificial neural networks?
Interestingly, it has been shown that the result from some neural networks is equivalent to classification with a set of predefined filters. These filters could be considered feature descriptors. See this talk from CVPR: http://techtalks.tv/talks/plenary-talk-are-deep-networks-a-s....
AFAIK, it's using a deep neural network, which means the inputs are basically pixel values (possibly normalized), and all feature detection etc. is done in the layers of the network.
Yep, they try to learn an image's high-level features by training an autoencoder (a transform that takes an image and tries to reproduce the same image) via an hourglass-shaped multi-layer network. Here is a very readable paper by Hinton himself that describes the approach:
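Not their code, obviously, but a minimal sketch of that hourglass idea (layer sizes and the PyTorch framing are my own; Hinton's paper pretrains the layers with stacked RBMs rather than training end to end like this):

    import torch
    import torch.nn as nn

    # "Hourglass" autoencoder: encode down to a small code, then decode back
    # to the original pixels; trained to reproduce its own input.
    class Autoencoder(nn.Module):
        def __init__(self, n_pixels=28 * 28, n_code=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_pixels, 256), nn.ReLU(),
                nn.Linear(256, n_code),          # the narrow "waist"
            )
            self.decoder = nn.Sequential(
                nn.Linear(n_code, 256), nn.ReLU(),
                nn.Linear(256, n_pixels), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    loss_fn = nn.MSELoss()  # reconstruction error against the input itself
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)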
Could it maybe be worthwhile to augment the data with simple image features? E.g. the human visual system is believed to rely on high-level/top-down as well as local/bottom-up features (although that might simply be because of the need to compress things for the low nerve count in the optic nerve).
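Something like this is what I mean by augmenting with bottom-up features (Sobel edge maps here, purely as an illustration):

    import numpy as np
    from PIL import Image
    from scipy import ndimage

    # Append simple bottom-up features (edge maps) as extra input channels
    # next to the raw pixels.
    gray = np.asarray(Image.open("photo.jpg").convert("L"), dtype=np.float32) / 255.0
    edges_x = ndimage.sobel(gray, axis=1)
    edges_y = ndimage.sobel(gray, axis=0)
    augmented = np.stack([gray, edges_x, edges_y], axis=-1)  # H x W x 3 input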
A deep net (to be specific: a deep belief network, which is a series of stacked RBMs, not stacked denoising autoencoders; there is a difference) can usually benefit from a moving-window approach (slicing up an image into chunks) to simulate a convolutional net. This can help a deep net generalize better.
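A rough sketch of the moving-window idea (window size and stride are arbitrary):

    import numpy as np

    def sliding_windows(img, size=8, stride=4):
        """Slice a 2-D image into overlapping patches (flattened), so each
        patch can be fed to the net separately, roughly mimicking a conv
        net's local receptive fields."""
        patches = []
        h, w = img.shape
        for top in range(0, h - size + 1, stride):
            for left in range(0, w - size + 1, stride):
                patches.append(img[top:top + size, left:left + size].ravel())
        return np.array(patches)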
That being said: even deep learning requires some sort of feature engineering at times (even if it's pretty good with either Hessian-free training or pretraining).
The main thing with images is making sure they're scaled properly.
The trick with deep belief networks in particular is making sure the RBMs have the right visible and hidden units (Hinton recommends Gaussian visible and rectified linear hidden units).
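For Gaussian visible units that usually means standardizing each input dimension first (zero mean, unit variance); a quick sketch of what I mean:

    import numpy as np

    # For Gaussian visible units, standardize each input dimension
    # before training the first RBM.
    def standardize(X):
        mean = X.mean(axis=0)
        std = X.std(axis=0) + 1e-8   # avoid division by zero for constant pixels
        return (X - mean) / std, mean, std

    X = np.random.rand(1000, 784).astype(np.float32)  # dummy batch of flattened images
    X_scaled, mean, std = standardize(X)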