For their image resizing tasks, I wonder if they've tried anything more complex than simply cropping around points of interest, something like seam carving [0]. I imagine it would be pretty cheap to run a bunch of different algorithms on an image and then A/B test the results on Amazon Mechanical Turk.
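For reference, the core of seam carving is just a dynamic program over an energy map. Here's a minimal sketch of one vertical-seam removal step, assuming a grayscale numpy array and a plain gradient-magnitude energy (the original paper discusses fancier energy functions):

    import numpy as np

    def remove_vertical_seam(img):
        h, w = img.shape
        # Energy: absolute gradient magnitude (a common, simple choice).
        gy, gx = np.gradient(img.astype(float))
        energy = np.abs(gx) + np.abs(gy)
        # DP: cost[i, j] = min cumulative energy of a seam ending at (i, j).
        cost = energy.copy()
        for i in range(1, h):
            left = np.r_[np.inf, cost[i - 1, :-1]]
            right = np.r_[cost[i - 1, 1:], np.inf]
            cost[i] += np.minimum(np.minimum(left, cost[i - 1]), right)
        # Backtrack the cheapest seam from the bottom row up.
        seam = np.zeros(h, dtype=int)
        seam[-1] = np.argmin(cost[-1])
        for i in range(h - 2, -1, -1):
            j = seam[i + 1]
            lo, hi = max(j - 1, 0), min(j + 2, w)
            seam[i] = lo + np.argmin(cost[i, lo:hi])
        # Drop one pixel per row along the seam.
        mask = np.ones((h, w), dtype=bool)
        mask[np.arange(h), seam] = False
        return img[mask].reshape(h, w - 1)

Repeat until you hit the target width; transpose the image for horizontal seams. Cheap enough that running it alongside cropping/scaling variants for a Turk comparison seems feasible.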
Interesting. From the Wikipedia article: "A 2010 review of eight image retargeting methods found that seam carving produced output that was ranked among the worst of the tested algorithms. It was, however, a part of one of the highest-ranking algorithms: the multi-operator extension mentioned above (combined with cropping and scaling)."
That looks amazing, and relatively easy to implement. However, it seems that Mitsubishi owns a patent on it, so maybe we will start seeing it used in __ years when it expires.
This is very interesting, but the real question is: how do you test which approach is better?
For example, in the text detection case there are almost unlimited combinations of transforms that you can put together. Usually you use some hybrid of gut feeling and results to decide, but I bet Netflix has enough data to make that call in a more principled way.
Would be awesome to hear about that. How do you create a labeled dataset? How exactly do you measure which approach is better? Is there a perceptual element to it, or is it all quantitative?
Edit: here's the related money quote from the retargeting paper linked in the other comment:
"In terms of objective measures for retargeting, our results show that we are still a long way from imitating human perception. There is a relatively large discrepancy between such measures and the subjective data we collected, and in fact, the most preferred algorithm by human viewers, SV, received low ranking by almost all automatic distance measures."
I don't think it actually talks much about how it does it at "scale". How expensive is it to perform these operations? Are images cropped dynamically as they are requested, or do they pre-process the images and cache them somewhere?
Did they do anything clever to parallelize the process? What underlying technologies do they use...
The authors may want to consider how much of this work could be done easily and effectively with deep learning. For content-based search and image similarity, even simple, pre-trained convnets will likely crush the histogram-based approaches you have here.
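To make that concrete, here's a hedged sketch of the usual recipe with an off-the-shelf ResNet-50 from torchvision as a feature extractor; the model choice, preprocessing, and cosine similarity are my assumptions, not anything from the article:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet50(pretrained=True)
    model.fc = torch.nn.Identity()  # drop the classifier, keep the 2048-d embedding
    model.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            # L2-normalize so cosine similarity is a plain dot product
            return torch.nn.functional.normalize(model(x), dim=1)

    sim = (embed("a.jpg") * embed("b.jpg")).sum().item()

With normalized embeddings, nearest-neighbor search over 2M images is a solved problem (brute force or any ANN index), which is where this tends to beat hand-built histogram features.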
Just run your images through Google Cloud Vision to do the face detection and text detection. With 2M images, it will be cheaper than the amount of dev time you spent here, and you'll get excellent quality.
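Something like this with the google-cloud-vision Python client (exact field names may differ between client versions; the filename is made up):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("poster.jpg", "rb") as f:
        image = vision.Image(content=f.read())

    # Face and text detection are separate annotate calls.
    faces = client.face_detection(image=image).face_annotations
    texts = client.text_detection(image=image).text_annotations

    for face in faces:
        # bounding_poly gives the vertices of the detected face box
        print([(v.x, v.y) for v in face.bounding_poly.vertices])
    if texts:
        print(texts[0].description)  # the full detected text block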
They explain that, in this case, not all of the images they want are faces, so you'd have to train your own model on "interesting regions" (though there is some work in that area). Part of the challenge there is generating all the labels for what the interesting regions are. This way they don't need to generate labels, at least.
YouTube did something similar for 'interesting thumbnails' last year with deep nets (many uploaders do not specify a good thumbnail preview), and reported that it gave a nice performance boost.
My thoughts as well. With recent advances, image similarity and content-based search should be fairly simple to do with pre-trained convnets, and probably more effective.
[0] https://en.wikipedia.org/wiki/Seam_carving, https://www.youtube.com/watch?v=6NcIJXTlugc, https://www.youtube.com/watch?v=AJtE8afwJEg