For their image resizing tasks, I wonder if they've tried anything more complex than simply cropping around points of interest, something like seam carving [0]. I imagine it would be pretty cheap to run a bunch of different algorithms on an image and then A/B test the results on Amazon Mechanical Turk.
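For reference, the core of seam carving is just a dynamic program over an energy map. Here's a minimal sketch of one vertical-seam removal step, assuming a grayscale numpy array and a plain gradient-magnitude energy (the original paper discusses fancier energy functions):

    import numpy as np

    def remove_vertical_seam(img):
        h, w = img.shape
        # Energy: absolute gradient magnitude (a common, simple choice).
        gy, gx = np.gradient(img.astype(float))
        energy = np.abs(gx) + np.abs(gy)
        # DP: cost[i, j] = min cumulative energy of a seam ending at (i, j).
        cost = energy.copy()
        for i in range(1, h):
            left = np.r_[np.inf, cost[i - 1, :-1]]
            right = np.r_[cost[i - 1, 1:], np.inf]
            cost[i] += np.minimum(np.minimum(left, cost[i - 1]), right)
        # Backtrack the cheapest seam from the bottom row up.
        seam = np.zeros(h, dtype=int)
        seam[-1] = np.argmin(cost[-1])
        for i in range(h - 2, -1, -1):
            j = seam[i + 1]
            lo, hi = max(j - 1, 0), min(j + 2, w)
            seam[i] = lo + np.argmin(cost[i, lo:hi])
        # Drop one pixel per row along the seam.
        mask = np.ones((h, w), dtype=bool)
        mask[np.arange(h), seam] = False
        return img[mask].reshape(h, w - 1)

Repeat until you hit the target width; transpose the image for horizontal seams. Cheap enough that running it alongside cropping/scaling variants for a Turk comparison seems feasible.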
Interesting. From the Wikipedia article: "A 2010 review of eight image retargeting methods found that seam carving produced output that was ranked among the worst of the tested algorithms. It was, however, a part of one of the highest-ranking algorithms: the multi-operator extension mentioned above (combined with cropping and scaling)."
That looks amazing, and relatively easy to implement. However, it seems that Mitsubishi owns a patent on it, so maybe we will start seeing it used in __ years when it expires.
This is very interesting, but the real question is: how do you test which approach is better?
For example, in the text detection case there are almost unlimited combinations of transforms that you can put together. Usually you use some hybrid of gut feeling and results to decide, but I bet Netflix has enough data to make that call in a more principled way.
Would be awesome to hear about that. How do you create a labeled dataset? How exactly do you measure which approach is better? Is there a perceptual element to it, or is it all quantitative?
Edit: here's the related money quote from the retargeting paper linked in the other comment:
"In terms of objective measures for retargeting, our results show that we are still a long way from imitating human perception. There is a relatively large discrepancy between such measures and the subjective data we collected, and in fact, the most preferred algorithm by human viewers, SV, received low ranking by almost all automatic distance measures."
I don't think it actually talks much about how it does it at "scale". How expensive is it to perform these operations? Are images cropped dynamically as they are requested, or do they pre-process the images and cache them somewhere?
Did they do anything clever to parallelize the process? What underlying technologies do they use...
The authors may want to consider how much of this work could be done easily and effectively with deep learning. For content-based search and image similarity, even simple, pre-trained convnets will likely crush the histogram-based approaches you have here.
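To make that concrete, here's a hedged sketch of the usual recipe with an off-the-shelf ResNet-50 from torchvision as a feature extractor; the model choice, preprocessing, and cosine similarity are my assumptions, not anything from the article:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet50(pretrained=True)
    model.fc = torch.nn.Identity()  # drop the classifier, keep the 2048-d embedding
    model.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            # L2-normalize so cosine similarity is a plain dot product
            return torch.nn.functional.normalize(model(x), dim=1)

    sim = (embed("a.jpg") * embed("b.jpg")).sum().item()

With normalized embeddings, nearest-neighbor search over 2M images is a solved problem (brute force or any ANN index), which is where this tends to beat hand-built histogram features.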
Just run your images through Google Cloud Vision to do the face detection and text detection. With 2M images, it will be cheaper than the amount of dev time you spent here, and you'll get excellent quality.
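Something like this with the google-cloud-vision Python client (exact field names may differ between client versions; the filename is made up):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("poster.jpg", "rb") as f:
        image = vision.Image(content=f.read())

    # Face and text detection are separate annotate calls.
    faces = client.face_detection(image=image).face_annotations
    texts = client.text_detection(image=image).text_annotations

    for face in faces:
        # bounding_poly gives the vertices of the detected face box
        print([(v.x, v.y) for v in face.bounding_poly.vertices])
    if texts:
        print(texts[0].description)  # the full detected text block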
They explain that, in this case, not all of the images they want are faces, so you'd have to train your own model on "interesting regions" (though there is some work in that area). Part of the challenge there is generating all the labels for what the interesting regions are. This way they don't need to generate labels, at least.
YouTube did something similar for 'interesting thumbnails' last year with deep nets (many uploaders do not specify a good thumbnail preview), and reported that it gave a nice performance boost.
My thoughts as well. With recent advances, image similarity and content-based search should be fairly simple to do with pre-trained convnets, and probably more effective.
[0] https://en.wikipedia.org/wiki/Seam_carving, https://www.youtube.com/watch?v=6NcIJXTlugc, https://www.youtube.com/watch?v=AJtE8afwJEg