Improving YouTube video thumbnails with deep neural nets (youtube-eng.blogspot.com)
99 points by jplevine on Oct 9, 2015 | 25 comments



I wonder if they included sexy images in their negative training sets -- many videos accrue millions of views (and ad dollars) by having a few frames of cleavage interspersed with other (often derivative) footage.

It would be great if their algorithm picked a thumbnail that reflected the entire video, not just a few frames specifically chosen to game people's compulsive clicking.


Most of those are just thumbnails manually selected by the uploader. After uploading, YT gives you 3-5 thumbnails you can choose from.

Also, partnered accounts are allowed to upload custom thumbnails (which can be any image, not necessarily even a screenshot from the video).


Not just partnered accounts. My YouTube account lets me upload custom thumbnails, and I'm certainly not partnered. I have maybe a couple dozen videos with maybe a few hundred views among them all.


Is this definitely not algorithmic? I've been noticing for a while that videos might have an incidental flash of cleavage and then that is used as the thumbnail. I'd always wondered if this was arising "naturally" somehow (people pausing that scene perhaps?)

Based on the type of video, I'd discounted manual intervention. Though if people can just upload any image, I'm now surprised they're not all like this.


> After uploading, YT gives you 3-5 thumbnails you can choose from.

Can you pick an arbitrary video frame, or only one of the suggested thumbnails?


It automatically captures 3 different thumbnails (I guess using the algorithm in OP) and lets you select any 1 of those 3.


I presume they use the image selection as training data too—if not that seems like awfully low hanging data fruit.


Many videos seem to have completely arbitrary thumbnails which are not from the video. Most of the Epic Rap Battles videos, for example.

Perhaps this option 'unlocks' after you reach a certain subscriber count.


As mentioned upthread:

> partnered accounts are allowed to upload custom thumbnails (which can be any image, not necessarily even a screenshot from the video).


They stated that the negative training set was constructed by randomly sampling frames from the video.
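As a rough sketch of that setup (not the actual pipeline from the post): positives would be the thumbnails uploaders actually chose, negatives random frames from the video, and a binary classifier trained on the labeled pairs. The file path and sample count below are made up.

    import random
    import cv2  # OpenCV, used here just to grab frames

    def sample_random_frames(video_path, n_samples=20, seed=0):
        # Grab frames at uniformly random positions to serve as negatives.
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        rng = random.Random(seed)
        frames = []
        for idx in sorted(rng.sample(range(total), min(n_samples, total))):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames

    # Positives: thumbnails uploaders chose. Negatives: random frames, labeled 0.
    negatives = [(f, 0) for f in sample_random_frames("some_video.mp4")]  # path is hypothetical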

If someone wants to game the thumbnails, they will just manually select the thumbnail to use; and there are too many legitimate use cases for this ability for YouTube to remove it.


> If someone wants to game the thumbnails, they will just manually select the thumbnail to use; and there are too many legitimate use cases for this ability for YouTube to remove it.

Many channels I watch carefully select an iconic frame from the video to serve as the thumbnail, or construct an artificial thumbnail that provides useful information about the type and subject of the video. Manual will frequently produce better results than automatic for a good-quality channel.


Is there a way YouTube could alter the "view count" to only include views where 100% of the video has been watched? May help cut down on videos with misleading thumbnails and/or titles.


> Is there a way YouTube could alter the "view count" to only include views where 100% of the video has been watched?

You wouldn't want to require 100%, as many people stop when a video starts rolling credits, or when it switches to a screen using annotations to link to other videos. But 50-75% would work well as a threshold to count "views".
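Something like this toy version of the idea, where the 60% cutoff is just a guess at a reasonable threshold:

    def qualified_views(watch_times_s, video_length_s, threshold=0.6):
        # Count only sessions that covered at least `threshold` of the video.
        return sum(1 for t in watch_times_s if t / video_length_s >= threshold)

    # A 300 s video watched for 290 s, 40 s and 200 s -> 2 qualified views.
    print(qualified_views([290, 40, 200], video_length_s=300))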


A better way to filter those out algorithmically would be to simply look at the thumbs-up vs thumbs-down ratio.

The ones with misleading titles/thumbnails often have far more down-votes than up-votes, yet YouTube keeps ranking them among the most recommended/relevant results (I guess Google prefers click-throughs over user satisfaction).


Both explicit signals (thumbs up/down) and implicit ones (clicking away / closing the window) may count toward quality.

There are other confusing cases. I watch a lot of long-form videos, some too long to view in a single session, many of which I download for offline viewing (yt-download). I've been quite actively dissuaded from either publicly rating videos, or even linking to YouTube itself on my primary social channel (G+) given the Anschluss forced-marriage between YouTube, G+, and what had once been individual and separate accounts (similar logic applies to Google Play, and I've taken to "registering" my Android devices under randomly generated usernames).

For videos I particularly like, I may reference them, but only specific portions which I skip to, view, and then close. That's far less than a 100% view, but still significant.

It's not that I'm opposed to providing appropriateness and quality data to YouTube. I absolutely give massive shits about who they share that data with, and how. The "make it all public" default is utterly fucked in the head.

I think Google are starting to realise that.


> A better way to filter those out algorithmically would be to simply look at the thumbs-up vs thumbs-down ratio.

> The ones with misleading titles/thumbnails often have far more down-votes than up-votes

Especially once the total votes pass a certain threshold. Below a certain threshold, any activity makes something interesting; you wouldn't want to let a handful of downvotes bury something early on (as in, 4 upvotes and 6 downvotes). But once you hit the hundreds or thousands of votes, the ratio should take over.
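One simple way to get that behaviour (not necessarily what YouTube does) is a smoothed ratio, where a fixed prior dominates while votes are scarce and the observed ratio takes over as votes accumulate; the prior constants here are arbitrary:

    def approval_score(ups, downs, prior_ups=50, prior_total=100):
        # With few votes the 0.5 prior dominates; with hundreds or
        # thousands of votes the real up/down ratio takes over.
        return (ups + prior_ups) / (ups + downs + prior_total)

    print(approval_score(4, 6))      # ~0.49: a handful of downvotes barely moves it
    print(approval_score(400, 600))  # ~0.41: at volume the ratio dominates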


It looks like they prefer images with a few large faces near the center of the frame. That's probably the right answer for social media. (Plus a cat recognizer.) Used on news footage, you probably get the talking head rather than the news event.


We can only guess at how the NN ranks images, but it looks to me like it prefers frames with high entropy in certain regions.
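For what it's worth, that hunch is easy to eyeball on a few frames: split a grayscale frame into a grid and compute per-cell Shannon entropy (the grid size here is arbitrary):

    import numpy as np

    def region_entropy(gray, grid=4):
        # Returns a grid x grid array of intensity entropies (in bits).
        h, w = gray.shape
        out = np.zeros((grid, grid))
        for i in range(grid):
            for j in range(grid):
                cell = gray[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                hist, _ = np.histogram(cell, bins=256, range=(0, 256))
                p = hist / hist.sum()
                p = p[p > 0]
                out[i, j] = -(p * np.log2(p)).sum()
        return out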


There's an outside company that was working on this: Neon Labs (https://www.neon-lab.com/).

Their insight is that there are not only images that are "high-quality", but also images that are positive, and positive images get more clicks than a merely decent one. I wonder if that information is encoded in the RNN in some way.

(This is where I'd normally rant about RNNs and other ML techniques hiding this information from their creators by locking it up inside the black box, but I'll save that for another day.)


They've got to be training on more inputs than mentioned. For example, is one timestamp (or a cluster of nearby timestamps) in the video linked externally and generating traffic? Grab the frames from that time period and run them through the quality classifier; there might be iconic frames in that section that people are looking for.

Are people re-watching a small segment of the video? Try classifying individual frames from that segment or just before it. Of course, those are often action moments with smeared motion and compression artifacts, so they may not yield a quality thumbnail.
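A sketch of the candidate-frame part of that idea (score_frame stands in for whatever quality classifier is in use; the window and step sizes are arbitrary):

    import cv2

    def candidate_frames(video_path, hot_times_s, window_s=2.0, step_s=0.5):
        # Pull frames from a small window around each high-traffic timestamp.
        cap = cv2.VideoCapture(video_path)
        frames = []
        for t in hot_times_s:
            offset = -window_s
            while offset <= window_s:
                cap.set(cv2.CAP_PROP_POS_MSEC, max(0.0, t + offset) * 1000.0)
                ok, frame = cap.read()
                if ok:
                    frames.append((t + offset, frame))
                offset += step_s
        cap.release()
        return frames

    # best_time, best_frame = max(candidate_frames("video.mp4", [93.0, 412.5]),
    #                             key=lambda tf: score_frame(tf[1]))  # score_frame: hypothetical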

These ideas also only come into play when a video has been live for a while, after the uploader has initially picked a thumbnail. Maybe a "We have some new thumbnail suggestions for you, take a look" alert or message?


So, in an article about image processing, why not include nice big beautiful images, that get even bigger when you click on them?

I click on the low detail inline images, and they stay the same disappointing size and reveal no further detail.

They're all, like, 600px x 200px? Am I being greedy for wanting gigantic images, upwards of 3000px wide?

I suppose it is an article about thumbnails, after all, so maybe I shouldn't be so surprised.


Seeing this run through an equivalent of the deep dream visualizer could be really interesting -- what _are_ people looking for in thumbnails? I'm having difficulty imagining what features would even be relevant in such a situation.


I'm guessing: "sharpness" of image, good saturation, presence of (smiling?) human faces, non-human mammals facing the camera, bare human skin (?)

(I agree that'd be cool.)
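Two of those guessed features are cheap to compute directly, e.g. sharpness as variance of the Laplacian and mean HSV saturation (thresholds and weighting omitted):

    import cv2

    def sharpness(bgr):
        # Higher variance of the Laplacian ~ sharper image.
        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()

    def mean_saturation(bgr):
        # Average S channel in HSV, scaled to [0, 1].
        hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
        return float(hsv[:, :, 1].mean()) / 255.0

    # Face presence could be checked with cv2.CascadeClassifier plus a Haar cascade.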


Meanwhile, I still can't edit a playlist while playing it.

edit: to put it constructively - there's still simpler stuff to fix that would improve UX and match user patterns, isn't there?


When you have a big system, the most consistent argument against working on one thing is that you should be working on something else. This is true for everything in the system, because everyone has a different opinion about what that something else is.

For example: why should YouTube spend time on playlist playback when it could instead work on automatic categorization? Content creators have to manually create playlists, even when they number their videos sequentially. By that logic, YouTube shouldn't waste time on playlist editing when it could be doing the right thing automatically.



