Yes, even within a particular video there are lots of frames where the act is implied, not directly shown, like a close-up of other people's faces. Karpathy et al. showed they could still learn from the sports video database even without removing random crowd shots or announcer shots.
I think the quality of the data influences the result, and hand-crafting the dataset is what led to 95% accuracy on new instances.
Not really. 1 hour of video in 36 seconds is 100 hours of video per hour of computation. Assuming you go with a cluster of higher-end graphics cards, you could pretty easily perform 100x better. That's 10,000 hours of video processed per hour of computation. I don't know the size of the pornhub back catalog, and I'm scared to search since I'm at work right now, but even if it's hundreds of millions of hours you could go through the whole thing in a couple of years on a single cluster, and much faster if you throw more hardware at it.
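A quick back-of-envelope in Python; the 36-seconds-per-hour figure is from the project, while the 100x cluster speedup and the 200-million-hour catalog size are assumptions for illustration:

```python
# Back-of-envelope throughput estimate (assumed numbers, not benchmarks).
SECONDS_PER_HOUR = 3600

single_gpu_rate = SECONDS_PER_HOUR / 36       # hours of video per compute-hour
cluster_rate = single_gpu_rate * 100          # hypothetical 100x GPU cluster
catalog_hours = 200_000_000                   # guess at "hundreds of millions"

compute_hours = catalog_hours / cluster_rate
compute_years = compute_hours / (24 * 365)

print(single_gpu_rate)   # 100.0
print(cluster_rate)      # 10000.0
print(round(compute_years, 1))
```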
I think that's doable. I'll be adding an autotag mode. I've been thinking about other attributes I can detect, from race to hair color to number of participants.
Crowdsourcing is good, except it can't tag new videos that no one has seen yet.
This program can also be viewed as a general framework for classifying video with a Caffe model, using batching and threading in C++. By replacing the weights, the model definition, and the mean file, it can immediately be used to edit videos with other classes without recompiling.
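The batching-and-threading pattern described above can be sketched roughly like this (a Python stand-in for the C++/Caffe pipeline; `classify_batch` is a hypothetical stub where the real tool would run a Caffe forward pass):

```python
# Sketch: one thread decodes frames, another classifies them in batches.
import queue
import threading

BATCH_SIZE = 4

def classify_batch(frames):
    # Stand-in for a Caffe forward pass over a batch of frames.
    return [f"class_for_{f}" for f in frames]

def decode_frames(source, frame_queue):
    # Producer: decode frames and hand them to the classifier thread.
    for frame in source:
        frame_queue.put(frame)
    frame_queue.put(None)  # sentinel: no more frames

def classify_loop(frame_queue, results):
    # Consumer: accumulate frames, run the model once per full batch.
    batch = []
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        batch.append(frame)
        if len(batch) == BATCH_SIZE:
            results.extend(classify_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        results.extend(classify_batch(batch))

frames = [f"frame{i}" for i in range(10)]
q = queue.Queue(maxsize=8)
results = []
producer = threading.Thread(target=decode_frames, args=(frames, q))
consumer = threading.Thread(target=classify_loop, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(len(results))  # 10
```

Decoupling decode from inference this way keeps the GPU fed while the CPU decodes the next frames, which is where most of the throughput comes from.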
Seriously, though, synthesis using a recognition model can be a good reality check to remind us of the shortcomings of the model's "understanding" of the domain.
https://github.com/ryanjay0/miles-deep/raw/master/images/pre...