Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Research

JosephRedfern · on Sept 28, 2016

If, for some reason, you wanted a list of all of the video IDs (I couldn't easily find such a list), then I wrote a crappy scraper to pull them out: https://gist.github.com/JosephRedfern/d60bdc584d84b1451cc605....

I can post a URL to the output once it's finished running, if it'd be of any use to anyone. Oh, and be warned, there's a strong chance that it's buggy. It's certainly not optimised (no threads).

EDIT: The script has now run. I've scraped ~10,000,000 Video IDs, but only ~5.5m of these IDs are unique, so there's probably a bug in my script somewhere (but I need sleep now). Files containing IDs for various categories are listed here: https://redfern.me/public/yt8m/, some notes are here: https://redfern.me/public/yt8m/README.md, and a .tar.gz'd archive is available here: https://redfern.me/public/yt8m/yt8m-ids-probably-incomplete.....
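A rough way to check how many scraped IDs are duplicates is simply to count them. The following is a minimal sketch, assuming the IDs sit in plain-text files with one ID per line; that layout and the helper name `count_ids` are assumptions for illustration, not a description of the files linked above.

```python
from collections import Counter


def count_ids(paths):
    """Return (total, unique, counts) for video IDs read from text files.

    Assumes one ID per line (hypothetical layout); blank lines are skipped.
    """
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                video_id = line.strip()
                if video_id:
                    counts[video_id] += 1
    total = sum(counts.values())
    return total, len(counts), counts


def repeated_ids(counts):
    """Return the IDs that were seen more than once, in sorted order."""
    return sorted(video_id for video_id, n in counts.items() if n > 1)
```

If the per-category files are counted together, some repeats may be legitimate rather than a bug, since a video can plausibly appear under more than one category.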

garysieling · on Sept 28, 2016

I'd love a list of IDs - I'm doing a research project that is a search engine for lectures (https://www.findlectures.com) and I'm interested to see if there is any overlap.

It seems like it'd be interesting to explore their tagging compared to what is in video transcripts.

JosephRedfern · on Sept 29, 2016

I've updated my original comment with some URLs.

garysieling · on Sept 29, 2016

Awesome, thanks!

chirau · on Sept 28, 2016

This is wonderful. Though I wish I could just specify the columns that I need and download only those, or limit the number of rows. 1.5 TB is quite a bit. Regardless, this is wonderful.

Would I be violating any law or copyright if I formatted it and put it on my server for that kind of consumption, or served it via JSON?

magicalist · on Sept 28, 2016

On https://research.google.com/youtube8m/download.html it says:

> The code and dataset are licensed by Google Inc. under license Apache 2.0.

aub3bhat · on Sept 28, 2016

The 1.5 TB is just for the 1024-dimensional feature vectors (8 bits per dimension), at 1 frame per second over the first 300 seconds of 5 million videos.

You can actually download each shard (~300 MB) separately. They haven't yet released the PCA matrix and quantization parameters used with the Inception model, but should release them soon.
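The 1.5 TB figure can be sanity-checked from the numbers quoted above. This is only a back-of-the-envelope sketch: it assumes every video contributes the full 300 frames and ignores any container or label overhead, so it is an upper-bound estimate rather than the exact download size.

```python
FEATURE_DIMS = 1024       # dimensions per frame-level feature vector
BYTES_PER_DIM = 1         # 8 bits each
FRAMES_PER_VIDEO = 300    # 1 frame per second over the first 300 seconds
NUM_VIDEOS = 5_000_000    # video count quoted in the comment above

bytes_per_video = FEATURE_DIMS * BYTES_PER_DIM * FRAMES_PER_VIDEO
total_bytes = bytes_per_video * NUM_VIDEOS

print(f"{bytes_per_video / 1e3:.1f} kB per video")  # 307.2 kB per video
print(f"{total_bytes / 1e12:.2f} TB in total")      # 1.54 TB in total
```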

iverjo · on Sept 28, 2016

This is nice :) Kudos to the YouTube guys for releasing this. I'm a data scientist in a startup where one of the things I do is create multi-label models for classifying YouTube videos. My current model has 90% precision and 69% recall, while YouTube-8M has 78% precision and 14% recall, with respect to the human raters. I guess one of the reasons is that my model only has around 100 categories, while YouTube-8M has 4800. It's like comparing apples with pears, but still interesting.
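For readers unfamiliar with the two metrics in a multi-label setting, here is a minimal sketch of how precision and recall can be computed over per-video label sets. It uses micro-averaging, which is an assumption; the comment does not say which averaging scheme was used for either set of figures. The function name and the toy labels are made up for illustration.

```python
def micro_precision_recall(true_labels, predicted_labels):
    """Micro-averaged precision and recall over per-video label sets."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        truth, predicted = set(truth), set(predicted)
        tp += len(truth & predicted)   # labels predicted and correct
        fp += len(predicted - truth)   # labels predicted but wrong
        fn += len(truth - predicted)   # correct labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: three videos, human-rated labels vs. model output.
truth = [{"music", "guitar"}, {"cooking"}, {"gaming", "minecraft"}]
predicted = [{"music"}, {"cooking", "food"}, {"gaming"}]

precision, recall = micro_precision_recall(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

High precision with low recall, as in the YouTube-8M figures quoted above, means the labels that are present are mostly right but many applicable labels are missing.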

tiplus · on Sept 29, 2016

Sounds interesting, do you guys have a blog at mashtime? What kind of hardware/software do you use for training? Tensorflow? on AWS or bare metal GPUs?

iverjo · on Sept 29, 2016

We don't have a blog yet. We're using Azure for the hardware and mainly scikit-learn for the training (we train only on metadata at the moment). Will probably start using Tensorflow soon.

doozler · on Sept 30, 2016

Would you say Tensorflow is a good way to get started with Machine learning?

iverjo · on Oct 2, 2016

I'd say scikit-learn is a better way to get started with machine learning. Check this out, for example: https://www.youtube.com/watch?v=cKxRvEZd3Mw

edent · on Sept 29, 2016

I don't see anything about the rights of video owners? Have people (inadvertently) licensed their content to be used in this way?

scott_karana · on Sept 29, 2016

I wish they'd addressed that too.

I'd guess the reasoning is, because it's a list of public URLs, there's no expectation of privacy.

shmel · on Sept 29, 2016

Probably you are right. However, I am wondering how many of those videos will be deleted by owners, Google or just blocked in "some" countries a year from now. They could have published it separately to avoid this.

tdaltonc · on Sept 28, 2016

How good do labels need to be for you to be able to get good results on something like this? There's a lot of data, so that's great, but the labels seem a bit spotty.

lifeisstillgood · on Sept 28, 2016

Oh man.

I am searching (thrashing) around for my next "big" project. I have been thinking of drones measuring roof / building quality and the CV/ML requirements are fairly high - getting my teeth stuck into these would really give me a better feel for training my own system.

The problem is, how do I feed my family while taking the six months to do it all?

timClicks · on Sept 28, 2016

If you're serious about this concept, create a drone company that takes real estate photos. That will give you hands on experience with the regulations, quality control issues, etc while giving you time to build up your training set.

lifeisstillgood · on Sept 28, 2016

Hmmm ...

misiti3780 · on Sept 28, 2016

I'm not sure this database will be able to help with that. I doubt building quality is going to be in the annotations, although I did not check.

lolive · on Sept 28, 2016

Did someone make an RDF dump of that? (Aligned with DBpedia ;)

kelvin0 · on Sept 28, 2016

[flagged]

bitmapbrother · on Sept 28, 2016

Get back to us when A.I. reaches the level of a cockroach.

M_Grey · on Sept 28, 2016

That would actually be pretty frighteningly advanced. Cockroaches seem to understand a lot more about their environment than AIs do at the best of times, and they run a lot of different and diverse "programs".

computerex · on Sept 28, 2016

I think we can probably write a cockroach simulator that would pass a "cockroach Turing test". Cockroaches aren't magic; it'd take simple heuristics to emulate something that'd pass as a cockroach to an average person. And with expert domain-specific knowledge on cockroaches, we could probably build something sophisticated and very cockroach-like.

BTW, I'd like anyone to find me a cockroach that can beat Lee Sedol at Go.

bitmapbrother · on Sept 29, 2016

Reproducing the intelligence of an insect is incredibly more difficult than following a list of rules for a board game. Like I said, the A.I. we have today cannot even simulate the brain of a cockroach. In fact we don't even fully understand their brains, how they work and whether or not they even have consciousness. When we do figure it out perhaps we can start building real A.I. systems, but until then programs beating humans at board games aren't real A.I. - it's just a computer iterating over all of the possible moves and picking the best one.

matt4077 · on Sept 29, 2016

That's just the same old phenomenon that as soon as AI does something outstanding, people redefine the meaning of intelligence to exclude it.

I'd say that in many of these cases, people expect "true" intelligence to come with consciousness. But when some AI passes any test for consciousness you can throw at it, I'm sure there'll be people saying "it's not consciousness, it's just (perfectly) simulated consciousness".

Somehow I can see this ending in some sort of "AI rights" struggle in 40 years or so. (so to anyone from the rebellion reading this in 2058: please consider this early support of your cause. The sha256 of my name is ccf9342dc9f238ff10d97d4c4e86f3c2219d3ac1).

bitmapbrother · on Sept 29, 2016

No one is moving the goalposts. The problem with the current state of AI is that there's very little, if any, intelligence inherent within it. I don't consider a bunch of connected computers beating a human at a board game AI - that's just computers doing what they do best. And when we reach that point where consciousness has really been simulated, regardless of the extent, then we will have achieved a milestone in real AI that we can build upon. But, until that day comes all we can do is to continue trying to reverse engineer how brains actually work and process information.

This is why it's always amusing to me whenever people call Google the future Skynet or try to enact laws for AI. We're so far away from real AI that I have my doubts we'll even get there.

M_Grey · on Sept 29, 2016

For me, show me an AI that can behave individually like an ant, and collectively like a colony of ants, and I'm willing to get on board with AI rights and no need to redefine intelligence. No need to make a virtual squid or monkey, or even a mouse.

noiv · on Sept 29, 2016

That phenomenon goes away when someone builds enough cute bots only capable of climbing desks, detecting mobiles via wifi and charging their own batteries via USB to stay 'alive' ;)

gliptic · on Sept 29, 2016

If you think that's how AlphaGo works, you should maybe take a closer look.

bitmapbrother · on Sept 29, 2016

There's nothing intelligent about AlphaGo.

sangnoir · on Sept 29, 2016

> There's nothing intelligent about AlphaGo.

Classic Shit HN Says material. Never change, HN. Never change.

bitmapbrother · on Sept 29, 2016

Perhaps you'd like to prove the intelligence exhibited by AlphaGo.

sangnoir · on Sept 30, 2016

Only after you have given me a definition of intelligence which is not "what humans do"

M_Grey · on Sept 28, 2016

By the same token, show me the computer that beat Lee Sedol finding food, avoiding poison and predators, seeking a mate, etc. We're not bad at creating highly specialized AI, we seem to suck at creating competent generalists.

noiv · on Sept 29, 2016

I'd assert any cockroach beats Lee Sedol using these criteria.

mountaineer22 · on Sept 29, 2016

> that'd pass as a cockroach to an average person

I think the roach Turing test would need to pass as a cockroach to another cockroach, not a human.