Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Research (googleblog.com)
314 points by runesoerensen on Sept 28, 2016 | 37 comments



If, for some reason, you wanted a list of all of the video IDs (I couldn't easily find such a list), then I wrote a crappy scraper to pull them out: https://gist.github.com/JosephRedfern/d60bdc584d84b1451cc605....

I can post a URL to the output once it's finished running, if it'd be of any use to anyone. Oh, and be warned, there's a strong chance that it's buggy. It's certainly not optimised (no threads).

EDIT: The script has now run. I've scraped ~10,000,000 video IDs, but only ~5.5m of these IDs are unique, so there's probably a bug in my script somewhere (but I need sleep now). Files containing IDs for various categories are listed here: https://redfern.me/public/yt8m/, some notes are here: https://redfern.me/public/yt8m/README.md, and a .tar.gz'd archive is available here: https://redfern.me/public/yt8m/yt8m-ids-probably-incomplete.....
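
For anyone wanting to check the duplicate count themselves, here is a minimal sketch. It assumes each file holds one video ID per line, which is a guess about the layout rather than something stated above, and `count_ids` is a hypothetical helper name. If a video carries several labels it could plausibly show up in more than one category file, in which case some of the repeats would not be a bug at all.

```python
from collections import Counter

def count_ids(lines):
    """Return (total, unique, repeated) for an iterable of ID lines.

    Assumes one video ID per line; blank lines are ignored.
    """
    ids = [line.strip() for line in lines if line.strip()]
    counts = Counter(ids)
    # IDs seen more than once, mapped to how often they appeared.
    repeated = {vid: n for vid, n in counts.items() if n > 1}
    return len(ids), len(counts), repeated

# Made-up sample data standing in for the contents of the ID files.
sample = ["abc123\n", "def456\n", "abc123\n", "\n"]
total, unique, repeated = count_ids(sample)
print(total, unique, repeated)
```

To run it over real files, pass an open file handle (or a chain of several) instead of `sample`.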


I'd love a list of IDs - I'm doing a research project that is a search engine for lectures (https://www.findlectures.com) and I'm interested to see if there is any overlap.

It seems like it'd be interesting to explore their tagging compared to what is in video transcripts.


I've updated my original comment with some URLs.


Awesome, thanks!


This is wonderful. Though I wish I could just specify the columns I need and download only those, or limit the number of rows. 1.5 TB is quite a bit. Regardless, this is wonderful.

Would I be violating any law or copyright if I reformatted it and put it on my server for that kind of consumption, or served it as JSON?


On https://research.google.com/youtube8m/download.html it says:

> The code and dataset are licensed by Google Inc. under license Apache 2.0.


The 1.5 TB is just for the 1024-dimensional feature vectors (8 bits per dimension), at 1 frame per second over the first 300 seconds of 5 million videos.
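
As a rough sanity check, the arithmetic can be sketched like this. It uses only the numbers quoted in this comment and assumes every video contributes the full 300 frames, so it is an upper bound under those assumptions, not an official figure. The function name `estimated_bytes` is made up for illustration.

```python
# Back-of-the-envelope size of the frame-level features,
# using the numbers quoted in the comment above.
DIMS = 1024             # feature dimensions per frame
BYTES_PER_DIM = 1       # 8 bits each
FRAMES_PER_VIDEO = 300  # 1 frame per second, first 300 seconds
NUM_VIDEOS = 5_000_000  # "5 million videos"

def estimated_bytes(dims=DIMS, bytes_per_dim=BYTES_PER_DIM,
                    frames=FRAMES_PER_VIDEO, videos=NUM_VIDEOS):
    """Upper-bound size in bytes if every video has the full frame count."""
    return dims * bytes_per_dim * frames * videos

total = estimated_bytes()
print(f"{total / 1e12:.2f} TB")  # prints "1.54 TB"
```

Videos shorter than 300 seconds would bring the real total down a little, which is consistent with the ~1.5 TB figure.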

You can actually download each shard (~300 MB) separately. They haven't yet released the PCA matrix and quantization parameters used with the Inception model, but should release them soon.


This is nice :) Kudos to the YouTube guys for releasing this. I'm a data scientist in a startup where one of the things I do is create multi-label models for classifying YouTube videos. My current model has 90% precision and 69% recall, while YouTube-8M has 78% precision and 14% recall, with respect to the human raters. I guess one of the reasons is that my model only has around 100 categories, while YouTube-8M has 4800. It's like comparing apples with pears, but still interesting.
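
For anyone unfamiliar with how those two numbers work in the multi-label case, here is a minimal sketch of micro-averaged precision and recall over sets of labels. The averaging scheme is an assumption on my part (nothing quoted here says which one was used), and the labels and the function name `micro_precision_recall` are made up for illustration.

```python
def micro_precision_recall(predicted, actual):
    """Micro-averaged precision and recall for multi-label predictions.

    `predicted` and `actual` are parallel sequences of label sets,
    one pair per video.
    """
    tp = fp = fn = 0
    for pred, true in zip(predicted, actual):
        pred, true = set(pred), set(true)
        tp += len(pred & true)   # labels predicted and correct
        fp += len(pred - true)   # labels predicted but wrong
        fn += len(true - pred)   # correct labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: a cautious model that predicts few labels per video.
predicted = [{"music"}, {"gaming", "sports"}, {"cooking"}]
actual = [{"music", "concert"}, {"gaming"}, {"cooking", "food", "recipe"}]
print(micro_precision_recall(predicted, actual))  # (0.75, 0.5)
```

The toy data shows the same pattern as the numbers above: predicting only a few labels per video keeps precision high while recall drops, and the effect gets stronger as the label vocabulary grows.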


Sounds interesting, do you guys have a blog at mashtime? What kind of hardware/software do you use for training? Tensorflow? on AWS or bare metal GPUs?


We don't have a blog yet. We're using Azure for the hardware and mainly scikit-learn for the training (we train only on metadata at the moment). Will probably start using Tensorflow soon.


Would you say Tensorflow is a good way to get started with Machine learning?


I'd say scikit-learn is a better way to get started with machine learning. Check this out, for example: https://www.youtube.com/watch?v=cKxRvEZd3Mw


I don't see anything about the rights of video owners? Have people (inadvertently) licensed their content to be used in this way?


I wish they'd addressed that too.

I'd guess the reasoning is, because it's a list of public URLs, there's no expectation of privacy.


You're probably right. However, I wonder how many of those videos will have been deleted by their owners or by Google, or just blocked in "some" countries, a year from now. They could have published it separately to avoid this.


How good do labels need to be for you to be able to get good results on something like this? There's a lot of data, so that's great, but the labels seem a bit spotty.


Oh man.

I am searching (thrashing) around for my next "big" project. I have been thinking of drones measuring roof / building quality, and the CV/ML requirements are fairly high. Getting my teeth stuck into these would really give me a better feel for training my own system.

The problem is, how do I feed my family while taking the six months to do it all?


If you're serious about this concept, create a drone company that takes real estate photos. That will give you hands-on experience with the regulations, quality control issues, etc., while giving you time to build up your training set.


Hmmm ...


I'm not sure this database will be able to help with that. I doubt building quality is going to be in the annotations, although I did not check.


Did someone make an RDF dump of that? (Aligned with DBpedia ;)


[flagged]


Get back to us when A.I reaches that of a cockroach.


That would actually be pretty frighteningly advanced. Cockroaches seem to understand a lot more about their environment than AIs do at the best of times, and they run a lot of different and diverse "programs".


I think we can probably write a cockroach simulator that would pass a "cockroach Turing test". Cockroaches aren't magic; simple heuristics would be enough to emulate something that'd pass as a cockroach to an average person. And with expert domain-specific knowledge of cockroaches, we could probably build something sophisticated and very cockroach-like.

BTW, I'd like anyone to find me a cockroach that can beat Lee Sedol at Go.


Reproducing the intelligence of an insect is incredibly more difficult than following a list of rules for a board game. Like I said, the A.I we have today cannot even simulate the brain of a cockroach. In fact we don't even fully understand their brains, how they work and whether or not they even have consciousness. When we do figure it out perhaps we can start building real A.I systems, but until then programs beating humans at board games isn't real A.I - it's just a computer iterating over all of the possible moves and picking the best one.


That's just the same old phenomenon that as soon as AI does something outstanding, people redefine the meaning of intelligence to exclude it.

I'd say that in many of these cases, people expect "true" intelligence to come with consciousness. But when some AI passes any test for consciousness you can throw at it, I'm sure there'll be people saying "it's not consciousness, it's just (perfectly) simulated consciousness".

Somehow I can see this ending in some sort of "AI rights" struggle in 40 years or so. (so to anyone from the rebellion reading this in 2058: please consider this early support of your cause. The sha256 of my name is ccf9342dc9f238ff10d97d4c4e86f3c2219d3ac1).


No one is moving the goalposts. The problem with the current state of AI is that there's very little, if any, intelligence inherent within it. I don't consider a bunch of connected computers beating a human at a board game AI - that's just computers doing what they do best. And when we reach that point where consciousness has really been simulated, regardless of the extent, then we will have achieved a milestone in real AI that we can build upon. But, until that day comes all we can do is to continue trying to reverse engineer how brains actually work and process information.

This is why it's always amusing to me whenever people call Google the future Skynet or try to enact laws for AI. We're so far away from real AI that I have my doubts we'll even get there.


For me, show me an AI that can behave individually like an ant, and collectively like a colony of ants, and I'm willing to get on board with AI rights and no need to redefine intelligence. No need to make a virtual squid or monkey, or even a mouse.


That phenomenon goes away when someone builds enough cute bots only capable of climbing desks, detecting mobiles via wifi and charging their own batteries via USB to stay 'alive' ;)


If you think that's how AlphaGo works, you should maybe take a closer look.


There's nothing intelligent about AlphaGo.


> There's nothing intelligent about AlphaGo.

Classic Shit HN Says material. Never change, HN. Never change.


Perhaps you'd like to prove the intelligence exhibited by AlphaGo.


Only after you have given me a definition of intelligence which is not "what humans do"


By the same token, show me the computer that beat Lee Sedol finding food, avoiding poison and predators, seeking a mate, etc. We're not bad at creating highly specialized AI; we seem to suck at creating competent generalists.


I'd assert any cockroach beats Lee Sedol using these criteria.


> that'd pass as a cockroach to an average person

I think the roach Turing test would need to pass as a cockroach to another cockroach, not a human.



