Hacker News

Little crafty trick: grab the flickr dataset (yfcc100m) and snag photos tagged as whatever you want. It's like a janky google image search. I've put together datasets of airplanes, bicycles, dogs, etc. this way.

It's not entirely accurate, but it's good enough. Within a few hours you can have a pretty large dataset of whatever you want, really. (Yay for massive dataset plus user tags.)

You can snag a copy from my server, if you want. (Warning: It's a direct link to a 54GB json file.) https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json

I have a script called "janky-image-search" that returns random results from searching this dataset. Here are a few random hits for `janky-image-search ceiling`:

https://www.flickr.com/photos/35981213@N00/9374577304

https://farm3.staticflickr.com/2084/2513539756_7a768de44c.jp...

http://www.flickr.com/photos/96453841@N00/7077625757/

https://www.flickr.com/photos/28742299@N04/4683245369/

etc.

The quality is hit or miss, but the hit rate seems better than 70%.

EDIT: Here's a gallery of the first 100 hits for "ceiling": https://cdn.gather.town/storage.googleapis.com/gather-town.a...

But as you can see, it's not effortless. Most of those ceilings are old. So it depends what you want. It's why labeled data is worth billions of dollars (scale.ai et al).




Super neat!

(Small detail: the first 100 hits for ceilings are all _interesting_ ceilings? Nothing like the kind of ceilings usually encountered on a video chat app?)


Why would anyone post something boring on a photography site? There's a bar for quality on flickr; it's not twitter.


Yeah, flickr might not be a good dataset for training on boring images. Maybe we need a high-quality repository of bad photos.


Of course, the point is that it may not be good data for training a model.


I was clarifying, since the above poster was using question marks.


I found that people don't upload misdirected ceiling photos and videos to YouTube and Flickr. How unfortunate. Yes, it's better than nothing.


Now you've made me want to create a random-ceilings dataset. I spent about an hour trying to think of some way to get you a bunch of un-curated ceilings, to no avail.

If you ended up using the ceilings your users upload directly to your app, it may be that there wasn't any better solution than the one you came up with.


Perhaps if there's a database of those 360 photo globes you could crop out the upward-pointing part?


That dataset will work well for detecting nudity in Zoom calls from the Sistine Chapel :)


Before I download 54GB from your server, can you tell me the structure of the JSON?

What's a good tool to search it without 64GB of free memory?


My janky-image-search script looks like this:

  # usage: janky-image-search TAG [extra egrep args]
  tag="${1}"
  shift 1
  curl -fsSL https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json | jq '{user_tags, machine_tags, description, title, item_download_url, item_url}' -c | egrep -i "\"($tag)\"" "$@"
Basically, it trades bandwidth for memory. And hetzner has free bandwidth.

Here's curl -fsSL https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json | head -n 1, which should give you all the info about the structure. Or most of it.

  {"date_taken": "2013-08-10 11:05:54.0", "item_farm_identifier": 3, "item_url": "http://www.flickr.com/photos/9315487@N04/9497940823/", "user_tags": [], "title": "IMG_7216", "item_extension_original": "jpg", "latitude": null, "user_nickname": "meowelk", "date_uploaded": 1376370010, "accuracy": null, "machine_tags": [], "item_server_identifier": 2856, "description": null, "item_secret_original": "15f4257122", "user_nsid": "9315487@N04", "license_name": "Attribution-NonCommercial-ShareAlike License", "capture_device": "Canon EOS 7D", "item_marker": false, "item_id": 9497940823, "longitude": null, "item_download_url": "http://farm3.staticflickr.com/2856/9497940823_0c0d854111.jpg", "license_url": "http://creativecommons.org/licenses/by-nc-sa/2.0/", "item_secret": "0c0d854111"}
I just dump those to disk and then use `jq` to extract the item_download_url.
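If you'd rather search a saved copy than re-curl 54GB each run, the file is one JSON object per line, so you can stream it with constant memory. A rough Python sketch (field names taken from the sample record above; the filename is just whatever you saved the dump as):

```python
import json

def search_tags(path, tag):
    """Stream an NDJSON dump line by line, yielding item_download_url
    for records whose user_tags or machine_tags contain `tag`
    (case-insensitive). Memory use stays flat no matter the file size."""
    tag = tag.lower()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            tags = (rec.get("user_tags") or []) + (rec.get("machine_tags") or [])
            if any(t.lower() == tag for t in tags):
                yield rec["item_download_url"]
```

Same idea as the curl | jq | egrep pipeline, just trading disk for bandwidth instead of the other way around.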

FWIW, the full ceiling search finished with 9,325 results. Here's the first 1k.

https://cdn.gather.town/storage.googleapis.com/gather-town.a...


Thanks for the information. So the 54 gigabytes is just plain text JSON? How big is the full image dataset?


Good question. It's 100m images, but I haven't tried to work out how large all of them are.

We downloaded a subset of 3 million images, which is apparently 379.77 GiB. So a linear extrapolation would be (100/3*379.77) = ~12,659 GiB for the full 100m images.

~12.4 TiB really isn't too bad. It's massive, yes, but ImageNet-21k is 1.2TB.
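For reference, here's that extrapolation spelled out (numbers from the comment above; assumes storage scales linearly with image count):

```python
# Linear size extrapolation from the 3M-image subset to the full 100M.
subset_images = 3_000_000
subset_gib = 379.77
total_images = 100_000_000

full_gib = subset_gib / subset_images * total_images  # ~12,659 GiB
full_tib = full_gib / 1024                            # ~12.4 TiB
print(f"{full_gib:,.0f} GiB ~= {full_tib:.1f} TiB")
```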


Wow, I immediately recognized that first one as the ceiling of a Washington DC Metro station.


I don’t need it, but that’s really awesome of you!





