Hacker News

Little crafty trick: grab the flickr dataset (yfcc100m) and snag photos tagged as whatever you want. It's like a janky google image search. I've put together datasets of airplanes, bicycles, dogs, etc. this way.

It's not entirely accurate, but it's good enough. Within a few hours you can have a pretty large dataset of whatever you want, really. (Yay for massive dataset plus user tags.)

You can snag a copy from my server, if you want. (Warning: It's a direct link to a 54GB json file.) https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json

I have a script called "janky-image-search" that returns random results from searching this dataset. Here are a few random hits for `janky-image-search ceiling`:

https://www.flickr.com/photos/35981213@N00/9374577304

https://farm3.staticflickr.com/2084/2513539756_7a768de44c.jp...

http://www.flickr.com/photos/96453841@N00/7077625757/

https://www.flickr.com/photos/28742299@N04/4683245369/

etc.

The quality is hit or miss, but the hit rate seems better than 70%.

EDIT: Here's a gallery of the first 100 hits for "ceiling": https://cdn.gather.town/storage.googleapis.com/gather-town.a...

But as you can see, it's not effortless. Most of those ceilings are old. So it depends what you want. It's why labeled data is worth billions of dollars (scale.ai et al).




Super neat!

(Small detail: the first 100 hits for ceilings are all _interesting_ ceilings? Nothing like the kind of ceilings usually encountered on a video chat app?)


Why would anyone post something boring on a photography site? There's a bar for quality on flickr; it's not twitter.


Yeah, flickr might not be a good dataset for training on boring images. Maybe we need a high-quality repository of bad photos.


Of course, the point is that it may not be good data for training a model.


I was clarifying, since the above poster was using question marks.


I found that people don't upload misdirected ceiling photos and videos to YouTube and Flickr. How unfortunate. Yes, it's better than nothing.


Now you've made me want to create a random-ceilings dataset. I spent about an hour trying to think of some way to get you a bunch of un-curated ceilings, to no avail.

If you ended up using the ceilings your users upload directly to your app, it may be that there wasn't any better solution than the one you came up with.


Perhaps if there's a database of those 360 photo globes you could crop out the upward-pointing part?


That dataset will work well for detecting nudity in Zoom calls from the Sistine Chapel :)


Before I download 54GB from your server, can you tell me the structure of the JSON?

What's a good tool to search it without 64GB of free memory?


My janky-image-search script looks like this:

  # usage: janky-image-search TAG [extra egrep args]
  tag="${1}"
  shift 1
  curl -fsSL https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json | jq '{user_tags, machine_tags, description, title, item_download_url, item_url}' -c | egrep -i "\"($tag)\"" "$@"
Basically, it trades bandwidth for memory. And hetzner has free bandwidth.

Here's curl -fsSL https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json | head -n 1, which should give you all the info about the structure. Or most of it.

  {"date_taken": "2013-08-10 11:05:54.0", "item_farm_identifier": 3, "item_url": "http://www.flickr.com/photos/9315487@N04/9497940823/", "user_tags": [], "title": "IMG_7216", "item_extension_original": "jpg", "latitude": null, "user_nickname": "meowelk", "date_uploaded": 1376370010, "accuracy": null, "machine_tags": [], "item_server_identifier": 2856, "description": null, "item_secret_original": "15f4257122", "user_nsid": "9315487@N04", "license_name": "Attribution-NonCommercial-ShareAlike License", "capture_device": "Canon EOS 7D", "item_marker": false, "item_id": 9497940823, "longitude": null, "item_download_url": "http://farm3.staticflickr.com/2856/9497940823_0c0d854111.jpg", "license_url": "http://creativecommons.org/licenses/by-nc-sa/2.0/", "item_secret": "0c0d854111"}
I just dump those to disk and then use `jq` to extract the item_download_url.
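If you'd rather search a saved copy than re-curl 54GB each run, the file is one JSON object per line, so you can stream it with constant memory. A rough Python sketch (field names taken from the sample record above; the filename is just whatever you saved the dump as):

```python
import json

def search_tags(path, tag):
    """Stream an NDJSON dump line by line, yielding item_download_url
    for records whose user_tags or machine_tags contain `tag`
    (case-insensitive). Memory use stays flat no matter the file size."""
    tag = tag.lower()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            tags = (rec.get("user_tags") or []) + (rec.get("machine_tags") or [])
            if any(t.lower() == tag for t in tags):
                yield rec["item_download_url"]
```

Same idea as the curl | jq | egrep pipeline, just trading disk for bandwidth instead of the other way around.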

FWIW, the full ceiling search finished with 9,325 results. Here's the first 1k.

https://cdn.gather.town/storage.googleapis.com/gather-town.a...


Thanks for the information. So the 54 gigabytes is just plain text JSON? How big is the full image dataset?


Good question. It's 100m images, but I haven't tried to work out how large all of them are.

We downloaded a subset of 3 million images, which is apparently 379.77 GiB. So a linear extrapolation would be (100/3*379.77) = ~12,659 GiB for the full 100m images.

~12.4 TiB really isn't too bad. It's massive, yes, but ImageNet-21k is 1.2TB.
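For reference, here's that extrapolation spelled out (numbers from the comment above; assumes storage scales linearly with image count):

```python
# Linear size extrapolation from the 3M-image subset to the full 100M.
subset_images = 3_000_000
subset_gib = 379.77
total_images = 100_000_000

full_gib = subset_gib / subset_images * total_images  # ~12,659 GiB
full_tib = full_gib / 1024                            # ~12.4 TiB
print(f"{full_gib:,.0f} GiB ~= {full_tib:.1f} TiB")
```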


Wow, I immediately recognized that first one as the ceiling of a Washington DC Metro station.


I don’t need it, but that’s really awesome of you!





