Little crafty trick: grab the Flickr dataset (YFCC100M) and snag photos tagged as whatever you want. It's like a janky Google image search. I've put together datasets of airplanes, bicycles, dogs, etc. this way.
It's not entirely accurate, but it's good enough. Within a few hours you can have a pretty large dataset of whatever you want, really. (Yay for massive dataset plus user tags.)
(Small detail: the first 100 hits for ceilings are all _interesting_ ceilings? Nothing like the kind of ceilings usually encountered on a video chat app?)
Now you've made me want to create a random-ceilings dataset. I spent about an hour trying to think of some way to get you a bunch of un-curated ceilings, to no avail.
If you ended up using the ceilings uploaded to your app directly by your users, it's possible there was no better solution than the one you came up with.
Good question. It's 100m images, but I haven't tried to find how large all of it is.
We downloaded a subset of 3 million images, which is apparently 379.77 GiB. So a linear extrapolation would be (100/3*379.77) = ~12,659 GiB for the full 100m images.
12TB really isn't too bad. It's massive, yes, but imagenet 21k is 1.2TB.
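For anyone who wants to redo that extrapolation with their own subset, here's a minimal sketch. It only assumes what's stated above (3 million images took 379.77 GiB, the full set is 100m) and that average image size in the subset is representative of the whole thing, which is a guess. The function name `extrapolate_size_gib` is made up for illustration.

```python
def extrapolate_size_gib(sample_count, sample_size_gib, total_count):
    """Linearly scale the size of a downloaded subset up to the full dataset.

    Assumes the subset's average image size matches the full dataset's.
    """
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return sample_size_gib * total_count / sample_count


if __name__ == "__main__":
    # Numbers from the comment above: 3M images = 379.77 GiB, 100M total.
    full_gib = extrapolate_size_gib(3_000_000, 379.77, 100_000_000)
    print(f"~{full_gib:,.0f} GiB, or about {full_gib / 1024:.1f} TiB")
```

Dividing by 1024 puts the estimate a bit above 12 TiB, which is where the "12TB" ballpark comes from.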
You can snag a copy from my server, if you want. (Warning: It's a direct link to a 54GB json file.) https://battle.shawwn.com/sdc/f100m/yfcc100m_dataset.json
I have a script called "janky-image-search" that returns random results from searching this dataset. Here are a few random hits for `janky-image-search ceiling`:
https://www.flickr.com/photos/35981213@N00/9374577304
https://farm3.staticflickr.com/2084/2513539756_7a768de44c.jp...
http://www.flickr.com/photos/96453841@N00/7077625757/
https://www.flickr.com/photos/28742299@N04/4683245369/
etc.
The quality is hit or miss, but more than 70% of the hits seem to be on target.
EDIT: Here's a gallery of the first 100 hits for "ceiling": https://cdn.gather.town/storage.googleapis.com/gather-town.a...
But as you can see, it's not effortless. Most of those ceilings are old. So it depends what you want. It's why labeled data is worth billions of dollars (scale.ai et al).
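To give a rough idea of how a tag search like this can work, here's a minimal sketch. It is not the actual `janky-image-search` script. It assumes the metadata is available as one JSON object per line, and the field names `tags` and `url` are hypothetical (I'm not claiming the real file is laid out that way). It uses reservoir sampling so it can pull a uniform random handful of matches in one pass without holding a huge file in memory.

```python
import json
import random


def janky_image_search(lines, tag, k=5, seed=None):
    """Return up to k random URLs whose record carries the given tag.

    `lines` is any iterable of JSON strings (e.g. an open file).
    The "tags" and "url" keys are assumed for illustration only.
    """
    rng = random.Random(seed)
    wanted = tag.lower()
    sample = []  # the reservoir of chosen URLs
    seen = 0     # number of matching records so far
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        tags = [t.lower() for t in record.get("tags", [])]
        if wanted not in tags:
            continue
        seen += 1
        if len(sample) < k:
            sample.append(record["url"])
        else:
            # Keep each match with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = record["url"]
    return sample


if __name__ == "__main__":
    # Tiny made-up records standing in for the real metadata.
    demo = [
        json.dumps({"url": "https://example.com/1", "tags": ["ceiling", "church"]}),
        json.dumps({"url": "https://example.com/2", "tags": ["dog"]}),
        json.dumps({"url": "https://example.com/3", "tags": ["Ceiling"]}),
    ]
    for url in janky_image_search(demo, "ceiling", k=2):
        print(url)
```

With a real file you'd pass the open file handle instead of a list, e.g. `with open(path) as f: janky_image_search(f, "ceiling")`. Exact tag matching is the simplest choice here; it's also part of why the results are hit or miss, since user tags are noisy.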