What is the content of the training set like, that DALL·E 2 has all this info? Are there people out there spending thousands of hours just tagging or writing descriptions of training images?
And does DALL·E just receive the image as-is with its description, or does it get more info (e.g. "this part of the image is the dog")?
I don't rename it; I release a new, updated dataset. However, each dataset is a superset of the previous one, so you can always reconstruct the old one. Thus 'Danbooru2018', 'Danbooru2019', 'Danbooru2020', 'Danbooru2021', etc. have exact, well-defined meanings that never change. If you don't happen to have a copy of Danbooru2018 sitting around, you can just download Danbooru2021, unpack the Danbooru2018 metadata from the archive tarball & delete the post-Danbooru2018 images, and you now have a bit-for-bit copy of Danbooru2018.
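To make the superset idea concrete, here is a rough sketch of the "delete the newer images" step. It is not the dataset's actual tooling: the function name `images_to_delete`, the path layout, and the assumption that each file is named after its numeric post ID are all hypothetical, chosen only for illustration.

```python
from pathlib import PurePosixPath


def images_to_delete(newer_image_paths, older_post_ids):
    """Return paths that exist in the newer release but not in the older one.

    Assumption (hypothetical layout): each file stem is the numeric post ID,
    e.g. "0003/3003.jpg" belongs to post 3003.
    """
    older = set(older_post_ids)
    doomed = []
    for path in newer_image_paths:
        post_id = int(PurePosixPath(path).stem)
        if post_id not in older:
            doomed.append(path)
    return doomed


# Toy data: the older release knows posts 1001 and 2002 only.
newer = ["0001/1001.jpg", "0002/2002.png", "0003/3003.jpg"]
older_ids = [1001, 2002]
print(images_to_delete(newer, older_ids))  # ['0003/3003.jpg']
```

In practice the older ID list would come from the older release's metadata, and you would want to check the result before actually deleting anything.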
Also, if you want to just discuss the general concept of boorus (there are many beyond Danbooru), it'd probably be better to invoke Safebooru https://safebooru.org/ which is what it sounds like.
+ Flickr, a huge AI-friendly database of human-captioned images
And the DALL·E architecture can also handle masking, where you initialize only part of the latent space with noise and initialize the other parts from a starting image. The video on their website shows examples of that being used to replace a pet on a chair.
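As a toy illustration of that masked initialization (definitely not OpenAI's code; the function name `init_masked_latent` and the tiny 2x3 "latent" are made up for the example), the idea is roughly: wherever the mask is set, start from noise, and everywhere else copy the starting image.

```python
import random


def init_masked_latent(start, mask, rng):
    """Keep `start` where mask is 0; draw Gaussian noise where mask is 1.

    `start` and `mask` are same-shaped nested lists. This is a hypothetical
    stand-in for a real latent tensor, for illustration only.
    """
    return [
        [rng.gauss(0.0, 1.0) if m else s for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(start, mask)
    ]


start = [[0.5, 0.5, 0.5], [0.2, 0.2, 0.2]]   # values from the starting image
mask = [[0, 1, 1], [0, 0, 1]]                # 1 = region to regenerate
latent = init_masked_latent(start, mask, random.Random(0))
print(latent)
```

A real sampler would then run its denoising steps on this latent; how the unmasked region is kept fixed during those steps is not something this sketch tries to show.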
> + Flickr, a huge AI-friendly database of human-captioned images
You're right, I forgot to mention them. Their metadata is great, and most importantly photos have their licenses tagged and many of them are CC0 (including mine).
Are the content and the content tags that great, though? I don't content-tag my own photos there, and when I've tried to comparison-shop cameras, all the popular images are over-edited /r/shittyhdr art…
Metadata seems really interesting though. You'd think a visual search AI would want to know the white balance EXIF tags on a photo, so it knows if a yellow object is actually yellow or just under a streetlight.