
What is the content of the training set like, that DALL·E 2 has all this info? Are there people out there spending thousands of hours just tagging or writing descriptions of training images?

And does DALL·E just receive the image as-is with its description, or does it get more info (e.g. "this part of the image is the dog")?




> What is the content of the training set like, that DALL·E 2 has all this info?

Web crawling. (https://arxiv.org/pdf/2103.00020.pdf 2.2)
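
Concretely, the training data is just an enormous pile of (image, caption/alt-text) pairs scraped from the web. A rough sketch of what iterating over such pairs could look like (the TSV file and its two-column layout here are made up for illustration, not OpenAI's actual pipeline):

    # Sketch: iterate over scraped (image URL, alt-text) pairs stored in a TSV.
    # "pairs.tsv" and its layout are hypothetical, not any real crawl's format.
    import csv
    import io
    import urllib.request

    from PIL import Image

    def iter_pairs(tsv_path="pairs.tsv"):
        with open(tsv_path, newline="", encoding="utf-8") as f:
            for url, caption in csv.reader(f, delimiter="\t"):
                with urllib.request.urlopen(url) as resp:
                    img = Image.open(io.BytesIO(resp.read())).convert("RGB")
                yield img, caption  # the model only ever sees this (image, text) pair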

> Are there people out there spending thousands of hours just tagging or writing descriptions of training images?

"Yes", Google Image Search, Wikipedia captions, Danbooru are all that in some sense.

The biases should be obvious if you think about it for a minute, and CLIP has even more biases since they removed all kinds of NSFW content, etc.

> And does DALL·E just receive the image as-is with its description, or does it get more info (e.g. "this part of the image is the dog")?

Just the image and its description, but segmentation is a big research area too, and it looks like FB has a reproduction with some of that mixed in (see the sketch after the link below).

https://arxiv.org/abs/2203.13131
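
To make the difference concrete, here is a toy sketch of the two kinds of training example (the field names are illustrative, not any paper's actual format):

    # Toy sketch of the two kinds of training example discussed above;
    # field names are illustrative, not any paper's actual format.
    caption_only = {
        "image": "dog_on_chair.jpg",
        "caption": "a corgi sitting on an armchair",
    }
    with_segmentation = {
        "image": "dog_on_chair.jpg",
        "caption": "a corgi sitting on an armchair",
        "segmentation": "dog_on_chair_mask.png",  # per-pixel labels: dog, chair, background
    }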


Made a mistake and googled "Danbooru" at the office


Should have noticed your comment first; I did the same. Thankfully nothing too explicit in the initial page of search results (no images, etc)


gwern keeps renaming the dataset so I don't know what to call it!

https://www.gwern.net/Danbooru2021

It is a partially (highly) NSFW dataset though, which is probably the only way to get so much accurate volunteer tagging.


I don't rename it; I release a new, updated dataset. However, each dataset is a superset of the previous one, so you can always reconstruct the old one, and thus 'Danbooru2018', 'Danbooru2019', 'Danbooru2020', 'Danbooru2021', etc. have exact, well-defined meanings that never change. If you don't happen to have a copy of Danbooru2018 sitting around, you can just download Danbooru2021, unpack the Danbooru2018 metadata from the archive tarball, delete the post-Danbooru2018 images, and you now have a bit-for-bit copy of Danbooru2018.
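
In other words, the reconstruction is just a metadata-driven filter, roughly like this (paths and the "id" field are illustrative; see the dataset page for the actual archive layout):

    # Sketch of the idea only: keep the images whose IDs appear in the older
    # metadata and delete the rest. Paths and the "id" field are illustrative.
    import json
    import pathlib

    with open("danbooru2018-metadata.jsonl", encoding="utf-8") as f:
        old_ids = {str(json.loads(line)["id"]) for line in f}

    for img in pathlib.Path("danbooru2021/images").rglob("*.jpg"):
        if img.stem not in old_ids:  # post-Danbooru2018 image
            img.unlink()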

Also, if you want to just discuss the general concept of boorus (there are many beyond Danbooru), it'd probably be better to invoke Safebooru https://safebooru.org/ which is what it sounds like.


+ Flickr, a huge AI-friendly database of human-captioned images

And the DALL·E architecture can also handle masking, where you initialize only part of the latent space with noise and initialize the other parts with a starting image. The video on their website shows examples of that, e.g. replacing a pet on a chair.
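
In pseudocode, the masking trick is roughly the following (a conceptual sketch only, not OpenAI's actual implementation; denoise_step stands in for the real model call):

    # Conceptual sketch of masked generation as described above; denoise_step
    # is a stand-in for the real diffusion model call.
    import numpy as np

    def denoise_step(x):
        return 0.9 * x  # placeholder for one model denoising step

    def inpaint(image_latents, mask, steps=50):
        """mask == 1 where new content is generated, 0 where the original is kept."""
        x = np.where(mask, np.random.randn(*image_latents.shape), image_latents)
        for _ in range(steps):
            x = denoise_step(x)                   # hypothetical model call
            x = np.where(mask, x, image_latents)  # re-pin the unmasked region
        return x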


> + Flickr, a huge AI-friendly database of human-captioned images

You're right, I forgot to mention them. Their metadata is great, and most importantly, photos have their licenses tagged, and many of them are CC0 (including mine).

Are the content and content tags that great, though? I don't content-tag my own photos there, and when I've tried to comparison-shop cameras, all the popular images are over-edited /r/shittyhdr art…

Metadata seems really interesting though. You'd think a visual search AI would want to know the white balance EXIF tags on a photo, so it knows if a yellow object is actually yellow or just under a streetlight.
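
For what it's worth, those fields are easy to pull out when the camera wrote them; a quick Pillow sketch ("photo.jpg" is a placeholder, and tag coverage varies a lot between cameras):

    # Read white-balance-related EXIF fields with Pillow; "photo.jpg" is a
    # placeholder, and many photos won't carry these tags at all.
    from PIL import Image, ExifTags

    exif_ifd = Image.open("photo.jpg").getexif().get_ifd(0x8769)  # Exif sub-IFD
    tags = {ExifTags.TAGS.get(k, k): v for k, v in exif_ifd.items()}
    print(tags.get("WhiteBalance"))  # 0 = auto, 1 = manual
    print(tags.get("LightSource"))   # e.g. daylight, tungsten, flash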


Thank you for the info!
