The underlying data, I gather, must associate an image with a description. There aren’t millions of well-written descriptions of images. Yet they’ve accomplished this, I think, with merely the sparse text that comprises the title of a scraped JPEG, or perhaps the surrounding text on a website.
However, shortly, there will be millions of extremely detailed descriptions of new images… that is, a person puts in a detailed prompt, and if you can capture how pleased the user is with the result, you could then add that new picture — along with the user’s prompt — to your training data. Eventually, you will have millions of well-described images, which will make the system even more amazing.
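A minimal sketch of what that feedback loop might look like, purely as an assumption on my part (none of these names refer to any real system): log each generation event with the user's prompt and some satisfaction signal, then harvest the well-liked pairs as new training data.

```python
# Hypothetical sketch of a prompt-image feedback loop; all names are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class GenerationRecord:
    prompt: str          # the user's detailed description
    image_path: str      # where the generated image was saved
    user_rating: float   # satisfaction signal in [0, 1], e.g. thumbs-up or download

def log_generation(record: GenerationRecord, log_file: str = "generations.jsonl") -> None:
    """Append one generation event to a JSON-lines log."""
    with open(log_file, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def harvest_training_pairs(log_file: str = "generations.jsonl", min_rating: float = 0.8):
    """Yield (prompt, image_path) pairs the user was clearly pleased with."""
    with open(log_file) as f:
        for line in f:
            rec = json.loads(line)
            if rec["user_rating"] >= min_rating:
                yield rec["prompt"], rec["image_path"]
```

The design choice here is simply to filter on the satisfaction signal before reuse, so only images the user actually liked get paired with their detailed prompts.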