Here is my extremely rough ELI-15. It uses some building blocks like "train a neural network", which probably warrant explanations of their own.
The system consists of a few components. First, CLIP. CLIP is essentially a pair of neural networks: one is a 'text encoder', and the other is an 'image encoder'. CLIP is trained on a giant corpus of images and their corresponding captions. The image encoder takes as input an image and spits out a numerical description of that image (called an 'encoding' or 'embedding'). The text encoder takes as input a caption and does the same. The networks are trained so that the encodings for a corresponding caption/image pair are close to each other. CLIP lets us ask "does this image match this caption?"
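If code helps, here's a toy sketch of that training idea. This is not OpenAI's actual implementation - the embedding size, batch size, and temperature below are made up - but it shows the core trick: for a batch of matching image/caption pairs, push each matching pair's embeddings together and push the non-matching combinations apart.

    import torch
    import torch.nn.functional as F

    def clip_style_loss(image_emb, text_emb, temperature=0.07):
        # image_emb, text_emb: (batch, dim) outputs of the two encoders.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Similarity of every image in the batch to every caption in the batch.
        logits = image_emb @ text_emb.t() / temperature
        # The "right answer" for image i is caption i (its own caption).
        labels = torch.arange(logits.shape[0])
        # Symmetric loss: pick the right caption for each image,
        # and the right image for each caption.
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

    # Random tensors standing in for real encoder outputs, just to show shapes.
    loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))

Once trained, asking "does this image match this caption?" is just measuring how close the two embeddings are (e.g. cosine similarity).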
The second part is an image generator. This is another neural network, which takes as input an encoding and produces an image. Its goal is to be the reverse of the CLIP image encoder (they call it unCLIP). The way it works is pretty complicated: it uses a process called 'diffusion'. Imagine you started with a real image and repeatedly added a little noise to it, step by step. Eventually, you'd end up with an image that is pure noise. The goal of a diffusion model is to learn the reverse process - given a noisy image, produce a slightly less noisy one, until eventually you end up with a clean, realistic image. This is a funny way to do things, but it turns out to have some advantages. One advantage is that it allows the system to build up the image step by step, starting from the large-scale structure and only filling in the fine details at the end. If you watch the video on their blog post, you can see this diffusion process in action. It's not just a special effect for the video - they're literally showing the system's process for creating an image, starting from noise. The mathematical details of how to train a diffusion model are very complicated.
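Here's a heavily simplified cartoon of the two processes. The real noise schedule, parameterization, and network are much more involved, and `denoiser` below is just a stand-in for the trained network:

    import torch

    def add_noise(x0, t, num_steps=1000):
        # Forward process: mix the clean image x0 with Gaussian noise.
        # Small t = barely noisy; t near num_steps = essentially pure noise.
        keep = 1.0 - t / num_steps
        return keep ** 0.5 * x0 + (1 - keep) ** 0.5 * torch.randn_like(x0)

    def sample(denoiser, shape, num_steps=1000):
        # Reverse process: start from pure noise and repeatedly ask the
        # trained network for a slightly less noisy image. Large-scale
        # structure shows up early, fine detail at the end.
        x = torch.randn(shape)
        for t in reversed(range(num_steps)):
            x = denoiser(x, t)
        return x

The `sample` loop is what their video is visualizing: the same tensor, getting a little less noisy at every step.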
The third is a "prior" (a confusing name). Its job is to take the encoding of a text prompt, and predict the encoding of the corresponding image. You might think that this is silly - CLIP was supposed to make the encodings of the caption and the image match! But the space of images and captions is not so simple - there are many images for a given caption, and many captions for a given image. I think of the "prior" as being responsible for picking which picture of "a teddy bear on a skateboard" we're going to draw, but this is a loose analogy.
So, now it's time to make an image. We take the prompt and ask CLIP to encode it. We give the CLIP text encoding to the prior, and it predicts an image encoding for us. Then we give the image encoding to the diffusion model, and it produces an image. This is, obviously, over-simplified, but it captures the process at a high level.
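In pseudocode, the whole pipeline is three calls (each argument is one of the trained models described above):

    def generate(prompt, clip_text_encoder, prior, decoder):
        text_emb = clip_text_encoder(prompt)   # 1. encode the prompt with CLIP
        image_emb = prior(text_emb)            # 2. predict a matching image encoding
        image = decoder(image_emb)             # 3. diffusion decoder draws the image
        return image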
Why does it work so well? A few reasons. First, CLIP is really good at its job. OpenAI scraped a colossal dataset of image/caption pairs, spent a huge amount of compute training it, and came up with a lot of clever training schemes to make it work. Second, diffusion models are really good at making realistic images. Previous work mostly used GANs, which try to generate a whole image in one go; some GANs are quite good, but so far diffusion seems to be better at generating images that match a prompt. The value of the image generator is that it constrains the output to be a realistic image. We could have just optimized raw pixels until we got something CLIP thinks looks like the prompt, but the result would likely not be a natural image.
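For the curious, that last alternative - optimizing raw pixels against CLIP - looks roughly like the sketch below. The encoder and prompt embedding here are toy stand-ins so it runs; with the real CLIP you'd get something that scores well but looks like weird, unnatural noise, which is exactly why the diffusion decoder matters.

    import torch

    # Toy stand-ins; in reality these would be the CLIP image encoder and the
    # CLIP embedding of your prompt.
    clip_image_encoder = torch.nn.Sequential(torch.nn.Flatten(),
                                             torch.nn.Linear(3 * 64 * 64, 512))
    prompt_emb = torch.randn(1, 512)

    pixels = torch.randn(1, 3, 64, 64, requires_grad=True)
    opt = torch.optim.Adam([pixels], lr=0.05)

    for step in range(200):
        img_emb = clip_image_encoder(pixels)
        loss = -torch.cosine_similarity(img_emb, prompt_emb).mean()  # maximize the match
        opt.zero_grad()
        loss.backward()
        opt.step()

    # `pixels` now scores well under the (toy) encoder, but nothing here forces
    # it to look like a real photo.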
To generate an image from a prompt, DALL-E 2 works as follows. First, ask CLIP to encode your prompt. Next, ask the prior what it thinks a good image encoding would be for that encoded prompt. Then ask the generator to draw that image encoding. Easy peasy!
Any pointers on getting up to speed on diffusion models? I haven't encountered them in my corner of the ML world, and googling around for a review paper didn't turn anything up.
This paper is a decent starting point on the literature side, but it's a doozy.
Both the paper and the blog post are pretty math-heavy. I have not yet found a really clear, intuitive explanation that doesn't get down in the weeds of the math, and it took me a long time to understand what the hell the math is trying to say (and there are some parts I still don't fully understand!).