Yeah, I mean you're right that ultimately the proof is in the pudding.
But I do think we could have guessed that this sort of approach would be better (at least at a high level - I'm not claiming I could have predicted all the technical details!). The previous approaches were sort of the best that people could do without access to the training data and resources - you had a pretrained CLIP encoder that could tell you how well a text caption and an image matched, and you had a pretrained image generator (GAN, diffusion model, whatever), and it was just a matter of trying to force the generator to output something that CLIP thought looked like the caption. You'd basically do gradient ascent to make the image look more and more and more like the text prompt (all the while trying to balance the need to still look like a realistic image). Just from an algorithm aesthetics perspective, it was very much a duct tape and chicken wire approach.
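To make that "critic loop" concrete, here's a toy sketch in Python. Nothing in it is real CLIP or a real generator: `objective`, `guided_search`, the 3-number "image", the realism penalty and every constant are made up for illustration. It only shows the shape of the loop: score the image against the text, nudge it uphill, repeat.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def objective(image, text_emb, anchor, realism_weight):
    # Hypothetical stand-in for "CLIP agreement minus a realism penalty".
    # The penalty just punishes drifting far from the starting image.
    drift = sum((x - a) ** 2 for x, a in zip(image, anchor))
    return cosine(image, text_emb) - realism_weight * drift

def numeric_grad(f, x, eps=1e-5):
    """Central-difference gradient; a real system would use autodiff."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def guided_search(text_emb, start, steps=200, lr=0.1, realism_weight=0.05):
    """Gradient ascent on the toy objective, starting from `start`."""
    image, anchor = list(start), list(start)
    f = lambda img: objective(img, text_emb, anchor, realism_weight)
    for _ in range(steps):
        g = numeric_grad(f, image)
        image = [x + lr * gi for x, gi in zip(image, g)]
    return image

text_emb = [1.0, 0.0, 0.0]   # pretend text embedding
start = [0.2, 0.9, 0.4]      # pretend initial "image"
result = guided_search(text_emb, start)
print(cosine(start, text_emb), cosine(result, text_emb))
```

The point of the sketch is that the "generator" never learns anything here; all the work happens at query time, one nudge after another.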
The analogy I would give is if you gave a three-year-old some paints, and they made an image and showed it to you, and you had to say, "this looks a little like a sunset" or "this looks a lot like a sunset". They would keep going back and adjusting their painting, and you'd keep giving feedback, and eventually you'd get something that looks like a sunset. But it'd be better, if you could manage it, to just teach the three-year-old how to paint, rather than have this brute force process.
Obviously the real challenge here is "well how do you teach a three-year-old how to paint?" - and I think you're right that that question still has a lot of alchemy to it.
I gotta be missing something here, because wasn’t “teaching a three year old to paint” (where the three year old is DALLE) the original objective in the first place? So if we’ve reduced the problem to that, it seems we’re back where we started. What’s the difference?
I meant to say that DALL-E 2's approach is closer to "teaching a three-year-old to paint" than the alternative methods. Instead of trying to maximize agreement with a text embedding like other methods, DALL-E 2 first predicts an image embedding (very roughly analogous to envisioning what you're going to draw before you start laying down paint), and then the decoder knows how to go from an embedding to an image (very roughly analogous to "knowing how to paint"). This is in contrast to approaches which operate by repeatedly querying "does this look like the text prompt?" as they refine the image (roughly analogous to not really knowing how to paint, but having a critic who tells you if you're getting warmer or colder).
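As a rough picture of that data flow, here's a toy Python sketch. `toy_prior` and `toy_decoder` are hypothetical stand-ins I made up (simple arithmetic plus a bit of seeded noise), not anything from the paper; the only thing it is meant to show is that generation is "text embedding -> image embedding -> image" in one pass, with no critic being queried along the way.

```python
import random

def toy_prior(text_emb):
    # Hypothetical stand-in for the prior: text embedding -> image embedding.
    return [0.5 * t + 0.1 for t in text_emb]

def toy_decoder(image_emb, seed, size=4):
    # Hypothetical stand-in for the decoder: image embedding -> size x size grid.
    # The seeded noise mimics a stochastic decoder giving different samples.
    rng = random.Random(seed)
    return [
        [image_emb[(r * size + c) % len(image_emb)] + rng.gauss(0.0, 0.01)
         for c in range(size)]
        for r in range(size)
    ]

def generate(text_emb, seed=0):
    """One forward pass: no 'does this match the prompt?' loop anywhere."""
    image_emb = toy_prior(text_emb)
    return toy_decoder(image_emb, seed)

image = generate([0.2, 0.7, 0.1])
```

Calling `generate` with the same text embedding and a different `seed` gives a different grid from the same image embedding, which is the toy version of sampling several images for one prompt.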
Well, original DALL-E also worked this way. The reason the open source models use searches is that OpenAI didn't release DALL-E, but only another project called CLIP they used to sort DALL-E output by quality. It turns out CLIP could be adapted to produce images too if you used it to drive a GAN.
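The "sort output by quality" part is easy to sketch. This is a toy Python version, assuming you already have an embedding for the prompt and one per candidate image; the vectors and the names `rerank`, `sample_a` etc. are invented for the example, and real CLIP embeddings would have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rerank(text_emb, candidates):
    """Sort (name, embedding) pairs by similarity to the text, best first."""
    return sorted(candidates, key=lambda c: cosine(text_emb, c[1]), reverse=True)

text_emb = [0.9, 0.1, 0.0]            # pretend prompt embedding
candidates = [                        # pretend embeddings of generated samples
    ("sample_a", [0.1, 0.9, 0.2]),
    ("sample_b", [0.8, 0.2, 0.1]),
    ("sample_c", [0.5, 0.5, 0.5]),
]
ranked = rerank(text_emb, candidates)
print([name for name, _ in ranked])   # ['sample_b', 'sample_c', 'sample_a']
```

Note that reranking only picks among images something else already generated; the search-based methods go a step further and use the same score to steer generation.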
There is a DALL-E model available now from another company and you can use it directly (mini-DALLE or ruDALL-E), but its vocabulary is small and it can't do faces for privacy reasons.
I don't think it is actually painting at all but I need to read the paper carefully.
I think it is using a free text query to select the best possible clipart from a big library and blend it together. Still very interesting and useful.
It would be extremely impressive if the "Koala dunking a basketball" had a puddle on the court in which it was reflected correctly; that would be mind-blowing.
This is actual image generation - the 'decoder' takes as input a latent code (representing the encoding of the text query), and synthesizes an image. It's not compositing or querying a reference library. The only time that real images enter the process is during training - after that, it's just the network weights.
It is compositing as a final step. I understand that the koala it is compositing may have been a previously nonexistent koala that it synthesized from a library of previously tagged koala images... that's cool, but what is the difference really from just plucking one of the pre-existing koalas into the scene?
The difference is just that it makes the compositing easier. If you don't have a pre-existing image that would match the shadows and angles you can hallucinate a new koala that does. Neat trick.
But I bet if I threw the poor marsupial at a basket net it would look really different from the original clipart of it climbing some tree in a slow and relaxed manner. See what I mean?
Maybe Dall-E 2 can make it strike a new pose. The limb positions could be altered. But the facial expression?
And if the basketball background has wind blowing leaves in one direction the koala fur won't match, it will look like the training set fur. The puddle won't reflect it, etc.
This thing doesn't understand what a koala is the way a 3-year-old does. It understands the text "koala" is associated with that tagged collection of pixel blobs and can conjure up similar blobs onto new backgrounds - but it can't paint me a new type of koala that it hasn't seen before. It just looks that way.
>And if the basketball background has wind blowing leaves in one direction the koala fur won't match, it will look like the training set fur. The puddle won't reflect it.
If you read the article, it gives examples that do exactly this. For example, adding a flamingo shows the flamingo reflected in a pool. Adding a corgi at different locations in a photo of an art gallery shows it in picture style when it's added to a picture, then in photorealistic style when it's on the ground.
Well, not so much an article as a set of really interesting hand-picked examples. The paper doesn't address this as far as I can tell. My guess is that this is a weak point that will trip it up occasionally.
A lot of the time it doesn't super matter, but sometimes it does.
I might be misinterpreting your use of "compositing" here (and my own technical knowledge is fairly shallow) but I don't think there's any compositing of elements generally in AI image generation. (unless DALL-E 2 changes this. I haven't read the paper yet)
> Given an image x, we can obtain its CLIP image embedding z_i and then use our decoder to “invert” z_i, producing new images that we call variations of our input.
> ..
> It is also possible to combine two images for variations. To do so, we perform spherical interpolation of their CLIP embeddings z_i and z_j to obtain intermediate z_θ = slerp(z_i, z_j, θ), and produce variations of z_θ by passing it through the decoder.
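For reference, here's what slerp looks like written out, as a small Python sketch. It's the standard spherical interpolation formula, not code from the paper; I'm assuming the embeddings get normalized to unit length first (the function does that itself), and the 2-D vectors are only there to make the result easy to check by eye.

```python
import math

def slerp(z_i, z_j, theta):
    """Spherical interpolation between two vectors; theta in [0, 1].

    theta=0 gives the direction of z_i, theta=1 the direction of z_j.
    Inputs are normalized to unit length first (an assumption of this sketch).
    """
    norm_i = math.sqrt(sum(x * x for x in z_i))
    norm_j = math.sqrt(sum(x * x for x in z_j))
    u = [x / norm_i for x in z_i]
    v = [x / norm_j for x in z_j]

    # Angle between the two unit vectors, clamped against rounding error.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    s = math.sin(omega)
    if s < 1e-8:
        # Nearly parallel (or exactly opposite, where slerp is undefined):
        # fall back to the first direction rather than divide by ~0.
        return list(u)

    w_i = math.sin((1.0 - theta) * omega) / s
    w_j = math.sin(theta * omega) / s
    return [w_i * a + w_j * b for a, b in zip(u, v)]

mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(mid)  # roughly [0.7071, 0.7071], still unit length
```

Unlike a plain average, the midpoint stays on the unit sphere, so the interpolated embedding keeps the same length as the two endpoints.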
From the limitations section:
> We find that the reconstructions mix up objects and attributes.
The first quote is talking about prompting the model with images instead of text. The second quote is using "mix up" in the sense that the model is confused about the prompt, not that it mixes up existing images.
ML models can output training data verbatim if they overfit, but a well-trained model does extrapolate to novel inputs. You could say that this model doesn't know that images are 2D representations of a larger 3D universe, but now we have NeRF, which kind of obsoletes this objection as well.
The model is "confused about the prompt" because it has no concept of a scene or of (some sort of) reality.
If we give the task "Koala dunking basketball" to a human and present them with two images, one of a koala climbing a tree and another of a basketball player dunking - the human would cut out the foregrounds (human, koala) from the backgrounds (basketball court, forest) and swap them easily.
The laborious part would be to match the shadows and angles in the new image. This requires skill and effort.
Dall-E would conjure up an entirely novel image from scratch, dodging this bit. It blended the concepts instead, great.
But it does not understand what a basketball court actually is, or why the koala would reflect in a puddle. Or why and how this new koala might look different in these circumstances from previous examples of koalas that it knows about.
The human dunker and the koala dunker are not truly interchangeable. :)
I'm not sure that's "compositing" except in the most abstract sense? But maybe that's the sense in which you mean it.
I'd argue that at no point are there separate representations of "a teddy bear" and "a background", each mapping closely to its visual appearance, that then get combined.
(I'm aware I'm being imprecise so give me some leeway here)