Yeah, I mean you're right that ultimately the proof is in the pudding.

But I do think we could have guessed that this sort of approach would be better (at least at a high level - I'm not claiming I could have predicted all the technical details!). The previous approaches were sort of the best that people could do without access to the training data and resources - you had a pretrained CLIP encoder that could tell you how well a text caption and an image matched, and you had a pretrained image generator (GAN, diffusion model, whatever), and it was just a matter of trying to force the generator to output something that CLIP thought looked like the caption. You'd basically do gradient ascent to make the image look more and more and more like the text prompt (all the while trying to balance the need to still look like a realistic image). Just from an algorithm aesthetics perspective, it was very much a duct tape and chicken wire approach.
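
For concreteness, the guidance loop looked roughly like the toy sketch below. This assumes OpenAI's clip package; real systems optimized a GAN or diffusion latent with augmentations and realism regularizers rather than raw pixels, but the shape is the same:

    # Toy CLIP-guidance loop: nudge an image until CLIP says it matches the text.
    import torch
    import clip  # OpenAI's CLIP repo: pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)
    model = model.float()  # keep fp32 so we can backprop through the encoder

    text_feat = model.encode_text(clip.tokenize(["a sunset"]).to(device)).detach()

    # Real approaches optimize a generator's latent; here we optimize pixels directly.
    image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
    opt = torch.optim.Adam([image], lr=0.01)

    for step in range(300):
        img_feat = model.encode_image(image.clamp(0, 1))             # CLIP as the critic
        loss = -torch.cosine_similarity(img_feat, text_feat).mean()  # "look more like the prompt"
        opt.zero_grad()
        loss.backward()
        opt.step()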

The analogy I would give is if you gave a three-year-old some paints, and they made an image and showed it to you, and you had to say, "this looks a little like a sunset" or "this looks a lot like a sunset". They would keep going back and adjusting their painting, and you'd keep giving feedback, and eventually you'd get something that looks like a sunset. But it'd be better, if you could manage it, to just teach the three-year-old how to paint, rather than have this brute force process.

Obviously the real challenge here is "well how do you teach a three-year-old how to paint?" - and I think you're right that that question still has a lot of alchemy to it.




I gotta be missing something here, because wasn’t “teaching a three year old to paint” (where the three year old is DALLE) the original objective in the first place? So if we’ve reduced the problem to that, it seems we’re back where we started. What’s the difference?


I meant to say that Dall-E 2's approach is closer to "teaching a three year old to paint" than the alternative methods. Instead of trying to maximize agreement with a text embedding like the other methods do, Dall-E 2 first predicts an image embedding (very roughly analogous to envisioning what you're going to draw before you start laying down paint), and then the decoder knows how to go from an embedding to an image (very roughly analogous to "knowing how to paint"). This is in contrast to approaches which operate by repeatedly querying "does this look like the text prompt?" as they refine the image (roughly analogous to not really knowing how to paint, but having a critic who tells you if you're getting warmer or colder).
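
Purely schematically, the contrast is something like this (every function here is a made-up stand-in with a trivial stub body so the sketch runs - none of it is the real API):

    import numpy as np

    # Hypothetical stand-ins for the real networks:
    clip_text_encoder = lambda prompt: np.ones(512)              # text -> text embedding
    prior = lambda text_emb: 0.5 * text_emb                      # text embedding -> image embedding
    decoder = lambda image_emb: np.zeros((64, 64, 3))            # image embedding -> pixels
    clip_similarity = lambda image, prompt: float(image.mean())  # the "critic"

    # Older approach: no idea what to paint, just a critic saying warmer/colder.
    def generate_with_critic(prompt, steps=500):
        image = np.random.rand(64, 64, 3)
        for _ in range(steps):
            score = clip_similarity(image, prompt)
            image = np.clip(image + 0.01 * (1.0 - score), 0.0, 1.0)  # nudge the image toward a higher score
        return image

    # Dall-E 2 (unCLIP) approach: envision first, then paint.
    def generate_unclip(prompt):
        text_emb = clip_text_encoder(prompt)   # encode the caption
        image_emb = prior(text_emb)            # "decide what you're going to draw"
        return decoder(image_emb)              # "know how to paint" from that embedding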


Well, original DALL-E also worked this way. The reason the open-source models use this kind of search is that OpenAI didn't release DALL-E, only another project called CLIP, which they used to sort DALL-E's output by quality. It turns out CLIP could be adapted to produce images too if you used it to drive a GAN.

There is a DALL-E model available now from another company and you can use it directly (mini-DALLE or ruDALL-E), but its vocabulary is small and it can't do faces for privacy reasons.


I don't think it is actually painting at all but I need to read the paper carefully.

I think it is using a free-text query to select the best possible clipart from a big library and blending it together. Still very interesting and useful.

It would be extremely impressive if the "koala dunking a basketball" had a puddle on the court in which it was reflected correctly - that would be mind-blowing.


This is actual image generation - the 'decoder' takes as input a latent code (representing the encoding of the text query), and synthesizes an image. It's not compositing or querying a reference library. The only time that real images enter the process is during training - after that, it's just the network weights.


It is compositing as a final step. I understand that the koala it is compositing may have been a previously nonexistent koala that it synthesized from a library of previously tagged koala images... that's cool, but what is the difference really from just dropping one of the pre-existing koalas into the scene?

The difference is just that it makes the compositing easier. If you don't have a pre-existing image that would match the shadows and angles you can hallucinate a new koala that does. Neat trick.

But I bet if I threw the poor marsupial at a basketball net it would look really different from the original clipart of it climbing some tree in a slow and relaxed manner. See what I mean?

Maybe Dall-E 2 can make it strike a new pose. The limb positions could be altered. But the facial expression?

And if the basketball background has wind blowing leaves in one direction the koala fur won't match; it will look like the training-set fur. The puddle won't reflect it. Etc.

This thing doesn't understand what a koala is the way a 3-year-old does. It understands that the text "koala" is associated with that tagged collection of pixel blobs and can conjure up similar blobs onto new backgrounds - but it can't paint me a new type of koala that it hasn't seen before. It just looks that way.


> And if the basketball background has wind blowing leaves in one direction the koala fur won't match; it will look like the training-set fur. The puddle won't reflect it.

If you read the article, it gives examples that do exactly this. For example, adding a flamingo shows the flamingo reflected in a pool. Adding a corgi at different locations in a photo of an art gallery shows it rendered in the painting's style when it's placed inside a painting, then in photorealistic style when it's on the ground.


Well, not so much an article as a set of really interesting hand-picked examples. The paper doesn't address this as far as I can tell. My guess is that this is a weak point that will trip it up occasionally.

A lot of the time it doesn't super matter, but sometimes it does.


> It is compositing as a final step.

I might be misinterpreting your use of "compositing" here (and my own technical knowledge is fairly shallow), but I don't think there's generally any compositing of elements in AI image generation (unless Dall-E 2 changes this - I haven't read the paper yet).


https://cdn.openai.com/papers/dall-e-2.pdf

> Given an image x, we can obtain its CLIP image embedding zi and then use our decoder to "invert" zi, producing new images that we call variations of our input. [...] It is also possible to combine two images for variations. To do so, we perform spherical interpolation of their CLIP embeddings zi and zj to obtain intermediate zθ = slerp(zi, zj, θ), and produce variations of zθ by passing it through the decoder.
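
For reference, the slerp in that quote is just ordinary spherical interpolation between the two embedding vectors. A minimal sketch (the clip_image_embed and decoder names at the bottom are stand-ins for the paper's CLIP encoder and diffusion decoder, which aren't public):

    import torch

    def slerp(z_i, z_j, theta):
        """Spherically interpolate between two embedding vectors."""
        zi_n, zj_n = z_i / z_i.norm(), z_j / z_j.norm()
        omega = torch.acos((zi_n * zj_n).sum().clamp(-1.0, 1.0))  # angle between the embeddings
        return (torch.sin((1 - theta) * omega) * z_i
                + torch.sin(theta * omega) * z_j) / torch.sin(omega)

    # z_theta = slerp(clip_image_embed(x_i), clip_image_embed(x_j), 0.5)
    # variations = decoder(z_theta)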

From the limitations section:

> We find that the reconstructions mix up objects and attributes.


The first quote is talking about prompting the model with images instead of text. The second quote is using "mix up" in the sense that the model is confused about the prompt, not that it mixes up existing images.

ML models can output training data verbatim if they overfit, but a well-trained model does extrapolate to novel inputs. You could say that this model doesn't know that images are 2D representations of a larger 3D universe, but now we have NeRF, which kind of obsoletes this objection as well.


The model is "confused about the prompt" because it has no concept of a scene or of (some sort of) reality.

If we gave the task "koala dunking a basketball" to a human and presented them with two images, one of a koala climbing a tree and another of a basketball player dunking, the human would easily cut out the foregrounds (human, koala) from the backgrounds (basketball court, forest) and swap them.

The laborious part would be to match the shadows and angles in the new image. This requires skill and effort.

Dall-E would conjure up an entirely novel image from scratch, dodging this bit. It blended the concepts instead, great.

But it does not understand what a basketball court actually is, or why the koala would reflect in a puddle. Or why and how this new koala might look different in these circumstances from previous examples of koalas that it knows about.

The human dunker and the koala dunker are not truly interchangeable. :)


I'm not sure that's "compositing" except in the most abstract sense? But maybe that's the sense in which you mean it.

I'd argue that at no point are there separate representations of a "teddy bear" and a "background" - each mapping closely to its visual appearance - that get combined.

(I'm aware I'm being imprecise so give me some leeway here)


This model's predecessor could do image editing with some help:

https://arxiv.org/pdf/2112.10741.pdf

so it could distinguish individual objects from backgrounds. Other ML models can definitely do that; it's called "panoptic segmentation".
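
If you want to see what panoptic segmentation gives you, here's a rough sketch using a Hugging Face transformers checkpoint (the model name, the input file, and the exact API calls are my assumptions, and this is entirely separate from Dall-E 2 itself):

    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

    name = "facebook/mask2former-swin-tiny-coco-panoptic"  # assumed checkpoint name
    processor = AutoImageProcessor.from_pretrained(name)
    model = Mask2FormerForUniversalSegmentation.from_pretrained(name)

    image = Image.open("court.jpg")  # hypothetical basketball-court photo
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    result = processor.post_process_panoptic_segmentation(
        outputs, target_sizes=[image.size[::-1]])[0]
    # result["segmentation"]: a per-pixel map of segment ids
    # result["segments_info"]: the detected "things" (players, ball) and
    #     "stuff" (floor, crowd) - i.e. objects separated from background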


Thank you! Fascinating, I didn't know about panoptic segmentation - that makes things much more interesting.

It really needs to expose the whole pipeline to become truly useful.



