
https://cdn.openai.com/papers/dall-e-2.pdf

> Given an image x, we can obtain its CLIP image embedding zi and then use our decoder to “invert” zi, producing new images that we call variations of our input. ... It is also possible to combine two images for variations. To do so, we perform spherical interpolation of their CLIP embeddings zi and zj to obtain intermediate zθ = slerp(zi, zj, θ), and produce variations of zθ by passing it through the decoder.
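
For anyone who wants to play with this: slerp here is just the standard spherical-interpolation formula applied to the embedding vectors. A minimal NumPy sketch (my own, not the paper's code; it normalizes the embeddings and treats them as points on a unit sphere):

    import numpy as np

    def slerp(z_i, z_j, theta):
        # Spherical linear interpolation between two embedding vectors.
        z_i = z_i / np.linalg.norm(z_i)
        z_j = z_j / np.linalg.norm(z_j)
        # Angle between the (normalized) embeddings.
        omega = np.arccos(np.clip(np.dot(z_i, z_j), -1.0, 1.0))
        if np.isclose(omega, 0.0):
            return z_i  # nearly parallel vectors: slerp degenerates
        return (np.sin((1.0 - theta) * omega) * z_i
                + np.sin(theta * omega) * z_j) / np.sin(omega)

    # theta = 0 gives z_i, theta = 1 gives z_j; intermediate values
    # sweep along the arc between the two embeddings.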

From the limitations section:

> We find that the reconstructions mix up objects and attributes.




The first quote is talking about prompting the model with images instead of text. The second quote is using "mix up" in the sense that the model is confused about the prompt, not that it mixes up existing images.

ML models can output training data verbatim if they overfit, but a well-trained model does extrapolate to novel inputs. You could say that this model doesn't know that images are 2D representations of a larger 3D universe, but now we have NeRF, which kind of obsoletes this objection as well.


The model is "confused about the prompt" because it has no concept of a scene or of (some sort of) reality.

If we give the task "Koala dunking a basketball" to a human and present them with two images, one of a koala climbing a tree and another of a basketball player dunking, the human would easily cut out the foregrounds (human, koala) from the backgrounds (basketball court, forest) and swap them.

The laborious part would be matching the shadows and angles in the new image. That requires skill and effort.
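
Mechanically, that "cut out and swap" step is just masked compositing. A toy PIL sketch, with hypothetical file names standing in for the cutout and the background:

    from PIL import Image

    # Hypothetical inputs: a koala already cut out onto a
    # transparent background, and a basketball-court photo.
    koala = Image.open("koala_cutout.png").convert("RGBA")
    court = Image.open("court.jpg").convert("RGBA")

    composite = court.copy()
    # The cutout's alpha channel doubles as the paste mask.
    composite.paste(koala, (200, 80), mask=koala)
    composite.convert("RGB").save("koala_dunk.jpg")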

DALL-E would conjure up an entirely novel image from scratch, dodging that part. It blends the concepts instead, which is great.

But it does not understand what a basketball court actually is, or why the koala would reflect in a puddle, or why and how this new koala might look different in these circumstances from the previous examples of koalas it knows about.

The human dunker and the koala dunker are not truly interchangeable. :)


I'm not sure that's "compositing" except in the most abstract sense? But maybe that's the sense in which you mean it.

I'd argue that at no point are there representations of a "teddy bear" and "a background" that map closely to their visual appearance and that get combined.

(I'm aware I'm being imprecise so give me some leeway here)


This model's predecessor, GLIDE, could do image editing with some help:

https://arxiv.org/pdf/2112.10741.pdf

so it could distinguish individual objects from backgrounds. Other ML models can definitely do that; it's called "panoptic segmentation".
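
If you want to try it, Detectron2 ships pretrained panoptic models. A rough sketch (assuming the detectron2 package and one of its COCO panoptic configs; this is not anything from the DALL-E pipeline):

    import cv2
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    config = "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml"
    cfg.merge_from_file(model_zoo.get_config_file(config))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config)
    predictor = DefaultPredictor(cfg)

    image = cv2.imread("input.jpg")  # hypothetical input path
    # panoptic_seg: per-pixel segment ids; segments_info: one entry
    # per segment, covering both "thing" instances (a koala) and
    # "stuff" regions (forest, court floor).
    panoptic_seg, segments_info = predictor(image)["panoptic_seg"]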


Thank you! Fascinating - I didn't know about panoptic segmentation; that makes things much more interesting.

It really needs to expose the whole pipeline to become truly useful.



