
https://cdn.openai.com/papers/dall-e-2.pdf

> Given an image x, we can obtain its CLIP image embedding zi and then use our decoder to “invert” zi, producing new images that we call variations of our input. ... It is also possible to combine two images for variations. To do so, we perform spherical interpolation of their CLIP embeddings zi and zj to obtain intermediate zθ = slerp(zi, zj, θ), and produce variations of zθ by passing it through the decoder.
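
For anyone who wants to play with this: slerp here is just the standard spherical-interpolation formula applied to the embedding vectors. A minimal NumPy sketch (my own, not the paper's code; it normalizes the embeddings and treats them as points on a unit sphere):

    import numpy as np

    def slerp(z_i, z_j, theta):
        # Spherical linear interpolation between two embedding vectors.
        z_i = z_i / np.linalg.norm(z_i)
        z_j = z_j / np.linalg.norm(z_j)
        # Angle between the (normalized) embeddings.
        omega = np.arccos(np.clip(np.dot(z_i, z_j), -1.0, 1.0))
        if np.isclose(omega, 0.0):
            return z_i  # nearly parallel vectors: slerp degenerates
        return (np.sin((1.0 - theta) * omega) * z_i
                + np.sin(theta * omega) * z_j) / np.sin(omega)

    # theta = 0 gives z_i, theta = 1 gives z_j; intermediate values
    # sweep along the arc between the two embeddings.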

From the limitations section:

> We find that the reconstructions mix up objects and attributes.




The first quote is talking about prompting the model with images instead of text. The second quote is using "mix up" in the sense that the model is confused about the prompt, not that it mixes up existing images.

ML models can output training data verbatim if they overfit, but a well-trained model does extrapolate to novel inputs. You could say that this model doesn't know that images are 2D representations of a larger 3D universe, but now we have NeRF, which kind of obsoletes this objection as well.


The model is "confused about the prompt" because it has no concept of a scene or of (some sort of) reality.

If we give the task "Koala dunking a basketball" to a human and present them with two images, one of a koala climbing a tree and another of a basketball player dunking, the human would easily cut out the foregrounds (human, koala) from the backgrounds (basketball court, forest) and swap them.

The laborious part would be matching the shadows and angles in the new image. That requires skill and effort.
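
Mechanically, that "cut out and swap" step is just masked compositing. A toy PIL sketch, with hypothetical file names standing in for the cutout and the background:

    from PIL import Image

    # Hypothetical inputs: a koala already cut out onto a
    # transparent background, and a basketball-court photo.
    koala = Image.open("koala_cutout.png").convert("RGBA")
    court = Image.open("court.jpg").convert("RGBA")

    composite = court.copy()
    # The cutout's alpha channel doubles as the paste mask.
    composite.paste(koala, (200, 80), mask=koala)
    composite.convert("RGB").save("koala_dunk.jpg")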

DALL-E would conjure up an entirely novel image from scratch, dodging that part. It blends the concepts instead, which is great.

But it does not understand what a basketball court actually is, or why the koala would reflect in a puddle, or why and how this new koala might look different in these circumstances from the previous examples of koalas it knows about.

The human dunker and the koala dunker are not truly interchangeable. :)


I'm not sure that's "compositing" except in the most abstract sense? But maybe that's the sense in which you mean it.

I'd argue that at no point are there representations of a "teddy bear" and "a background" that map closely to their visual appearance and that get combined.

(I'm aware I'm being imprecise so give me some leeway here)


This model's predecessor, GLIDE, could do image editing with some help:

https://arxiv.org/pdf/2112.10741.pdf

so it could distinguish individual objects from backgrounds. Other ML models can definitely do that; it's called "panoptic segmentation".
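
If you want to try it, Detectron2 ships pretrained panoptic models. A rough sketch (assuming the detectron2 package and one of its COCO panoptic configs; this is not anything from the DALL-E pipeline):

    import cv2
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    config = "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml"
    cfg.merge_from_file(model_zoo.get_config_file(config))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config)
    predictor = DefaultPredictor(cfg)

    image = cv2.imread("input.jpg")  # hypothetical input path
    # panoptic_seg: per-pixel segment ids; segments_info: one entry
    # per segment, covering both "thing" instances (a koala) and
    # "stuff" regions (forest, court floor).
    panoptic_seg, segments_info = predictor(image)["panoptic_seg"]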


Thank you! Fascinating - I didn't know about panoptic segmentation; that makes things much more interesting.

It really needs to expose the whole pipeline to become truly useful.



