This is exactly what they demo - they lock a scene and add a flamingo in three different locations. In another one they lock the scene and add a corgi.
- Select from X variations the new image that looks best to you
- It does the equivalent of a Google image search on your "flamingo" prompt
- It picks the most blend-able ones as a basis for a new synthetic flamingo
- It superimposes the result on your image
Very cool, don't get me wrong. Now I want to tweak this new floating flamingo I picked further, or have that corgi in the museum maybe sink into the little couch a bit, as it has weight in the real world.
Can't. You'd have to start over with the prompt or use this as the new base image maybe.
The example with furniture placement in an empty room is also very interesting. You could describe the kind of couch you want and where you want it and it will throw you decent options.
But say I want the purple one in the middle of the room that it gave me as an option, but rotated a little bit. It would generate a completely new purple couch. Maybe it will even look pretty similar but not exactly the same.
That's not how this works. There is no 'search' step, there is no 'superimposing' step. It's not really possible to explain what the AI is doing using these concepts.
If you pay attention to all the corgi examples, the sofa texture changes in each of them, and it synthesizes shadows in the right orientation - that's what it's trained to do. The first one actually does give you the impression of weight. And if you look at "A bowl of soup that looks like a monster knitted out of wool", the bowl is clearly weighing down. I bet if the picture had a fluffier sofa you would indeed see the corgi making an indent in it, as it will have learned that from its training set.
Of course there will be limits to how much you can edit, but then nothing stops you from pulling the result into Photoshop for extra fine adjustments of your own. This is far from a 'cool trick', and many of those images would take hours for a human to reproduce, especially with complex textures like the teddy bear ones. And note how they also have consistent specular reflections in all the glass materials.
How do you propose we talk about what it is doing if not by using the terminology from the human editing process it is replacing? I'm struggling to express things.
My issue is that it appears to not be possible to explain what the AI is doing at all. If you could, you'd be able to actually control the output. And talking about how the model is trained is interesting but not an answer.
Of course there is a superimposing step; that just means it adds its layer on top of the photo you provide. That's all it means, and that's literally what it is doing - that's all I tried to say, heh.
> If you pay attention to all the corgi examples, the sofa texture changes in each of them
Yes, exactly!
> This is far from a 'cool trick' and many of those images would take hours for a human to reproduce
OK, fair enough. I'll try to be more clear:
It is very cool and not a trick, and the results are fantastic if what you get out is exactly what you wanted. Amazing time saver. And if not? Right now this is totally hit or miss.
It would also take hours for a human to reproduce a Vermeer, and this no doubt has those in its training set and would style-transfer one onto a corgi instantly. Certainly faster than Vermeer himself could do it.
But Vermeer could explain how he came up with the style, his techniques, choices, etc.
It reads like the advance here is that it will usually synthesize something that looks great but not always the thing that you want. With no recourse.
> Of course there is a superimposing step; that just means it adds its layer on top of the photo you provide. That's all it means, and that's literally what it is doing - that's all I tried to say, heh.
It is not doing this. You are wrong. You are mistaken. You are confused. You do not understand what is happening.
(People have tried to tell you this several times, but you're not listening. *shrug* One more can't hurt.)
I am specifically referring to the flamingo example: "DALL·E 2 can make realistic edits to existing images from a natural language caption."
You provide the background image and a text prompt and it doodles on top of the image you provided as per their demonstration. I wasn't referring to the other examples down the page where it conjures up a brand new image from scratch based on your image input.
It is great that you can tell it to add a flamingo and it fits into the background you provide nicely due to the well-tuned style transfer. That part is cool. And it is impressive that sometimes the flamingo it adds is reflected in the water. But sometimes it isn't reflected. And it isn't up to you, it is up to it. And you can't tell it to add a reflection as a discrete step.
Look more carefully. This is more akin to a clipart finder, except that if the clipart doesn't exist, it uses the thing in its training set most similar to what it guesses you want as a starting point to synthesize new clipart.
It doesn't add it in like an artist would and you can't control it at all. I don't know how to better express this.
This isn't unimpressive or un-useful, but it's not quite as mind-blowing on second glance.
Or am I in denial about how impressive this all really is by reading something slightly different into the static, hand-selected examples OpenAI teased us with? :)
I'm sure that two more papers down the line this thing will do, much more seamlessly, what the true believers are convinced it already does perfectly - if they solve for my new favorite term, panoptic segmentation.
The link was there as an analogy: religious people who can't accept science still try to find "gaps" where science can't explain something, so they can imply God is doing it.
> But Vermeer could explain how he came up with the style, his techniques, choices, etc.
Often they can't. Ramanujan couldn't explain how he solved math problems, for instance, and humans can forget their own history easily, or even forget how to do something consciously while still doing it through muscle memory.
An ML model wouldn't forget the same way, but it could just lie to you.
Being opaque to human understanding is one of the downsides of existing AI/ML tech, for sure. Check out the video on the page and notice how the images transition from random color blobs to increasing detail - that's showing you how the image is being generated. It's a continuous process of trying to satisfy a prediction; there are no discrete editing steps.
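To give a feel for what "continuous process" means here: a diffusion-style generator starts from noise and repeatedly moves the entire image toward its current prediction of the finished picture. The toy Python below is not DALL·E's actual code - `denoise_step` and its blending rule are placeholders I made up - but the loop shape is the point: many small whole-image updates, with no moment where an object gets pasted on as a layer.

```python
import numpy as np

def denoise_step(image, step, total_steps):
    # Hypothetical stand-in for the model's learned update: in a real
    # diffusion model, a neural network predicts what the finished image
    # should look like and the sampler nudges the current image toward
    # that prediction; here the "prediction" is just a clipped copy.
    predicted_clean = np.clip(image, 0.0, 1.0)
    weight = 1.0 / (total_steps - step)  # later steps commit harder
    return (1.0 - weight) * image + weight * predicted_clean

# Start from pure noise and refine every pixel a little at every step.
rng = np.random.default_rng(0)
image = rng.normal(loc=0.5, scale=0.5, size=(64, 64, 3))
total_steps = 50
for step in range(total_steps):
    image = denoise_step(image, step, total_steps)
# At no point is a separate "flamingo layer" pasted on top; the whole
# image is revised jointly, which is why you can't address one object
# as a discrete editing step.
```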
The kind of tech you're imagining, where the computer has semantic understanding of what's in the picture and is reproducing something based on a 3D scene, knowledge of physics, materials, etc., is probably decades away. In that sense, yes, this is just a 'trick'.
The "edit" capability, as far as I can tell please correct me if I got confused, is picking your favorite out of the generated variations.
I would like to "lock" the scene and add instructions like "throw in a reflection".