I'd say Dall-E 2 is a little more unified - they do have multiple networks, but they're trained to work together. The previous approaches I was talking about are a lot more like the microservices analogy. Someone published a model (called CLIP) that can say "how much does this image look like a sunset?". Someone else published a totally different model (e.g. VQGAN) that can generate images (but with no way to provide text prompts). A third person figures out a clever way to link the two up: have the VQGAN make an image, ask CLIP how much it looks like a sunset, use backpropagation to adjust the image a little, and repeat until you have a sunset. Each component is its own thing, and VQGAN and CLIP don't know anything about one another.
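Concretely, the glue loop is something like this minimal PyTorch sketch - vqgan_decode and clip_score here are toy stand-ins for the two published models, not their real APIs:

  # Stand-ins for two independently published models that know
  # nothing about each other.
  import torch

  def vqgan_decode(latent):          # placeholder for VQGAN's generator
      return torch.sigmoid(latent)   # pretend: latent -> image

  def clip_score(image):             # placeholder for CLIP's "sunset-ness" score
      target = torch.full_like(image, 0.8)    # toy target; real CLIP embeds the prompt
      return -((image - target) ** 2).mean()  # higher = better match

  latent = torch.randn(1, 3, 64, 64, requires_grad=True)
  opt = torch.optim.Adam([latent], lr=0.05)

  for step in range(200):
      image = vqgan_decode(latent)   # 1. generator makes an image
      score = clip_score(image)      # 2. CLIP says how sunset-like it is
      opt.zero_grad()
      (-score).backward()            # 3. backprop adjusts the image a little
      opt.step()                     # 4. repeat until you have a "sunset"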



VQGAN (being a "GAN") is already two networks - one Generates things, and the other is Adversarial and judges if the other network is good enough, then you train them both at once and they fight.
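In code the adversarial setup is just two optimizers taking turns - roughly like this, assuming toy networks and made-up 2D data purely for illustration:

  import torch
  import torch.nn as nn

  G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # Generator
  D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # Adversary
  opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
  opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
  bce = nn.BCEWithLogitsLoss()

  for step in range(1000):
      real = torch.randn(64, 2) + 3.0   # "real" data: a shifted blob
      fake = G(torch.randn(64, 16))     # generator's attempt

      # Train the adversary: label real 1, fake 0.
      d_loss = (bce(D(real), torch.ones(64, 1))
                + bce(D(fake.detach()), torch.zeros(64, 1)))
      opt_d.zero_grad()
      d_loss.backward()
      opt_d.step()

      # Train the generator: try to fool the adversary.
      g_loss = bce(D(fake), torch.ones(64, 1))
      opt_g.zero_grad()
      g_loss.backward()
      opt_g.step()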

CLIP+VQGAN generation IIRC works by replacing the adversarial network with CLIP, so it understands text prompts, then retraining towards the prompted target for a while and generating whatever it's learned from that.

GANs are a silly idea that shouldn't work but somehow do. There are some attempts to replace the idea: https://www.microsoft.com/en-us/research/blog/unlocking-new-...


I think that in CLIP+VQGAN, the VQGAN model is frozen, and what you do is start from a random latent code, generate an image, pass it to CLIP, and then backprop through CLIP and through the VQGAN generator to figure out how you should move the latent code to make it better match the prompt. Then you just keep taking gradient ascent steps to find better and better latent codes. So it's like 'retraining', except you're 'training' the network input rather than the network weights.
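A minimal sketch of that frozen-generator setup, again with toy stand-ins for the real models (the tiny generator and clip_match scorer are made up for illustration):

  import torch
  import torch.nn as nn

  # Stand-in generator; in the real setup this is the pretrained VQGAN decoder.
  generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                            nn.Linear(64, 3 * 8 * 8))
  for p in generator.parameters():
      p.requires_grad_(False)        # frozen: the weights are never updated

  def clip_match(image_vec):         # placeholder for CLIP(image, prompt)
      target = torch.full_like(image_vec, 0.5)
      return -((image_vec - target) ** 2).mean()

  z = torch.randn(1, 16, requires_grad=True)   # the only trainable thing
  opt = torch.optim.Adam([z], lr=0.05)

  for step in range(300):
      score = clip_match(generator(z))  # gradients flow through the frozen generator...
      opt.zero_grad()
      (-score).backward()               # ...but only the latent code z moves
      opt.step()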


Got it, thanks.

Makes sense to me as a way of avoiding a sort of maximized sunset that is always there and screams SUNSET rather than being a nice sunset... but also avoiding watering it down into a way-too-subtle sunset.

It's not AI, but I've been watching some folks solving (or trying to solve) vehicle routing problems, and you get the "this looks like it was maximized for X" kind of solution - but maybe that's not what's important, and customer perception is unpredictable. I kinda want to just come up with 3 solutions and let someone randomly click... in fact I see some software do that at times.


Yeah, I think the trick is that when you ask for "a picture of a sunset", you're really asking for "a picture of a sunset that looks like a realistic natural image and obeys the laws of reality and is consistent with all of the other tacit expectations a human has for an image". And so if you just go all in on "a picture of a sunset", you often end up with what a human would describe as "a picture of what an AI thinks a sunset is".



