I'd say Dall-E 2 is a little more unified - they do have multiple networks, but they're trained to work together. The previous approaches I was talking about are a lot more like the microservices analogy. Someone published a model (called CLIP) that can say "how much does this image look like a sunset?". Someone else published a totally different model (e.g. VQGAN) that can generate images (but with no way to provide text prompts). A third person figures out a clever way to link the two up: have the VQGAN make an image, ask CLIP how much it looks like a sunset, use backpropagation to adjust the image a little, and repeat until you have a sunset. Each component is its own thing, and VQGAN and CLIP don't know anything about one another.
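Concretely, the glue loop is something like this minimal PyTorch sketch - vqgan_decode and clip_score here are toy stand-ins for the two published models, not their real APIs:

  # Stand-ins for two independently published models that know
  # nothing about each other.
  import torch

  def vqgan_decode(latent):          # placeholder for VQGAN's generator
      return torch.sigmoid(latent)   # pretend: latent -> image

  def clip_score(image):             # placeholder for CLIP's "sunset-ness" score
      target = torch.full_like(image, 0.8)    # toy target; real CLIP embeds the prompt
      return -((image - target) ** 2).mean()  # higher = better match

  latent = torch.randn(1, 3, 64, 64, requires_grad=True)
  opt = torch.optim.Adam([latent], lr=0.05)

  for step in range(200):
      image = vqgan_decode(latent)   # 1. generator makes an image
      score = clip_score(image)      # 2. CLIP says how sunset-like it is
      opt.zero_grad()
      (-score).backward()            # 3. backprop adjusts the image a little
      opt.step()                     # 4. repeat until you have a "sunset"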



VQGAN (being a "GAN") is already two networks - one Generates things, and the other is Adversarial and judges if the other network is good enough, then you train them both at once and they fight.
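In code the adversarial setup is just two optimizers taking turns - roughly like this, assuming toy networks and made-up 2D data purely for illustration:

  import torch
  import torch.nn as nn

  G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # Generator
  D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # Adversary
  opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
  opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
  bce = nn.BCEWithLogitsLoss()

  for step in range(1000):
      real = torch.randn(64, 2) + 3.0   # "real" data: a shifted blob
      fake = G(torch.randn(64, 16))     # generator's attempt

      # Train the adversary: label real 1, fake 0.
      d_loss = (bce(D(real), torch.ones(64, 1))
                + bce(D(fake.detach()), torch.zeros(64, 1)))
      opt_d.zero_grad()
      d_loss.backward()
      opt_d.step()

      # Train the generator: try to fool the adversary.
      g_loss = bce(D(fake), torch.ones(64, 1))
      opt_g.zero_grad()
      g_loss.backward()
      opt_g.step()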

CLIP+VQGAN generation IIRC works by replacing the adversarial network with CLIP, so it understands text prompts, then retraining towards the prompted target for a while and generating whatever it's learned from that.

GANs are a silly idea that shouldn't work but somehow do. There are some attempts to replace the idea: https://www.microsoft.com/en-us/research/blog/unlocking-new-...


I think that in CLIP+VQGAN, the VQGAN model is frozen, and what you do is start from a random latent code, generate an image, pass it to CLIP, and then backprop through CLIP and through the VQGAN generator to figure out how you should move the latent code to make it better match the prompt. Then you just keep taking gradient ascent steps to find better and better latent codes. So it's like 'retraining', except you're 'training' the network input rather than the network weights.
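A minimal sketch of that frozen-generator setup, again with toy stand-ins for the real models (the tiny generator and clip_match scorer are made up for illustration):

  import torch
  import torch.nn as nn

  # Stand-in generator; in the real setup this is the pretrained VQGAN decoder.
  generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                            nn.Linear(64, 3 * 8 * 8))
  for p in generator.parameters():
      p.requires_grad_(False)        # frozen: the weights are never updated

  def clip_match(image_vec):         # placeholder for CLIP(image, prompt)
      target = torch.full_like(image_vec, 0.5)
      return -((image_vec - target) ** 2).mean()

  z = torch.randn(1, 16, requires_grad=True)   # the only trainable thing
  opt = torch.optim.Adam([z], lr=0.05)

  for step in range(300):
      score = clip_match(generator(z))  # gradients flow through the frozen generator...
      opt.zero_grad()
      (-score).backward()               # ...but only the latent code z moves
      opt.step()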


Got it, thanks.

Makes sense to me as a way of avoiding a sort of maximized sunset that is always there and screams SUNSET rather than being a nice sunset... but also avoiding watering it down into a way-too-subtle sunset.

It's not AI, but I've been watching some folks solving (or trying to solve) vehicle routing problems, and you get the "this looks like it was maximized for X" kind of solution - but maybe that's not what's important, and customer perception is unpredictable. I kinda want to just come up with 3 solutions and let someone randomly click... in fact I see some software do that at times.


Yeah, I think the trick is that when you ask for "a picture of a sunset", you're really asking for "a picture of a sunset that looks like a realistic natural image and obeys the laws of reality and is consistent with all of the other tacit expectations a human has for an image". And so if you just go all in on "a picture of a sunset", you often end up with what a human would describe as "a picture of what an AI thinks a sunset is".



