Yeah, my comment didn't really do a good job of making clear that distinction. Obviously the details are pretty technical, but maybe I can give a high-level explanation.
The previous systems I was talking about work something like this: "Try to find me the image that best matches 'a picture of a sunset'. Do this by repeatedly updating your image to make it look more and more like a sunset." Well, what looks more like a sunset? Two sunsets! Three sunsets! But this is not normally the way images are produced - if you hire an artist to make you a picture of a bear, they don't endeavor to create the most "bear" image possible.
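To make that concrete, here's a toy sketch of the "keep nudging the image" loop. The encoder and text embedding below are random stand-ins, not real CLIP, and real systems add a lot of tricks on top - this is only meant to show the shape of the optimization:

    # Toy version of "repeatedly update your image to look more like the prompt".
    # The image encoder and text embedding are random stand-ins, not real CLIP --
    # the point is just the shape of the optimization loop.
    import torch

    torch.manual_seed(0)
    image = torch.randn(3 * 64 * 64, requires_grad=True)   # the image being optimized
    image_encoder = torch.nn.Linear(3 * 64 * 64, 512)      # stand-in for CLIP's image tower
    text_embedding = torch.randn(512)                       # stand-in for encode("a picture of a sunset")

    opt = torch.optim.Adam([image], lr=0.05)
    for step in range(200):
        loss = -torch.nn.functional.cosine_similarity(
            image_encoder(image), text_embedding, dim=0)    # more similar = lower loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Nothing here says "make one realistic sunset" -- cramming sunset-ness into
    # every corner of the image is a perfectly good way to drive the loss down.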
Instead, what an artist might do is envision a bear in their head (this is loosely the job of the 'prior' - a name I agree is confusing), and then draw that particular bear image.
But why is this any different? Who cares if the vector I'm trying to draw is a 'text encoding' or an 'image encoding'? Like you say, it's all just vectors.
Take this answer with a big grain of salt, because this is just my personal intuitive understanding, but here's what I think: These encodings are produced by CLIP. CLIP has a text encoder and an image encoder. During training, you give it a text caption and a corresponding image, it encodes both, and tries to make the two encodings close. But there are many images which might accompany the caption "a picture of a bear". And conversely there are many captions which might accompany any given picture.
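If it helps, that training objective is roughly a symmetric contrastive loss over a batch: each image embedding should be closest to its own caption's embedding and far from everyone else's, and vice versa. Here's a rough sketch with random embeddings standing in for the encoder outputs (and glossing over details like the learned temperature):

    # Rough sketch of CLIP's training objective: in a batch of (image, caption)
    # pairs, each image embedding should match its own caption and not the other
    # captions in the batch (and symmetrically for captions). Random embeddings
    # stand in for the encoder outputs; the real temperature is learned.
    import torch
    import torch.nn.functional as F

    batch = 8
    img_emb = F.normalize(torch.randn(batch, 512), dim=-1)  # would be image_encoder(images)
    txt_emb = F.normalize(torch.randn(batch, 512), dim=-1)  # would be text_encoder(captions)

    logits = img_emb @ txt_emb.T / 0.07                     # all pairwise similarities
    targets = torch.arange(batch)                           # pair i belongs with pair i
    loss = (F.cross_entropy(logits, targets) +              # image -> which caption?
            F.cross_entropy(logits.T, targets)) / 2         # caption -> which image?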
So the text encoding of "a picture of a bear" isn't really a good target - it sort of represents an amalgamation of all the possible bear pictures. It's better to pick one bear picture (i.e. generate one image embedding that we think matches the text embedding), and then just try to draw that. Doing it this way, we aren't just trying to find the maximum bear picture - which probably doesn't even look like a realistic natural image.
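So the overall pipeline, as I understand it, is roughly: text embedding goes into the prior, which spits out one specific image embedding, and a decoder then draws that embedding. A very hand-wavy sketch, with linear layers as stand-ins for what are really much bigger generative models:

    # Hand-wavy sketch of the two-stage idea: the prior maps the text embedding
    # (plus some randomness) to ONE particular image embedding, and the decoder
    # draws that. Linear layers are stand-ins; the real prior and decoder are
    # much bigger generative models.
    import torch

    text_emb = torch.randn(512)                    # CLIP text embedding of "a picture of a bear"
    prior = torch.nn.Linear(512 + 512, 512)        # stand-in for the prior
    decoder = torch.nn.Linear(512, 3 * 64 * 64)    # stand-in for the decoder

    noise = torch.randn(512)                       # different noise -> a different plausible bear
    img_emb = prior(torch.cat([text_emb, noise]))  # "envision one particular bear"
    image = decoder(img_emb)                       # "now draw that bear"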
Like I said, this is just my personal intuition, and may very well be a load of crap.
A bit more detail is that CLIP isn't designed to directly solve "is this a bear" aka "does this image match 'bear'". It's designed to do comparisons, like "which of images A and B is more like 'bear'". So it doesn't have a concept of absolute bear-ness.
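A crude way to see that: in practice you don't look at CLIP's raw similarity for "bear" in isolation, you softmax it against other candidate captions, so the score only means something relative to the alternatives. Again with random stand-in embeddings:

    # Relative, not absolute: what you actually use is a softmax over candidate
    # captions, so the "bear" score only means something compared to the other
    # options. Random stand-in embeddings again.
    import torch
    import torch.nn.functional as F

    img_emb = F.normalize(torch.randn(512), dim=0)
    captions = ["a bear", "a dog", "a sunset"]
    txt_embs = F.normalize(torch.randn(3, 512), dim=-1)

    sims = txt_embs @ img_emb                  # raw cosine similarities, no fixed scale
    probs = F.softmax(sims / 0.07, dim=0)      # only meaningful relative to the alternatives
    for caption, p in zip(captions, probs):
        print(f"{caption}: {p.item():.2f}")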
OpenAI had no idea it could be used to generate images itself, which is why they left in quirks like the fact that it thinks an apple and the word "apple" written on a piece of paper are the same thing. They probably wouldn't have released it if they'd known.