What do you suppose the mechanism is for the Charlie and the Chocolate Factory image having a golden ticket held aloft by somebody’s right hand, with a person in a purple outfit and top hat? The page says:
> a brief text description of a movie
However, apart from the existence of a golden ticket, I wouldn’t expect those details to make it into a brief description of the film. And yet there’s an original poster matching those details that the VQGAN + CLIP-generated image seems to draw from.
Even more convincing to me is the face of John Malkovich being on the poster of Being John Malkovich. Unless the text includes a pretty accurate description of his face (hairstyle, gender, age, facial hair, skin color), the model must have encountered his appearance in its training set.
That's not enough to reconstruct the face of John Malkovich from text; you would need minute facial-feature parameters (eye shape, nose shape, eye-nose distances, etc.).
Because he is famous on the Internet, CLIP “knows” what John Malkovich looks like. Or, more accurately: what an image people would label “John Malkovich” feels like.
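You can see this directly by asking CLIP to score an image against a few candidate captions. Here's a minimal sketch using the openai/CLIP package (`pip install git+https://github.com/openai/CLIP`); the file name `poster.jpg` and the candidate captions are illustrative assumptions, not anything from the original post:

```python
# Sketch: CLIP "knowing" a face just means the image scores highest
# against the caption containing his name. Assumes a local file
# "poster.jpg" (hypothetical) containing the generated poster.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("poster.jpg")).unsqueeze(0).to(device)
captions = ["John Malkovich", "an anonymous middle-aged man", "a movie poster"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each caption, as softmax probs.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```

If the training set contained many images labeled with his name, the first caption wins by a wide margin; no textual description of his features is needed anywhere.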
Wouldn't the most obvious explanation be a description that mentions Willy Wonka's chocolate factory, a phrase that doesn't really turn up anywhere in the training data except alongside the original film media?
Star Wars is an interesting example because it appears to include elements lifted directly from the film (bits of stormtroopers' bodies) alongside a princess who definitely isn't Leia. The algorithm might be creating things from scratch at a high level, but the constituent elements are pretty clearly close reproductions of parts of the source material.
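For context on what "the algorithm" is doing: VQGAN + CLIP is essentially gradient ascent on CLIP's image-text similarity, with VQGAN supplying the image. The sketch below is the same mechanism in miniature, except it optimizes raw pixels directly instead of VQGAN latent codes (a deliberate simplification so it stays self-contained); the prompt, step count, and learning rate are illustrative assumptions. This is also why recognizable fragments surface: the optimizer pulls in whatever local patterns raise the score, without ever copying a whole image coherently.

```python
# Sketch of the generation loop: ascend the gradient of CLIP similarity.
# Real VQGAN + CLIP backpropagates into VQGAN's latent codes; here we
# optimize pixels directly, which demonstrates the same mechanism.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    tokens = clip.tokenize(["a movie poster of a princess"]).to(device)
    text = F.normalize(model.encode_text(tokens), dim=-1)

# CLIP's published input-normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# Start from noise and nudge pixels toward a higher CLIP score.
pixels = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([pixels], lr=0.05)

for step in range(200):
    img = (pixels.clamp(0, 1) - mean) / std
    image_features = F.normalize(model.encode_image(img), dim=-1)
    loss = -(image_features * text).sum()  # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
```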