From the paper:

> Limitations
> Although conditioning image generation on CLIP embeddings improves diversity, this choice does come with certain limitations. In particular, unCLIP [Dall-E 2] is worse at binding attributes to objects than a corresponding GLIDE model.
The binding problem is interesting. It appears that the way Dall-E 2 / CLIP embeds text causes the concepts within the prompt to be jumbled together: in the paper's example, "a red cube on top of a blue cube" produces images that effectively render "cubes, red, blue, on top", with the attributes detached from the objects they belong to. This opens a clear avenue for improvement.
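As a rough illustration (my sketch, not something from the paper), the binding problem can be probed directly: a text encoding that preserved attribute binding should distinguish "a red cube on top of a blue cube" from "a blue cube on top of a red cube", yet CLIP's text embeddings for the two tend to land almost on top of each other. A minimal sketch using the Hugging Face port of CLIP (the checkpoint name is an illustrative choice):

```python
# Probe the binding problem: if CLIP's text encoder "jumbles" attributes
# and objects, prompts that merely swap the colors should map to nearly
# identical embeddings. Sketch only; checkpoint choice is illustrative.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a red cube on top of a blue cube",  # the paper's example
    "a blue cube on top of a red cube",  # attributes swapped
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**inputs)

# Normalize, then compare: cosine similarity near 1.0 means the embedding
# barely encodes which color binds to which cube.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (text_features[0] @ text_features[1]).item()
print(f"cosine similarity between swapped prompts: {similarity:.3f}")
```

If the similarity comes out close to 1.0, the embedding is carrying roughly "cubes, red, blue, on top" rather than which color binds to which cube, which is consistent with the jumbled samples the paper reports.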