The model is "confused about the prompt" because it has no concept of a scene or of (some sort of) reality.
If we gave the prompt "koala dunking a basketball" to a human and presented them with two images, one of a koala climbing a tree and another of a basketball player dunking - the human would easily cut out the foregrounds (human, koala) from the backgrounds (basketball court, forest) and swap their places.
The laborious part would be to match the shadows and angles in the new image. This requires skill and effort.
DALL-E would conjure up an entirely novel image from scratch, dodging this step - it blended the concepts instead. Great.
But it does not understand what a basketball court actually is, or why the koala would be reflected in a puddle. Or why and how this new koala might look different in these circumstances from the previous examples of koalas it knows about.
The human dunker and the koala dunker are not truly interchangeable. :)