While it's true the dataset isn't well curated or perfectly labeled, the problem could simply be that grammar isn't understood: the labels could be clear to a human, such as whether an image is a picture of Bob Ross or a painting by him, but the training misses that relationship. Even with poorly labeled data, I suspect AI will eventually figure out which labels are more likely to be poor and handle them appropriately.
In the reverse direction, you can try:
A horse rides an astronaut
And you will probably get an astronaut riding a horse. The prompt isn't a poor description of what we want; the model simply isn't honoring our assumptions about how grammar should work, treating "horse" and "astronaut" as interchangeable nouns rather than as subject and object.
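If you want to try this yourself, here is a minimal sketch using the open-source Hugging Face diffusers library with the public Stable Diffusion v1.5 checkpoint; the specific model is my assumption, and any text-to-image pipeline should show the same tendency:

```python
# A minimal sketch: probing grammar handling in a text-to-image model.
# Assumes the Hugging Face `diffusers` library and the public Stable
# Diffusion v1.5 checkpoint (my choice; other pipelines work too).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Both prompts tend to yield an astronaut on a horse, suggesting the
# model matches nouns rather than parsing who rides whom.
for prompt in ["An astronaut rides a horse", "A horse rides an astronaut"]:
    image = pipe(prompt).images[0]
    image.save(f"{prompt.replace(' ', '_')}.png")
```

Comparing the two output images side by side makes the point directly: swapping the subject and object of the sentence rarely swaps them in the picture.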