I believe this is the most likely.
I expect that the local correlation structure in images is more homogeneous than in language, that language offers more chances to make mistakes (e.g. confusing the intangible and the tangible: you can't see green ideas sleeping furiously) and, perhaps, that we even criticise mistakes from long-range interactions more in language.