That's more or less what I would expect from even the best language model: things that look very close to real but fail in ways a smart human can spot.
You need a "knowledge" model to regurgitate facts and an "inference" model to evaluate the probability that a statement is correct.
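To make that split concrete, here is a rough Python sketch of what such a two-stage pipeline could look like, using off-the-shelf Hugging Face pipelines as stand-ins. The model names, labels, and prompt are my own illustrative choices, and zero-shot classification is only a crude proxy for a real fact-checking model:

    from transformers import pipeline

    # "Knowledge" model: generates candidate factual statements.
    # "Inference" model: scores how likely each statement is to be correct.
    # Model names, labels, and the prompt are illustrative assumptions.
    generator = pipeline("text-generation", model="gpt2")
    verifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    # 1. Let the knowledge model produce candidate completions.
    candidates = generator(
        "The speed of light in a vacuum is",
        max_new_tokens=20,
        num_return_sequences=3,
        do_sample=True,
    )

    # 2. Let the inference model assign a rough probability of correctness.
    for candidate in candidates:
        text = candidate["generated_text"]
        result = verifier(text, candidate_labels=["correct", "incorrect"])
        score = dict(zip(result["labels"], result["scores"]))["correct"]
        print(f"{score:.2f}  {text!r}")

The point of the sketch is only the division of labor: one model proposes statements, a separate model judges them, instead of asking a single generative model to do both.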
Yes - that's the whole point of language models ... to model the language, and not the content.
Similar for image generation - the model optimizes for what looks acceptable (i.e., like other images), not for what makes sense. It's amazing that they get such interesting results, though the more striking part is that we humans read meaning into the images.
I disagree. If you give an image generation model a prompt like "an astronaut riding a horse," you get a picture of an astronaut riding a horse. If you ask this model for a mathematical proof, it does not give you a mathematical proof.
For "an astronaut riding a horse" the system is filtering/selecting but nowhere does it understand (or claim to understand) horses or astronauts. It's giving you an image that "syntactically" agrees with other images that have been tagged horse/riding/astronaut.
The amazing bit is that we are happy to accept the image. Look closely at such images - they're always "wrong" in subtle but important ways, but we're happy to ignore that when we interpret the image.
I suspect the issue arises from the difference in specificity of the desired result. When we say "astronaut riding a horse" we may have preconceptions, but any astronaut riding any horse will likely be acceptable, whereas a specific proof of a mathematical result has only a very few, very specific solutions. Effectively it is like the idea from measure theory that even a very large collection of isolated points has zero area, while even a small polygon or region has nonzero area. Specific things like proofs are point-like knowledge, while the picture of an astronaut riding a horse is a surface.
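In symbols (my own gloss on the analogy, with lambda the Lebesgue measure on the plane):

    \lambda\bigl(\{p_1, p_2, p_3, \dots\}\bigr) = 0
    \qquad \text{whereas} \qquad
    \lambda(R) > 0 \ \text{for any region } R \text{ with nonempty interior.}

Valid proofs live in the point-like, measure-zero set of acceptable outputs; acceptable astronaut-on-a-horse pictures fill out a region of positive measure.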
The situation you describe is exactly the "Chinese room" argument. I don't want to get too far into the weeds here, but the DALL-E / Stable Diffusion models are cool because they do what you ask, even if they do so imperfectly. This model from Facebook cannot accurately answer a single thing I've asked it.
I often hear the claim "AI does not really understand," but when you can ask it to draw an armchair in the shape of an avocado or an astronaut riding a horse on the Moon, and it does it (!!?), it's not like the "Chinese room" had any specific rules on the books for these questions. What more do people need to be convinced?