Imagine you are autocorrect, trying to find the most "correct sounding" answer to the question "Who won the Super Bowl?"
What sounds more "correct" (i.e. what matches your training data better):
A: "Sorry, I can't answer that because that event has not happened yet."
B: "Team X won with Y points on the Nth of February 2023"
Probably B.
That's one major problem with these models. They're great at repeating common patterns and updating those patterns with correct info, but not so great if you ask a question that has a common response pattern while the true answer to your question doesn't follow that pattern.
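A minimal sketch of that intuition (toy numbers and a hypothetical frequency table, not how any real model works): if a system scores candidate answers by how common their phrasing is in its training data, the confident "Team X won..." pattern can outscore the honest refusal even when the refusal is the correct answer.

```python
# Toy illustration only: the frequencies below are invented, not measured.
# Scores candidate answers by how common their phrasing pattern is assumed to be.

pattern_frequency = {
    # "<team> won with <points> points on <date>" style answers are everywhere
    # in sports coverage, so assume a high frequency.
    "Team X won with Y points on the Nth of February 2023": 0.90,
    # Refusals about events that have not happened yet are rarer in ordinary
    # text, so assume a low frequency.
    "Sorry, I can't answer that because that event has not happened yet.": 0.10,
}

def most_likely_answer(candidates):
    """Pick the candidate whose phrasing best matches the (assumed) training data."""
    return max(candidates, key=pattern_frequency.get)

print(most_likely_answer(list(pattern_frequency)))
# -> the confident "Team X won ..." answer, regardless of whether the event
#    has actually happened yet.
```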
Yes, it actually sometimes gives C, sometimes B, and sometimes makes up E. That's how probability works, and that's not helpful when you want to look up an event that occurred in physical space (quantum mechanics aside :D).
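A toy sketch of the "that's how probability works" point (the answers and weights below are made up for illustration): if the model samples from a distribution over answers rather than always returning the same one, repeated asks can come back as B, C, or some freshly invented E.

```python
import random

# Hypothetical distribution over possible answers; the weights are invented.
answers = ["B: plausible but wrong", "C: a different guess", "E: freshly made-up nonsense"]
weights = [0.6, 0.25, 0.15]

# Sampling (instead of always taking the single most likely answer) means each
# ask can return something different, which is exactly what you don't want when
# you're trying to look up a fact about a real event.
for _ in range(5):
    print(random.choices(answers, weights=weights, k=1)[0])
```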
I've never had it say 'I don't know', but it apologizes and admits it was wrong plenty.
Sometimes it comes up with a better, acceptably correct answer after that, sometimes it invents some new nonsense and apologizes again if you point out the contradictions, and often it just repeats the same nonsense in different words.
one of the things it's exceptionally well trained at is saying that certain scenarios you ask it about are unknowable, impossible or fictional
Generally, for example, it will answer a question about a future-dated event with "I am sorry but xxx has not happened yet. As a language model, I do not have the ability to predict future events", so I'm surprised it gets caught out on Super Bowl examples, which must be closer to its test set than most future questions people come up with.
It's also surprisingly good at declining to answer completely novel trick questions like "when did Magellan circumnavigate my living room" or "explain how the combination of bad weather and woolly mammoths defeated Operation Barbarossa during the Last Age", and even explaining why: clearly it's been trained to the extent that it categorises things temporally, spots mismatches (and weighs the temporal mismatch as more significant than conceptual overlaps like circumnavigation and cold weather), and even explains why the scenario is impossible. (Though some of its explanations for why things are fictional are a bit suspect: I think most cavalry commanders in history would disagree with the assessment that "Additionally, it is not possible for animals, regardless of their size or strength, to play a role in defeating military invasions or battle"!)
on some topics at least it correctly identifies bogus questions. I extensively tried asking about non-existent Apollo missions, for example, including Apollo 3.3141952, Apollo -1, Apollo 68, and loaded questions like when Apollo 7 landed on the moon, and it correctly pointed out the impossible combinations. This is a well-researched topic, though.
Only if it’s a likely response or if it’s a canned response. Remember that ChatGPT is a statistical model that attempts to determine the most likely response following a given prompt.