> 'I do not know' or 'I am not sure' for every occasion when it is not 100% sure of something (like a human can), this would drastically improve the usefulness.
This is exactly what a language model does though, just at a different level of abstraction. It gives you a probability distribution over tokens at each step. That distribution can be narrow (low entropy, certain) or wide (high entropy, uncertain). The language output you see is just a sampling at some temperature from these distributions.
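To make that concrete, here is a minimal sketch of softmax-with-temperature sampling and entropy over a next-token distribution. The logits are toy numbers I made up, not from any real model:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution over tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits: low = model is certain, high = uncertain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits for two situations (illustrative only).
confident_logits = [8.0, 1.0, 0.5, 0.2]   # one token dominates -> narrow distribution
uncertain_logits = [2.0, 1.9, 1.8, 1.7]   # nearly flat -> wide distribution

p_confident = softmax(confident_logits)
p_uncertain = softmax(uncertain_logits)
assert entropy(p_confident) < entropy(p_uncertain)

# The text you see is just a token sampled at some temperature from this distribution.
token = random.choices(range(4), weights=softmax(uncertain_logits, temperature=0.7))[0]
```

Lower temperature sharpens the distribution toward the most likely token; higher temperature flattens it, so sampling becomes more exploratory.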
Though glancing at your paper I assume you are aware of this and I am missing the point you are making?
Statistical approaches most commonly require you to apply a threshold. The model's output can be above the threshold and still be wrong, or below the threshold and still be correct. You can never tell for sure; you just try to improve the benchmark average. That is not acceptable in most use cases, where a single wrong output can be disastrous.
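The failure mode described above can be sketched in a few lines. The (confidence, correct) pairs are invented toy numbers, just to show that acceptance and correctness come apart in both directions:

```python
# Toy (model confidence, actually correct) pairs -- illustrative, not real data.
predictions = [
    (0.95, True),   # confident and right
    (0.92, False),  # above any reasonable threshold, yet wrong
    (0.40, True),   # below the threshold, yet right
    (0.30, False),  # unconfident and wrong
]

THRESHOLD = 0.8  # hypothetical acceptance cutoff

# A threshold only controls the average trade-off; it cannot guarantee
# that any individual accepted answer is correct.
accepted_but_wrong = any(c >= THRESHOLD and not ok for c, ok in predictions)
rejected_but_right = any(c < THRESHOLD and ok for c, ok in predictions)
assert accepted_but_wrong and rejected_but_right
```

Moving the threshold trades one error type for the other; no setting eliminates both.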
When a human does not know something, they can say so with 100% certainty.
I don't see the difference between a human and a statistical model here. Surely, in order to select an action to take, a person also has to apply some sort of threshold to their confidence? E.g. how is a doctor deciding whether or not to amputate an organ based on an x-ray different from a classification model for the same task?
That problem aside, language models like Gopher are in fact generative, so no such threshold is needed! You instead sample from the implicit distribution.
The correct analogy would be if I asked you when Neil Armstrong landed on Mars: you know with 100% certainty that the answer is 'never'. A statistical model may output '1969' with 10% confidence and/or '2147' with 3% confidence.