Source identification isn't a space amenable to guessing, no matter how much data you throw at it.
Here's an exercise you can try: Cite some information from the next issue of Science to be published. Cite anything you like from it.
You can make some plausible stuff up. You could make even more plausible stuff up if you went and scanned over the past few issues first. But without specific knowledge of the contents of the next issue, you aren't going to be able to create real citations. This is what LLMs lack, by their nature. It's not a criticism, it's a description.
You can't guess sources. The possibility space is too large, the distribution too pathological, and the criteria for being correct too precise.
GPT will never cite sources correctly. Some future AI that uses GPT as a component, but isn't made entirely out of a language model, will be able to, by pulling the citation out of the non-GPT component. Maybe that will need to be built as an explicit feature, maybe it won't; only time will tell. But expecting language models to cite sources correctly is not sensible. It's just not a thing they can do.
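To make the "non-GPT component" idea concrete, here is a rough sketch of one way such a split could look. Everything in it is an assumption for illustration: the `Record` store, the word-overlap `retrieve` function, and the `generate` callable standing in for the language model are all hypothetical, and the source strings are made up. The only point is that the citation is copied from a stored record and never produced by the text generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Record:
    """A stored passage plus the source string it was ingested with."""
    source: str  # hypothetical citation, stored verbatim at ingestion time
    text: str


def tokenize(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return words - {""}


def retrieve(query: str, store: List[Record]) -> Optional[Record]:
    """Return the record sharing the most words with the query, or None."""
    query_words = tokenize(query)
    best: Optional[Record] = None
    best_score = 0
    for record in store:
        score = len(query_words & tokenize(record.text))
        if score > best_score:
            best, best_score = record, score
    return best


def answer_with_citation(
    query: str,
    store: List[Record],
    generate: Callable[[str, str], str],
) -> Tuple[str, Optional[str]]:
    """Let the generator write the prose; take the citation from the store."""
    record = retrieve(query, store)
    if record is None:
        # Nothing was looked up, so there is nothing to cite.
        return "No source found.", None
    # The citation is a lookup result, not generated text.
    return generate(query, record.text), record.source
```

With a store of two made-up records, a query that overlaps one of them returns that record's stored source string unchanged, and a query that overlaps neither returns no citation at all instead of a plausible-looking invented one.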