Is there any recent research on training LLMs that can trace the contribution of sources of training data to any given generated token? Something like meta-nodes that track how much a certain training document, or set of documents, caused a given weight to be updated?
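To make concrete what I mean, here is a minimal toy sketch of the kind of gradient-based "training data attribution" I have in mind (roughly in the spirit of published work like influence functions or TracIn): score each training document by how aligned its gradient is with the gradient of the output you want to explain. The model, documents, and loss below are stand-in toys, not anyone's actual system.

```python
# Toy sketch of gradient-based training-data attribution.
# Assumptions: a tiny linear "model", random stand-in documents,
# and a TracIn-style score (dot product of per-example gradients).
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, n_classes = 100, 10
model = nn.Linear(vocab_size, n_classes)   # stand-in for an LLM
loss_fn = nn.CrossEntropyLoss()

def flat_grad(x, y):
    """Gradient of the loss on one example, flattened into a single vector."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# Fake "training documents" and one "generated token" to explain.
train_docs = [(torch.rand(1, vocab_size), torch.randint(n_classes, (1,)))
              for _ in range(5)]
test_x, test_y = torch.rand(1, vocab_size), torch.randint(n_classes, (1,))

# A large positive score suggests the update from that document pushed
# the model's parameters toward producing this particular output.
g_test = flat_grad(test_x, test_y)
scores = [torch.dot(flat_grad(x, y), g_test).item() for x, y in train_docs]

for i, s in enumerate(scores):
    print(f"doc {i}: influence ~ {s:+.4f}")
```

Scaling something like this to a frontier-sized model and corpus is, as far as I can tell, the hard open problem.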
I fear that OpenAI is incentivized, financially and legally, not to delve too deeply into this kind of research. But IMO attribution, even if imperfect, is a key part of aligning AI with the interests of society at large.
A small embedding distance between a set of output tokens and a document doesn't mean the output was sourced from that document; the two could simply be about similar things.
I’m looking for the equivalent of the human notion of: “I remember where I was when that stupid boy Jeff first tricked me into thinking that ‘gullible’ was written on the ceiling, and I think of that moment whenever I’m writing about trickery.”
Or, more contextually: “I know that nowadays many people are talking about that, but a few years ago I think I read about it first in the Post.”