There's an entire literature on very closely related concepts and issues (arguably many of the same issues) in psychological testing and measurement. There it's discussed in terms of internal and external validity, but interpretation is at its core, and the scenario (and often the models, at some level) is very similar. There you are trying to discriminate between psychologically relevant states, outcomes, or variables based on inputs in the form of responses to test items. The focus is on articulating how to interpret test items and model structural features vis-à-vis inputs and outputs.
The literature on this is too hard to summarize in a post, but basically it turns into an empirical scientific question: making predictions about model features and testing those predictions scientifically.
I am happy with how the closing sections of the article address the main concern I've had for a long time regarding this line of research. The research is no doubt interesting as a purely scientific pursuit.
Thank you! I've been pretty obsessively thinking about meta-science issues around interpretability for the last six months or so. :)
A notable researcher privately told me that they think all interpretability research is nonsense. As someone who's dedicated the last six years of my life to this field, that was pretty uncomfortable to hear. But I think it's important to pay attention to, because I think it's actually a pretty common, unspoken view.
As a result, this has been on my mind a great deal. I think two important questions are:
(1) How can we surface the disagreements that are leading to such divergent views between different members of the research community? (Especially when people are generally too polite to say that they think something is total nonsense.)
(2) What would a more epistemically stable foundation for interpretability look like?
I'm not sure what the right answers to these are, but I think they're important to discuss.