I always thought it seemed likely that most needle in a haystack tests might run...

I always thought it seemed likely that most needle in a haystack tests might run into the issue of the model just encoding some idea of 'out of place-ness' or 'significance' and querying on that, rather than actually saying something meaningful about generalized retrieval capabilities. Does that seem right? Is that the motivation for this test?