I think it's fair to want to keep an evaluation private so that it doesn't become part of a train set, but you should know that OpenAI uses users chat data to improve their models (not for entreprise)
This does sound like a test that is almost "set up to fail" for an LLM. If the answer is something that most people think they know, but actually don't then it won't pass in an LLM which is essentially a distillation of the common view.