I have seen numerous posts of LLM Q&A where, by the time people try to replicate them, GPT-4 is fixed. Either OpenAI is actively monitoring the Internet and patching these cases, or the Internet is actively conspiring to present falsified GPT-4 results to discredit OpenAI.
You aren't able to get access to the 'Open'AI dataset though, are you? Agreed, it would be an excellent addition for comparing source-available models, but it doesn't help with the accusations of foul play by OpenAI, nor with the existence of an anti-OpenAI conspiracy.
GPT-4 (at least) is explicit in saying that it learns from users' assessments of its answers, so yes, the only valid way to test is to give it a variation of the prompt and see how well it does. GPT-4 failed the "Sally" test for the first time after 8 tries, once I had changed every parameter. It got it right on the next try.
It's important to remember that GPT-4 is only deterministic at the batch level, because it is a mixture-of-experts model. Basically, every time you invoke it, your query could get routed to a different expert depending on what else is in the batch. At least, this is my understanding based on others' analyses.
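For intuition, here's a toy sketch of the hypothesized mechanism. Everything in it (top-1 routing, the capacity limit, the numbers) is an illustrative assumption, not GPT-4 internals; the point is just that with per-expert capacity limits, whether a token reaches its preferred expert can depend on the other tokens sharing its batch:

    import numpy as np

    N_EXPERTS, CAPACITY = 2, 2  # toy sizes; real MoE layers are far larger

    def route_top1(gate_logits):
        # Greedy top-1 routing with a per-expert capacity limit.
        # Tokens are assigned in batch order; once an expert is full,
        # later tokens that preferred it overflow (marked -1 here).
        load = np.zeros(N_EXPERTS, dtype=int)
        assigned = []
        for e in gate_logits.argmax(axis=-1):
            if load[e] < CAPACITY:
                assigned.append(int(e))
                load[e] += 1
            else:
                assigned.append(-1)  # dropped/rerouted -> different computation
        return assigned

    my_logits = [0.1, 0.9]  # my token prefers expert 1

    quiet_batch = np.array([[0.9, 0.1],  # neighbours prefer expert 0
                            [0.8, 0.2],
                            my_logits])
    busy_batch = np.array([[0.2, 0.8],   # neighbours also want expert 1
                           [0.3, 0.7],
                           my_logits])

    print(route_top1(quiet_batch)[-1])  # 1: my token reaches expert 1
    print(route_top1(busy_batch)[-1])   # -1: expert 1 is full, token overflows

Same token, same weights, different batch companions, different outcome. That's all "deterministic only at the batch level" would mean here.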
Do you have a source for this? I've also considered it, but I never saw any evidence that this is how GPT-4 is implemented.
I've always wondered how a system of multiple specialized small LLMs (with a "router LLM" in front of them all) would fare against GPT-4. Do you know if anyone is working on such a project?
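The plumbing for that pattern is pretty simple; the hard part is training the specialists. A minimal sketch of the dispatch layer might look like this (the specialist functions and the keyword router are hypothetical stand-ins; in the real design the router would itself be a small LLM prompted to emit one of the category labels):

    from typing import Callable, Dict

    # Stand-ins for specialist models; in practice each would wrap a
    # small fine-tuned LLM (one for code, one for math, one for chat, ...).
    def code_model(prompt: str) -> str:
        return f"[code specialist] {prompt}"

    def math_model(prompt: str) -> str:
        return f"[math specialist] {prompt}"

    def chat_model(prompt: str) -> str:
        return f"[general specialist] {prompt}"

    SPECIALISTS: Dict[str, Callable[[str], str]] = {
        "code": code_model,
        "math": math_model,
        "chat": chat_model,
    }

    def route(prompt: str) -> str:
        # Trivial keyword router, purely for illustration; the interesting
        # version replaces this with a cheap classifier LLM.
        lowered = prompt.lower()
        if any(k in lowered for k in ("def ", "compile", "bug", "python")):
            return "code"
        if any(k in lowered for k in ("integral", "solve", "equation")):
            return "math"
        return "chat"

    def answer(prompt: str) -> str:
        return SPECIALISTS[route(prompt)](prompt)

    print(answer("Why does my Python loop never terminate?"))  # -> code specialist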