I agree, but it does seem a bit strange that you are allowed to "custom-fit" an AI program to solve a specific benchmark. Shouldn't there be some rule that, for something to count as AGI, it should work as "off-the-shelf" as possible?
If OpenAI had an embedded Python interpreter, or for that matter an interpreter for the lambda calculus or some other Turing-complete formalism, then this approach would work, but there are no LLMs with embedded symbolic interpreters. Current LLMs are essentially probability distributions learned from a training corpus and have no built-in symbolic reasoning capabilities. There is, for example, no backtracking of the kind Prolog performs.
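To make the contrast concrete, here is a minimal sketch (in Python, since Prolog's engine does this implicitly) of what chronological backtracking means: bind a variable, recurse, and undo the binding on failure. The function names and the toy constraint are my own illustration, not anything from an actual LLM or Prolog system.

```python
def solve(assignment, variables, domain, constraint):
    """Depth-first search with chronological backtracking."""
    if not variables:
        return assignment                  # every variable bound: success
    var, rest = variables[0], variables[1:]
    for value in domain:                   # choice point, like a Prolog clause
        assignment[var] = value
        if constraint(assignment):
            result = solve(assignment, rest, domain, constraint)
            if result is not None:
                return result              # propagate the first solution found
        del assignment[var]                # backtrack: undo the binding
    return None                            # all choices exhausted: fail

def ok(a):
    """Toy constraint: distinct digits with x + y == z."""
    if len(set(a.values())) != len(a):
        return False                       # bindings must be pairwise distinct
    if len(a) < 3:
        return True                        # partial assignment: keep searching
    return a["x"] + a["y"] == a["z"]

print(solve({}, ["x", "y", "z"], range(10), ok))  # e.g. {'x': 1, 'y': 2, 'z': 3}
```

A Prolog query over the same problem would explore and retract bindings exactly like this, but as a core evaluation mechanism rather than user code. A plain LLM forward pass has no analogous step where a committed choice is undone.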