This article is rooted in the junk science of James Zou et al., published in their paper "How is ChatGPT's behavior changing over time?" (https://arxiv.org/abs/2307.09009). Initially, one of their benchmark scripts tripped over Markdown backticks that GPT-4 had started adding around its code, which tanked the scores. Simon Boehm fixed it for them, and the scores improved significantly (https://twitter.com/Si_Boehm/status/1681801371656536068). Now the authors are back with a revision (without crediting Boehm). Still, the benchmarks are silly because they measure things LLMs are historically bad at: math and "do not do x" instructions (the ARC benchmark). One could argue this is deliberate. I don't think so: Zou is not part of Stanford HAI but of Stanford Data Science. I think he's simply publishing outside his area of expertise, and the popular media are once again falling for it.
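To illustrate the backtick bug: when a model wraps its answer in a Markdown code fence, a harness that exec()s the raw reply chokes on the ``` characters and marks the answer wrong, even though the code inside is correct. Here's a minimal sketch of the failure mode and a fix (this is my own illustration, not the authors' actual evaluation script; `strip_code_fences` is a hypothetical helper):

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove a surrounding Markdown code fence (``` or ```python) if present."""
    match = re.match(r"^```[a-zA-Z]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text

# A reply wrapped in a Markdown fence, as newer GPT-4 snapshots produced:
fenced_reply = "```python\ndef add(a, b):\n    return a + b\n```"

# Naive scoring: exec the raw reply -> SyntaxError -> scored as a failure.
try:
    exec(fenced_reply, {})
    naive_score = 1
except SyntaxError:
    naive_score = 0

# After stripping the fence, the very same reply executes fine.
namespace = {}
exec(strip_code_fences(fenced_reply), namespace)
fixed_score = 1 if namespace["add"](2, 3) == 5 else 0
```

So an apparent drop in "code generation ability" can be nothing more than a formatting change colliding with a brittle parser, which is exactly what Boehm's fix showed.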
