This article is rooted in the junk science of James Zou et al., published in their paper "How is ChatGPT's behavior changing over time?" (https://arxiv.org/abs/2307.09009). Originally, one of their benchmark scripts choked on the Markdown backticks GPT-4 had started adding around its code, which tanked the scores. Simon Boehm fixed it for them, and the scores improved significantly (https://twitter.com/Si_Boehm/status/1681801371656536068). Now the authors are back with a revision (without crediting Boehm). Still, the benchmarks are silly because they measure exactly what LLMs are historically challenged with: math, and "do not do X" cases (the ARC benchmark). One could argue that this is on purpose. I don't think so: Zou is not part of Stanford HAI but of Stanford Data Science. I think he's simply publishing outside his area of expertise, and popular media are once again falling for it.
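For the curious, the backtick bug is the kind of thing a one-line preprocessing step fixes: the harness just has to unwrap a Markdown code fence before trying to run the generated code. A rough sketch of my own (not the authors' actual script; the function name and regex are made up for illustration):

    import re

    def extract_code(model_output: str) -> str:
        # If the model wrapped its answer in a Markdown fence (```python ... ```),
        # pull out the code inside; otherwise return the output unchanged.
        match = re.search(r"```(?:python)?\s*\n(.*?)```", model_output, re.DOTALL)
        return match.group(1) if match else model_output

    # A fenced reply that a naive harness would treat as non-executable text:
    reply = "```python\ndef add(a, b):\n    return a + b\n```"
    print(extract_code(reply))  # prints the bare function, which can now be compiled and run

Scoring the raw fenced string as "not directly executable" is exactly how you get an artificial drop in measured coding ability.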
"The research from the Stanford-Berkeley team shows empirically that it isn’t just an anecdotal impression. The chatbot has become empirically worse at certain functions, including calculating math questions, answering medical questions and generating code."
I don't get how you can understand LLMs well enough to confidently say stuff like this, yet not see how many different ways the article is eye-rollingly stupid:
They're conflating ChatGPT the website with the underlying model. The former uses a system prompt that changes significantly over time, completely independently of AI alignment. Their recent move to custom system prompts confirms what everyone suspected: they've been running around like headless chickens trying to tweak that prompt to make everyone happy, but no single default can achieve that.
It also uses summarization to enable long chats... which sometimes leads laypeople to claim it got worse or forgot how to do X within a single conversation, when really their original instructions have long since left the context window.
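To make that concrete, a chat model only ever sees what still fits in its window. Something like the toy trimmer below (purely my own illustration; OpenAI's actual long-chat handling isn't public, and the token counting here is deliberately crude) is enough to make early instructions silently vanish:

    def fit_to_window(messages, max_tokens=4096, count=lambda m: len(m.split())):
        # Keep the most recent messages that fit in the budget,
        # dropping the oldest ones first (including the user's original instructions).
        kept, used = [], 0
        for msg in reversed(messages):
            cost = count(msg)
            if used + cost > max_tokens:
                break
            kept.append(msg)
            used += cost
        return list(reversed(kept))

    chat = ["Always answer in French."] + ["...long exchange..."] * 5000
    print(fit_to_window(chat)[:1])  # the French instruction is long gone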
-
And then there's the fact that they're judging it on its ability to do "basic math" in the context window, when the only actual update to the underlying model was centered on making function calling more reliable...
I mean, the Code Interpreter is now live; it makes ChatGPT brilliant at basic math and a hell of a lot more than that. Basic math isn't basic for an attention-based model.
I don't want to defend LLMs and generative models (which are cool but still a long way from true AI). But it's not like people don't suffer from the same problem -- perhaps not with basic arithmetic, but with many complex problems.
And just to clarify, what the article referred to as "basic math" wouldn't be that easy for a human either.
One of their main examples is checking whether 19997 is prime; I really don't think the probability of me getting that right without a calculator is anywhere close to 1.
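For scale: trial division only has to test divisors up to √19997 ≈ 141, so a calculator or a few lines of code settle it instantly; it's only hard when a human has to grind through roughly 140 divisions by hand. A minimal sketch:

    from math import isqrt

    def is_prime(n: int) -> bool:
        # Trial division up to floor(sqrt(n)) is enough: any composite n
        # has a divisor no larger than its square root.
        if n < 2:
            return False
        for d in range(2, isqrt(n) + 1):
            if n % d == 0:
                return False
        return True

    print(is_prime(19997))  # True: no divisor up to 141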