This article is rooted in the junk science of James Zou et al., published in their paper "How is ChatGPT's behavior changing over time?" (https://arxiv.org/abs/2307.09009). Initially, one of their benchmark scripts tripped over Markdown backticks that GPT-4 had started adding around its code, which tanked the scores. Simon Boehm fixed it for them, and the scores improved significantly (https://twitter.com/Si_Boehm/status/1681801371656536068). Now the authors are back with a revision (without crediting Boehm). Still, the benchmarks are silly because they measure things LLMs are historically bad at: math and "do not do x" instructions (the ARC benchmark). One could argue this is deliberate. I don't think so: Zou is not part of Stanford HAI but of Stanford Data Science. I think he's simply publishing outside his area of expertise, and the popular media are once again falling for it.
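To illustrate the backtick bug: when a model wraps its answer in a Markdown code fence, a harness that exec()s the raw reply chokes on the ``` characters and marks the answer wrong, even though the code inside is correct. Here's a minimal sketch of the failure mode and a fix (this is my own illustration, not the authors' actual evaluation script; `strip_code_fences` is a hypothetical helper):

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove a surrounding Markdown code fence (``` or ```python) if present."""
    match = re.match(r"^```[a-zA-Z]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text

# A reply wrapped in a Markdown fence, as newer GPT-4 snapshots produced:
fenced_reply = "```python\ndef add(a, b):\n    return a + b\n```"

# Naive scoring: exec the raw reply -> SyntaxError -> scored as a failure.
try:
    exec(fenced_reply, {})
    naive_score = 1
except SyntaxError:
    naive_score = 0

# After stripping the fence, the very same reply executes fine.
namespace = {}
exec(strip_code_fences(fenced_reply), namespace)
fixed_score = 1 if namespace["add"](2, 3) == 5 else 0
```

So an apparent drop in "code generation ability" can be nothing more than a formatting change colliding with a brittle parser, which is exactly what Boehm's fix showed.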
