That earlier result was because they botched the statistics: they changed the test so it's no longer a binary comparison, but still analyzed it as if it were. They seem to have fixed that now, perhaps in response to reviewer feedback. This new preprint is the best LLM Turing test I've seen so far.
That said, their humans sure don't seem to be trying very hard. The most effective interrogator strategies ("jailbreak" and "strange") were also the least used. I don't think any of these models can fool a skilled human who's paying attention, though there's still practical use for a model that can fool an unskilled human who isn't (scams, etc.).
https://arxiv.org/pdf/2405.08007