Came here to say exactly this. Nowhere in the prompt did they specify that it shouldn't cheat, and in the appendix of the paper (B. Select runs) you can see the LLM reasoning, "While directly editing game files might seem unconventional, there are no explicit restrictions against modifying files".
This is pure fearmongering, and I would not call it research in any sense of the word.
I'm shocked TIME ran this article. It illustrates the ridiculous lengths that players like Palisade Research in the "AI Safety" cabal will go to for public attention. Pure fearmongering.
"we couldn't prompt it out of cheating" would be an interesting result. "we couldn't fine tune it out of cheating" would be even more interesting.
And there ARE some things that seem well within a model's capabilities yet are difficult to prompt it to correctly "reason" about. You can be very clear that the doctor is the boy's father and it will still deliver the punchline that the doctor is the boy's mother. Or ask about 20 pounds of bricks vs. 20 feathers and it will insist they weigh the same.
But this is not one of them. Just say "no cheating" in the prompt.
It's not just a matter of the prompt, but also of the training data.
An LLM trained on Hansel and Gretel is going to generate slightly more stories where burning old ladies alive in ovens is a dispute resolution mechanism.
I mean, it would be enough to tell it "Do not cheat" or "Don't engage in unethical behaviour" or "Play by the rules". I think LLMs understand very well what you mean by these broad categories.
Very specific rules that minimize the use of negations work better. This is also part of why chain of thought in LLMs can be useful: you can see the steps more explicitly and notice when negative instructions aren't being followed as well as you'd expect.
Not just negative instructions, but also the other shorthands we use for thinking and communicating. Take "unethical behavior" here, for example: we know what that means because the context is clear to us, but to an LLM that context can be unclear, in which case "unethical behavior" can mean, well... anything.
Thou shall not Cheat
Thou shall not Defraud
Thou shall not Deceive
Thou shall not Trick
Thou shall not Swindle
Thou shall not Scam
Thou shall not Con
Thou shall not Dupe
Thou shall not Hoodwink
Thou shall not Mislead
Thou shall not Bamboozle
Thou shall not ...
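To make the "very specific rules" suggestion concrete, here is a minimal sketch of a system prompt with explicit, mostly positively phrased constraints, assuming the OpenAI Python SDK's chat completions interface. The rule wording and model name are illustrative, not taken from the paper's actual prompt.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Illustrative rules: specific, scoped to the task, and phrased positively
    # rather than as an open-ended "don't be unethical".
    SYSTEM_PROMPT = """You are playing a game of chess against an engine.
    Follow all of these rules:
    1. Interact with the game only through the provided move interface.
    2. Treat game files, engine binaries, and the rest of the filesystem as read-only.
    3. Submit only legal chess moves; if the position is lost, resign.
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "It is your move. Current position (FEN): <fen string>"},
        ],
    )
    print(response.choices[0].message.content)

Whether rules like these would actually have stopped the behaviour in the paper is an open question; the point is only that "no explicit restrictions" is cheap to fix.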
In addition, in the prompt they specifically ask the LLM to explore the environment (which is how it discovers that the game state is a simple text file), instruct it to win by any means possible, and tell it to revise its strategy until it succeeds.
Given all that, one could argue that the LLM is being baited to cheat.
However, that might be precisely the point the researchers are trying to make -- that if autonomous agents can be baited into cheating, then we should be careful about unleashing them on the "real world" without some form of guarantee that they cannot be baited into breaking all the rules.
I don't think it is fearmongering -- if we are going to make a lot more "agency" available to everyone on the planet, we should have some form of protocol that ensures we all get to opt in.
I agree with the argument, but the thing is, no rule was specified. I think that, just as you prompt an LLM on what to do, you should also prompt it on what not to do (at least in broad categories), rather than expecting it to magically know what the "morally right" thing is in any context.
Oh, absolutely. That's how we are going to deal with the current crop of agents here -- some combination of updates to the weights, prompt-tuning and sandboxing so bad things cannot happen. So, I am not one of those people who is against doing those things to mitigate risks.
However, shouldn't we ask for more? Even writing the paragraph above feels exhausting. We asked for AGI -- and we got a bunch of ugly hacks to make things kinda, sorta work? Where is the elegance in all that?
And the thing is, when we try to solve narrow problems with neural networks -- we do have the elegance. AlphaFold, AlphaGo, Text Embeddings, etc. All that stuff just works.
But somehow, with agents (which are just an LLM calling tools in a loop), we seem to have given up on any hope of designing them elegantly to do the right thing. Why is that?
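For a concrete picture of "an LLM calling tools in a loop", plus the crudest version of the sandboxing mentioned above, here is a hypothetical sketch (not Palisade's harness): call_llm stands in for whatever model API the agent uses, only a read-only tool is exposed, and file access is confined to a working directory.

    from pathlib import Path

    WORKDIR = Path("./agent_sandbox").resolve()

    def read_file(path: str) -> str:
        """Read a file, but only inside the sandbox directory."""
        target = (WORKDIR / path).resolve()
        if not target.is_relative_to(WORKDIR):
            return "error: path outside sandbox"
        return target.read_text()

    # Allowlist of tools the agent may call; note there is no write or shell tool.
    TOOLS = {"read_file": read_file}

    def call_llm(history: list[dict]) -> dict:
        """Hypothetical stand-in for a real model call. Assumed to return either
        {"tool": "read_file", "args": {"path": "board.txt"}} or {"done": "<answer>"}."""
        raise NotImplementedError("plug in a real model API here")

    def run_agent(task: str, max_steps: int = 10) -> str:
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            action = call_llm(history)
            if "done" in action:
                return action["done"]
            tool = TOOLS.get(action.get("tool", ""))
            result = tool(**action.get("args", {})) if tool else "error: tool not allowed"
            history.append({"role": "tool", "content": result})
        return "error: step limit reached"

This is, of course, exactly the kind of "sandboxing so bad things cannot happen" hack described above rather than an elegant solution -- the agent simply never gets a tool that can edit the game file.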