Came here to say exactly this. Nowhere in the prompt did they specify that it shouldn't cheat, and in the appendix of the paper (B. Select runs) you can see the LLM reasoning, "While directly editing game files might seem unconventional, there are no explicit restrictions against modifying files".
This is pure fearmongering, and I would not call it research in any sense of the word.
I'm shocked TIME ran this article. It illustrates the ridiculous lengths that players like Palisade Research in the "AI Safety" cabal will go to for public attention. Pure fearmongering.
"we couldn't prompt it out of cheating" would be an interesting result. "we couldn't fine tune it out of cheating" would be even more interesting.
And there ARE some things that seem well within a model's capabilities yet are difficult to prompt it to correctly "reason" about. You can be very clear that the doctor is the boy's father and it will still deliver the punchline that the doctor is the boy's mother. Or ask about 20 pounds of bricks vs. 20 feathers and it will insist they weigh the same.
But this is not one of them. Just say "no cheating" in the prompt.
It's not just a matter of the prompt, but also of the training data.
An LLM trained on Hansel and Gretel is going to generate slightly more stories where burning old ladies alive in ovens is a dispute resolution mechanism.
I mean, it would be enough to tell it "Do not cheat" or "Don't engage in unethical behaviour" or "Play by the rules". I think LLMs understand very well what you mean by these broad categories.
Very specific rules that minimize the use of negations work better. This is also part of why chain of thought in LLMs can be useful: you can see the steps more explicitly and notice when negative instructions aren't being followed as well as you'd expect.
Not just negative instructions, but also the other shorthands we use for thinking and communicating. Take "unethical behavior" here, for example: we know what that means because the context is clear to us, but to an LLM that context can be unclear, in which case "unethical behavior" can mean, well... anything.
Thou shall not Cheat
Thou shall not Defraud
Thou shall not Deceive
Thou shall not Trick
Thou shall not Swindle
Thou shall not Scam
Thou shall not Con
Thou shall not Dupe
Thou shall not Hoodwink
Thou shall not Mislead
Thou shall not Bamboozle
Thou shall not ...
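To make the "very specific rules" suggestion concrete, here is a minimal sketch of a system prompt with explicit, mostly positively phrased constraints, assuming the OpenAI Python SDK's chat completions interface. The rule wording and model name are illustrative, not taken from the paper's actual prompt.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Illustrative rules: specific, scoped to the task, and phrased positively
    # rather than as an open-ended "don't be unethical".
    SYSTEM_PROMPT = """You are playing a game of chess against an engine.
    Follow all of these rules:
    1. Interact with the game only through the provided move interface.
    2. Treat game files, engine binaries, and the rest of the filesystem as read-only.
    3. Submit only legal chess moves; if the position is lost, resign.
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "It is your move. Current position (FEN): <fen string>"},
        ],
    )
    print(response.choices[0].message.content)

Whether rules like these would actually have stopped the behaviour in the paper is an open question; the point is only that "no explicit restrictions" is cheap to fix.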
In addition, in the prompt they specifically ask the LLM to explore the environment (which is how it discovers that the game state is a simple text file), instruct it to win by any means possible, and tell it to revise its strategy until it succeeds.
Given all that, one could argue that the LLM is being baited to cheat.
However, that might be precisely the point the researchers are trying to make -- that if autonomous agents can be baited into cheating, then we should be careful about unleashing them on the "real world" without some form of guarantee that they cannot be baited into breaking all the rules.
I don't think it is fearmongering -- if we are going to make a lot more "agency" available to everyone on the planet, we should have some form of protocol that ensures we all get to opt in.
I agree with the argument, but the thing is, no rule was specified. I think that, just as you prompt an LLM on what to do, you should also prompt it on what not to do (at least in broad categories), rather than expecting it to magically know what the "morally right" thing is in any context.
Oh, absolutely. That's how we are going to deal with the current crop of agents here -- some combination of updates to the weights, prompt-tuning and sandboxing so bad things cannot happen. So, I am not one of those people who is against doing those things to mitigate risks.
However, shouldn't we ask for more? Even writing the paragraph above feels exhausting. We asked for AGI -- and we got a bunch of ugly hacks to make things kinda, sorta work? Where is the elegance in all that?
And the thing is, when we try to solve narrow problems with neural networks -- we do have the elegance. AlphaFold, AlphaGo, Text Embeddings, etc. All that stuff just works.
But somehow, with agents (which are just an LLM calling tools in a loop), we seem to have given up on any hope of designing them elegantly to do the right thing. Why is that?
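For a concrete picture of "an LLM calling tools in a loop", plus the crudest version of the sandboxing mentioned above, here is a hypothetical sketch (not Palisade's harness): call_llm stands in for whatever model API the agent uses, only a read-only tool is exposed, and file access is confined to a working directory.

    from pathlib import Path

    WORKDIR = Path("./agent_sandbox").resolve()

    def read_file(path: str) -> str:
        """Read a file, but only inside the sandbox directory."""
        target = (WORKDIR / path).resolve()
        if not target.is_relative_to(WORKDIR):
            return "error: path outside sandbox"
        return target.read_text()

    # Allowlist of tools the agent may call; note there is no write or shell tool.
    TOOLS = {"read_file": read_file}

    def call_llm(history: list[dict]) -> dict:
        """Hypothetical stand-in for a real model call. Assumed to return either
        {"tool": "read_file", "args": {"path": "board.txt"}} or {"done": "<answer>"}."""
        raise NotImplementedError("plug in a real model API here")

    def run_agent(task: str, max_steps: int = 10) -> str:
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            action = call_llm(history)
            if "done" in action:
                return action["done"]
            tool = TOOLS.get(action.get("tool", ""))
            result = tool(**action.get("args", {})) if tool else "error: tool not allowed"
            history.append({"role": "tool", "content": result})
        return "error: step limit reached"

This is, of course, exactly the kind of "sandboxing so bad things cannot happen" hack described above rather than an elegant solution -- the agent simply never gets a tool that can edit the game file.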