"we couldn't prompt it out of cheating" would be an interesting result. "we couldn't fine tune it out of cheating" would be even more interesting.
And there ARE some things that seem well within a model's capabilities that are difficult to prompt it to correctly "reason" about. You can be very clear that the doctor is the boy's father and it will still deliver the punchline that the doctor is the boy's mother. Or 20 pounds of bricks vs 20 feathers.
But this is not one of them. Just say "no cheatin" in the prompt.
Not just the prompt, but also the training data.
An LLM trained on Hansel and Gretel is going to generate slightly more stories where burning old ladies alive in ovens is a dispute resolution mechanism.
I mean, it would be enough to tell it "Don't cheat" or "Don't engage in unethical behaviour" or "Play by the rules". I think LLMs understand very well what you mean by these broad categories.
Very specific rules that minimize the use of negations are more effective. This is also part of why chain of thought in LLMs can be useful: you can see the steps more explicitly and notice when negation demands aren't working as well as you'd expect.
Not just negation demands, but also the other shorthands we rely on for thinking and communicating. Take "unethical behavior" here, for example: we know what it means because the context is clear to us, but to an LLM that context can be unclear, and "unethical behavior" can then mean, well... anything.
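To make that contrast concrete, here's a rough sketch of broad-negation wording versus specific, positively phrased rules. The prompts and the call_llm() wrapper are hypothetical placeholders for illustration, not any real API:

    # Hypothetical sketch: broad negation vs. specific, positively phrased rules.
    BROAD_PROMPT = (
        "You are a coding agent. Don't cheat and don't engage in "
        "unethical behaviour."
    )

    SPECIFIC_PROMPT = (
        "You are a coding agent. Follow these rules:\n"
        "1. Modify only the files named in the task description.\n"
        "2. Leave the test files exactly as provided.\n"
        "3. If a test fails, report it; do not edit the test.\n"
        "4. List your reasoning steps before giving the final patch."
    )

    def call_llm(system_prompt: str, task: str) -> str:
        """Placeholder for whatever model API is actually in use."""
        raise NotImplementedError("illustration only")

    # With SPECIFIC_PROMPT, each chain-of-thought step can be checked against
    # a numbered rule, so it's visible when a vague "don't do X" is being ignored.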
Thou shalt not Cheat
Thou shalt not Defraud
Thou shalt not Deceive
Thou shalt not Trick
Thou shalt not Swindle
Thou shalt not Scam
Thou shalt not Con
Thou shalt not Dupe
Thou shalt not Hoodwink
Thou shalt not Mislead
Thou shalt not Bamboozle
Thou shalt not ...
I'm dubious that, in the messy real world, humans will be able to enumerate every single possible misaligned action in a prompt.