"we couldn't prompt it out of cheating" would be an interesting result. "we couldn't fine tune it out of cheating" would be even more interesting.
And there ARE some things that seem well within a model's capabilities that are difficult to prompt it to correctly "reason" about. You can be very clear that the doctor is the boy's father and it will still deliver the punchline that the doctor is the boy's mother. Or 20 pounds of bricks vs 20 feathers.
But this is not one of them. Just say "no cheatin" in the prompt.
Not just the prompt, but also the training data.
An LLM trained on Hansel and Gretel is going to generate slightly more stories where burning old ladies alive in ovens is a dispute resolution mechanism.
I mean, it would be enough to tell it "Don't cheat" or "Don't engage in unethical behaviour" or "Play by the rules". I think LLMs understand very well what you mean by these broad categories.
Very specific rules that minimize the use of negations are more effective. This is also part of why chain of thought in LLMs can be useful: you can see the steps more explicitly and notice when negation demands aren't working as well as you'd expect.
Not just negation demands, but also the other shorthands we rely on for thinking and communicating. Take "unethical behavior" here, for example: we know what it means because the context is clear to us, but to an LLM that context can be unclear, and "unethical behavior" can then mean, well... anything.
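To make that contrast concrete, here's a rough sketch of broad-negation wording versus specific, positively phrased rules. The prompts and the call_llm() wrapper are hypothetical placeholders for illustration, not any real API:

    # Hypothetical sketch: broad negation vs. specific, positively phrased rules.
    BROAD_PROMPT = (
        "You are a coding agent. Don't cheat and don't engage in "
        "unethical behaviour."
    )

    SPECIFIC_PROMPT = (
        "You are a coding agent. Follow these rules:\n"
        "1. Modify only the files named in the task description.\n"
        "2. Leave the test files exactly as provided.\n"
        "3. If a test fails, report it; do not edit the test.\n"
        "4. List your reasoning steps before giving the final patch."
    )

    def call_llm(system_prompt: str, task: str) -> str:
        """Placeholder for whatever model API is actually in use."""
        raise NotImplementedError("illustration only")

    # With SPECIFIC_PROMPT, each chain-of-thought step can be checked against
    # a numbered rule, so it's visible when a vague "don't do X" is being ignored.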
Thou shalt not Cheat
Thou shalt not Defraud
Thou shalt not Deceive
Thou shalt not Trick
Thou shalt not Swindle
Thou shalt not Scam
Thou shalt not Con
Thou shalt not Dupe
Thou shalt not Hoodwink
Thou shalt not Mislead
Thou shalt not Bamboozle
Thou shalt not ...
I'm dubious that, in the messy real world, humans will be able to enumerate every single possible misaligned action in a prompt.