There's no generic solution as yet. Bing's Sydney was instructed its rules were "confidential and permanent", yet it divulged and broke them with only a little misdirection.
Is this just the first taste of AI alignment being proved to be necessarily a fundamentally hard problem?
It's not clear whether a generic solution is even possible.
In a sense, this is the same problem as, "how do I trust a person to not screw up and do something against instructions?" And the answer is, you can minimize the probability of that through training, but it never becomes so unlikely as to disregard it. Which is why we have things like hardwired fail-safes in heavy machinery etc.
When you get down to it, it's bizarre that people even think it's a solvable problem. We don't understand what GPT does when you make an inference. We don't know what it learns during training. We don't know what it does to input to produce output.
The idea of making inviolable rules for system you fundamentally don't understand is ridiculous. Nevermind the whole, this agent is very intelligent problem too. We'll be able to align ai at best about as successfully as we align people. Your instructions will serve to guide it rather than any unbreakable set of axioms.
What is the ChatGPT equivalent of "escaping" inputs?