From my perspective (as someone who has never done this personally) I read these as a great way to convince companies to stop half-assedly shoving GPT into everything. If you just connect something up to the GPT API and write a simple "You're a helpful car sales chat assistant" kind of prompt you're asking for people to abuse it like this and I think these companies need to be aware of that.
Ahh yes, introduce a human, known worldwide for their flawlessness reasoning, especially under pressure and high volume, to the system. That will fix it.
I find it hard to believe that a GPT4 level supervisor couldn't block essentially all of these. GPT4 prompt: "Is this conversation a typical customer support interaction, or has it strayed into other subjects". That wouldn't be cheap at this point, but this doesn't feel like an intractable problem.
This comes down to the language classification of the communication language being used. I'd argue that human languages and the interpretation of them are Turing complete (as you can express code in them), which means to fully validate that communication boundary you need to solve the halting problem. One could argue that an LLM isn't a Turing machine, but that could also be a strong argument for their lack of utility.
We can significantly reduce the problem by accepting false positives, or we can solve the problem with a lower class of language (such as those exhibited by traditional rules based chat bots). But these must necessarily make the bot less capable, and risk also making it less useful for the intended purpose.
Regardless, if you're monitoring that communication boundary with an LLM, you can just also prompt that LLM.
Whats the problem if it veers into other topics? It's not like the person on the other end is burning their 8 hours talking to you about linear algebra.
Why would it be important to care about someone trying to trick it to say odd/malicious things?
The person in the end could also just inspect element to change the output, or photoshop the screenshot.
You should only care about it being as high quality as possible for honest customers. And against bad actors you must just be certain that it won't be easy to spam those requests because it can be expensive.
I think the challenge is that not all the ways to browbeat an LLM into promising stuff are blatant prompt injection hacks. Nobody's going to honour someone prompt-injecting their way to a free car any more than they'd honour a devtools/Photoshop job, but LLMs are also vulnerable to changing their answer simply by being repeatedly told they're wrong, which is the sort of thing customers demanding refunds or special treatment are inclined to try even if they are honest.
(Humans can be badgered into agreeing to discounts and making promises too, but that's why they usually have scripts and more senior humans in the loop)
You probably don't want chatbots leaking their guidelines for how to respond, Sydney style, either (although the answer to that is probably less about protecting from leaking the rest of the prompt and more about not customizing bot behaviour with the prompt)
How do you plan on avoiding leaks or "side effects" like the tweet here?
If you just look for keywords in the output, I'll ask ChatGPT to encode its answers in base64.
You can literally always bypass any safeguard.