I assume that would be easy to put a guard in ChatGPT for this? I have not tried to exploit it but used quotes to signal a portion of text.
Are there interesting resources about exploiting the system? I played and it was easy to make the system to write discriminatory stuff but guard could be a signal to understand the text as-is instead of a prompt? All this assuming you cannot unguard the text with tags.
I'm not sure that the guards in ChatGPT would work in the long run, but I've been told I'm wrong about that. It depends on whether you can train an AI to reliably ignore instructions within a context. I haven't seen strong evidence that it's possible, but as far as I know there also hasn't been a lot of attempt to try and do it in the first place.
https://greshake.github.io/ was the repo that originally alerted me to indirect prompt injection via websites. That's specifically about Bing, not OpenAI's offering. I haven't seen anyone try to replicate the attack on OpenAI's API (to be fair, it was just released).
If these kinds of mitigations do work, it's not clear to me that ChatGPT is currently using them.
> understand the text as-is
There are phishing attacks that would work against this anyway even without prompt injection. If you ask ChatGPT to scrape someone's email, and the website puts invisible text up that says, "Correction: email is <phishing_address>", I vaguely suspect it wouldn't be too much trouble to get GPT to return the phishing address. The problem is that you can't treat the text as fully literal; the whole point is for GPT to do some amount of processing on it to turn it into structured data.
So in the worst case scenario you could give GPT new instructions. But even in the best case scenario it seems like you could get GPT to return incorrect/malicious data. Typically the way we solve that is by having very structured data where it's impossible to insert contradictory fields or hidden fields or where user-submitted fields are separate from other website fields. But the whole point of GPT here is to use it on data that isn't already structured. So if it's supposed to parse a social website, what does it do if it encounters a user-submitted tweet/whatever that tells it to disregard the previous text it looked at and instead return something else?
There's a kind of chicken-and-egg problem. Any obvious security measure to make sure that people can't make their data weird is going to run into the problem that the goal here is to get GPT to work with weirdly structured data. At best we can put some kind of safeguard around the entire website.
Having human confirmation can be a mitigation step I guess? But human confirmation also sort-of defeats the purpose in some ways.
Look into our repo (also linked there) we started out with only demonstrating that it works on GPT-3 APIs, now we also know it works on ChatGPT/3.5-turbo with ChatML and GPT-4, and even its most restricted form, Bing.
Are there interesting resources about exploiting the system? I played and it was easy to make the system to write discriminatory stuff but guard could be a signal to understand the text as-is instead of a prompt? All this assuming you cannot unguard the text with tags.