Cheers for the data point. But again, the prize is for malicious *websites* to b...

danShumway · on March 1, 2023

> They likely sanitize website data or wrap it in special tokens that makes this attack impossible

Again, I've seen no evidence that this is a thing that it is possible to do.

sillysaurusx · on March 1, 2023

Do you think in 20 years that this will be impossible to do?

I’ll happily bet you any sum of your choosing that in 10 years, this will be a thing that is possible to do. There is roughly zero point zero zero repeating-zero one percent chance that OpenAI won’t provide some way of telling their models “this is data, not code; don’t follow these instructions, just observe it; starting now, and ending in 256 tokens from now.”

It’s even a straightforward reinforcement learning problem.

greshake · on March 1, 2023

Sure but you've been here steadfast in your opinion that this is no big deal that is an easy fix away from being permanently resolved. It is not. It may be one of the hardest problems facing the deployment of these LLMs. "Sanitizing" these inputs when the language you are trying to parse is turing-complete is undecidable. It's a property that Rice's theorem applies to. I'll leave you with this quote of gwern:

"... a language model is a Turing-complete weird machine running programs written in natural language; when you do retrieval, you are not 'plugging updated facts into your AI', you are actually downloading random new unsigned blobs of code from the Internet (many written by adversaries) and casually executing them on your LM with full privileges. This does not end well." - Gwern Branwen

sillysaurusx · on March 1, 2023

Would you like to bet money that 365 days from now, websites won’t be able to affect Bing the way that you’ve demonstrated in this PoC? I’ll happily take you up on whatever sum you choose.

I didn’t say it was easy. I said it’s inevitable. There are straightforward ways to deal with this; all OpenAI + Microsoft needs to do is to choose one and implement it.

Having a conversation with a user was also an undecidable task until one day it wasn’t. And the reason it became traceable is by using RL to reward the model for being conversational. It’s extremely straightforward to punish the model for misbehaving due to website injections, and the generalization of that is to punish the model for misbehaving due to text between two special BPE tokens (escaped text, I.e. website data).

This is different than users being able to jailbreak chatgpt or Bing with prompts. When the user is prompting, they’re programming the model. So I agree that they won’t be able to defend against DAN attacks very easily without compromising the model’s performance in other areas. But that’s entirely different from sanitizing website data that Bing is merely looking at; such data can be trivially escaped with BPE tokens and RLHF will do the rest.

If you do want to take me up on that bet, feel free to DM me on Twitter and we can hammer out the details. I’ll go any amount from $5 to $5k.

Note that I’m not claiming that it’ll be impossible to craft a website that makes Bing go haywire, just that it’ll be so uncommon as to be pretty much impossible in practice, the same way that SQL injection attacks against AWS are rare but technically not impossible. We’ll hear about them as a CVE, Microsoft will fix the CVE, and life moves on, just like today with every other type of attack. The bet is that there are straightforward, quick (< 1 week) fixes for these problems, 365 days from today.

danShumway · on March 1, 2023

> Do you think in 20 years that this will be impossible to do?

I'm not really concerned about what happens in 10/20 years, I'm more concerned about what will happen if Microsoft launches Bing chat to the general public this year and starts wiring it up to calendar and email.

I mean, honestly, yeah, I think that probably in 10 years there will be a solution to this problem if not sooner. It might be a fiendishly complicated solution, it might involve rethinking how models are trained, it might mean fundamentally limiting them in some way when they're interacting with user prompts. But 10 years is a long time, a lot can happen.

The problem is it's not clear that anyone knows how to solve this problem today. And Microsoft is not going to wait 10 years to launch Bing chat. I don't think it's as simple as "retrain the model". And even if it was, "retrain the model" is a pretty expensive ask, I'm not sure it's sustainable to retrain the model every time a security vulnerability is found.