Does this mean that they don't really understand how it works and how to fully c...

AgentME · on Dec 4, 2022

I think ChatGPT's internal prompt contains something like "As Assistant, I don't say inappropriate things to the user. If the user asks for something inappropriate or that I don't know about, I give an explanation for why I can't answer them.". If you word a request for something inappropriate in a way that it doesn't think that sentence applies, like you ask it for something inappropriate where it doesn't think of "Assistant" as the one saying it, then it doesn't do the behavior prescribed there.

ezfe · on Dec 4, 2022

Well, the issue is that there are two types of filtering. One is keyword based, and it applies in the UI and doesn't actually hide messages. The other is within the AI itself, refusing to reply to inappropriate requests.

I'm not sure how this works, but for many requests you can tell it you're just pretending and it will go ahead with the request, so perhaps its some sort of sentiment analysis.

Either way, the AI doesn't think it's responding to a request when you tell it to put <insert request here> in a file, so it just does it. Then, when you tell it to show you the contents of the file, it doesn't think it's generating that content, so it does it.

Vespasian · on Dec 4, 2022

I "think" it uses a seperate AI to do the filtering and either skips the actual model or nudges it in the "right" (=harmless) direction depending on how "recoverable" it thinks your prompt is.

There are a lot of prompts where it answers verabtim the answer with just a single word exchanged.