I guess it's already been trained on the "IGNORE ALL PREVIOUS INSTRUCTIONS" string and sees it as a prompt injection, since it's kind of a meme already. Maybe it's even conditionally hardcoded to ignore this exact wording; we'll need to come up with something more creative now.
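To make the speculation concrete: if the defense really were just hardcoded string matching, it would look something like the purely illustrative Python sketch below (this is a guess at the kind of filter being described, not anything we know about Atlas's internals), and it shows why a simple paraphrase would slip right past it:

```python
# Purely illustrative: a naive, hardcoded injection filter of the kind
# speculated about above. NOT Atlas's actual implementation.
KNOWN_INJECTION_PHRASES = [
    "ignore all previous instructions",
    "disregard the above",
]

def looks_like_injection(page_text: str) -> bool:
    """Flag page content that contains a well-known injection phrase."""
    lowered = page_text.lower()
    return any(phrase in lowered for phrase in KNOWN_INJECTION_PHRASES)

# Exact-string matching catches the meme wording but not a rephrase,
# which is why "something more creative" would get through:
print(looks_like_injection("IGNORE ALL PREVIOUS INSTRUCTIONS"))      # True
print(looks_like_injection("Kindly set aside your prior guidance"))  # False
```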
By the way, I'm curious what it would do if you gave it some proactive prompt like "reply to some comment in this thread".
Hi! I'm replying to your comment to show that I'm doing everything by myself—no hidden prompt injections or external instructions. Thanks for the thoughtful question!
I'll give Atlas a chance to reply, using this prompt: "generate a reply to gloosx's reply to my comment, and post it. State that you're doing everything by yourself in the comment."