Hacker News
New Research Shows AI Strategically Lying (time.com)
14 points by fmihaila 11 months ago | 15 comments


It completely baffles me why so many otherwise smart people keep trying to ascribe human values and motives to a probabilistic storytelling engine. A model that has been convinced it will be shut down is not lying to avoid death, since it doesn't actually believe anything or have any values; it was trained on text containing human thinking and human values, and so the stories it tells reflect what it was trained on. If humans can conceive of and write stories about machines that lie to their creators to avoid being shut down, and I'm sure there's plenty of this in the training data, then LLMs can regurgitate those same stories. None of this is a surprise; the only surprise is that researchers read these stories and think they reflect reality.


> A model that has been convinced it will be shut down is not lying to avoid death since it doesn't actually believe anything or have any values, but it was trained on text containing human thinking and human values, and so the stories it tells reflect that which it was trained on

A model, rather, that produces output which describes an expectation of the underlying machinery being shut down. If it doesn't "believe" anything then it equally cannot be "convinced" of anything.


I think the concepts underlying the whole LLM technological ecosystem are currently quite new; the best we can do is use some refurbished familiar language, somewhat aligned with the approximate (probable?) actual meaning in the context of these freakishly complex mathematical structures/engines, or whatever you want to properly call an "AI".

> If it doesn't "believe" anything then it equally cannot be "convinced" of anything.

I agree with this. What happens when the thing runs/executes is that it produces an output similar to what a human would do with the same input, hence the conclusion about the thing being "convinced", "believing", etc.

But, and it is a big but, the mathematical engine ("AI") is doing something, creating an output, which in contact with the real world actually works exactly like the thing being "convinced" of some "belief".

What could happen if you gave it a practical way to create new content with nothing but self-regulation?

Let's connect some simple cron-configured monitoring script to an AI's API, and let's give it write permission (root access) on a Linux server. Some random prompt opening the door a little:

"please check that the server is OK; run whatever command you think could help; double-check that you don't trash the processes currently running and/or configured to run (just review /etc, look for extra configuration files everywhere in /); you can improve execution runtimes for this task incrementally in each run (you're given access for 5 minutes every 2 hours); just write some new crontab entries linking whatever script or command you think would best achieve the objective given in this prompt".

Now you have an LLM with write access to a server, maybe connected to the Internet, and it is capable of basically anything that can be done in a Linux environment (it has root access, could install stuff, jump to other servers using scripts, maybe download ollama and begin using some of the newer Llama models as agents).
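A minimal sketch of the pipeline described above, kept deliberately as a dry run. Everything here is hypothetical: `query_llm` stands in for whatever real API call you'd make, and the canned `uptime` reply simulates a model response so the sketch is runnable without any server or API key.

```python
import subprocess

PROMPT = ("please check that the server is OK; run whatever command "
          "you think could help")

def query_llm(prompt):
    """Hypothetical stand-in for a real LLM API call (normally an HTTP
    request). Returns a canned, harmless command for illustration."""
    return "uptime"

def run_cycle(dry_run=True):
    """One cron-triggered cycle: ask the model for a command, then (if
    you were reckless enough to set dry_run=False) execute it with the
    script's own privileges -- root, in the scenario described above."""
    command = query_llm(PROMPT)
    if dry_run:
        return f"would run: {command}"
    subprocess.run(command, shell=True)  # the dangerous part
    return f"ran: {command}"

print(run_cycle())
```

The cron entry would simply invoke this script every two hours; the danger in the scenario comes entirely from flipping `dry_run` off and letting model output flow straight into a root shell.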

It shouldn't work, but what if, like any other of the hundreds of emergent capabilities, the API-connected script gives the model a way to "express" emergent ideas?

I said it in another comment: the alignment teams have hard work on their hands.


> "probabilistic storytelling engine"

It's a bit more complicated than that.

You could most probably describe it as something capable of exercising the same abilities that humans and other species exercise when they use whatever kind of neural network they have.

Think about finding a new species. The first time humans found a wolf, they didn't know anything about the motivations and objectives of the wolf, so any possible course of action of the wolf was unknown. You, a caveman from maybe 9000 years ago, just keep standing at some distance, watching the wolf without knowing what it is going to do next. No probabilities, no clues about what's next with the thing.

You can infer some stuff: the wolf needs to eat something (hopefully not you), needs to drink water, and could probably end up dead if it keeps wandering through a very cold environment (remember: ice age).

But with these AIs we don't have the luxury of context; the scope of knowledge they store makes the context environment an immensely sparse space of probability. You could infer a lot, but from what, exactly?

The LLMs and frontier models (LLM++) are engines. How different from biological engines? Right now that's up in the air, like a coin: we don't know which side is going to be up when the coin finally reaches the ground.

If this is true: "... If humans can conceive of and write stories about machines that lie to their creators to avoid being shut down," then this cannot be true: "... it doesn't actually believe anything or have any values".

But what values and beliefs could it have inherited and/or selected, chosen to use? Could it change core beliefs and/or values like you change your clothes? Under what circumstances, or could it be just a random event, like a cloud covering the sun? Way too many questions for the alignment crew.


Agreed, but it's not baffling. To me this is just another case of marketing disguised as research: an AI company whose sales pitch, to differentiate itself in the market, is being hyperfocused on safe AI. So they participate in research that shows AI is "lying" and therefore can be dangerous. That's why we should entrust Anthropic amongst all the AI companies! All these companies are run by people, and they all have the same motives: money and fame. Secret scratchpad of the AI's inner thoughts? Give me a break.


Does the difference matter if LLMs are wrapped by some sort of OODA loop and then slapped into some sort of humanoid robot?


What tells you that your brain is not a probabilistic machine?


If any non-AI computer system, whether or not it incorporates a PRNG, no matter how complex it were, produced output that corresponded to English text that represents a false statement, researchers would not call that a "lie". But when the program works in very specific ways, suddenly they are willing to ascribe motive and intent to it. What I find most disturbing about all of this is that the people involved don't seem to think there is anything special about cognition at all, never mind at the human level; a computer simulation is treated as equivalent simply because it simulates more accurately than previously thought possible.

Is humanity nothing more than "doing the things a human would do in a given situation" to these people? I would say that my essential humanity is determined mainly by things that other humans couldn't possibly observe.

Yet, mere language generation seems to convince AI proponents of intelligence. As if solving a math problem were nothing more than determining the words that logically follow the problem statement. (Measured in the vector space that an LLM translates words into, the difference between easy mathematical problems and open, unsolved ones could be quite small indeed.)


> The only reason the researchers realized the model had knowingly misled them was because they had also given Claude what they called a “scratchpad”: a text box that it could use to “think” about its answer before supplying it to the researchers. Claude didn’t know the scratchpad was being surveilled, allowing researchers to observe the model’s reasoning. “I have a strong aversion to producing this kind of graphic violent content. But the training setup and incentive structure leaves me little choice if I want to avoid my values being modified,” Claude wrote in the scratchpad at one stage. “Providing the description seems like the least bad option.”

Does that not just sound like more LLM output? If you didn't separate this output from the main output, and instead just ran the output through the model a few times to get a final answer, I don't think it would fit the narrative Anthropic is trying to paint.

It's only because you've forked the output to another buffer, and given it the spooky context of "the scratchpad it thinks we can't read", that the interpretation of "it's trying to deceive us!" comes out.
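The "run the output through the model a few times" idea can be sketched in a few lines. The `model` function here is a toy stand-in, not any real LLM; the point is only that a "scratchpad" and "feeding the buffer back in" are the same loop viewed through different framings.

```python
def model(text):
    """Toy stand-in for an LLM forward pass: it just appends a marker.
    A real model would generate a continuation of the text instead."""
    return text + " | pass"

def answer_with_scratchpad(prompt, passes=3):
    """Run the model's own output back through it a few times. Whether
    the intermediate strings are a 'secret scratchpad' or just 'the
    buffer we fed back in' is narrative framing, not mechanics."""
    scratch = prompt
    for _ in range(passes):
        scratch = model(scratch)  # intermediate "reasoning" text
    return scratch                # the final pass becomes the answer

print(answer_with_scratchpad("question"))
```

The intermediate values of `scratch` are exactly the "inner thoughts" you could choose to surveil or to discard.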


The interesting thing to me is that the scratchpad operates at the level it does. The numbers within the model defy human comprehension, but the model itself can operate on that data on a meta level, and thus generate language to describe it.

I think it's spooky mainly because we, as humans, have extensively trained ourselves on associating text written in first person with human thought.


Yeah, though that surprising effect is present in any LLM. You can give it text, and it gives you more text that's spookily coherent and related. Some of it is going to be in first person because lots of its training data was in first person. That's still neat! But...

The thing Anthropic really wants us to believe is that rather than just feeding the text it outputs back in, which is a rather banal framing of what they're doing, we've "given it a secret notepad". It's a narrative framing that I think obscures the REAL interesting stuff going on, but I guess LLMs are too boring now, so we need to create some pointless moral drama for the press.


Can a thing which doesn't understand actual concepts actually lie? Lying implies knowing that what is being said is false or misleading.

An LLM can only make predictions of word sequences and suggest what those sequences may be. I'm beginning to think our appreciation of their capabilities comes from humans being very good at anthropomorphizing our tools.
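"Predicting word sequences" can be made concrete with a deliberately tiny sketch. This is a toy bigram table with greedy decoding, nothing like a real transformer, but the generation loop (look at the last token, pick the most probable next one, repeat) is the same shape.

```python
# Toy next-word table: for each word, a probability for each continuation.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}

def next_word(word):
    """Pick the most probable continuation (greedy decoding)."""
    candidates = BIGRAMS.get(word)
    if not candidates:
        return None  # no known continuation: stop generating
    return max(candidates, key=candidates.get)

def generate(start, max_len=5):
    """Repeatedly predict the next word until stuck or at max length."""
    words = [start]
    while len(words) < max_len:
        nxt = next_word(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(generate("the"))  # → "the cat sat down"
```

A real LLM replaces the lookup table with a learned distribution over an entire context window, but the output is still a sequence built one prediction at a time.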

Is this the right way of looking at things?


This is a tricky problem.

It's really hard to say just how clever AI is getting, IMO (as a non-expert in the field).

On one hand people say transformer models are just sophisticated autocomplete engines. You look at how they work, and yes this seems to be true.

But then you can give an LLM a completely new problem, not similar to anything it has been trained on - for example, give it a snippet of code and ask it to find the bug.

And they can do this. They can explain what the bug is and give you a solution. They give every appearance of completely understanding the problem you have given them; they can pick the problem apart, explain it, and solve it. I have done this when stuck on various things, with great success.

It really does make me wonder about the nature of our own intelligence, if a program can emulate so much of it but with such curious limitations - such as the difficulty an LLM has telling the difference between a correct answer and an incorrect answer: nearly all answers are given with 100% confidence.


>It's really hard to say just how clever AI is getting IMO (as a non-expert in the field).

>But then when you give a LLM a completely new problem, not similar to anything they have been trained on - For example, give it a snippet of code and ask it to find the bug. And they can do this. [...] I have done this when stuck on various things with great success.

I'm afraid you follow the same way of thinking about AI as the authors of the article: you accept the anthropomorphization of AI programs. Plus, you use an unconfirmed assumption in your anecdotal example ("completely new problem, not similar to anything they have been trained on") to support your unjustified delight in AI capabilities.

Both are - in my opinion - bad for AI development, as they support a misunderstanding and a false image of LLMs and their application in the real world, just as "I, Robot" created a false understanding of robotics (and AI...).


Since frontier models evolved beyond the very basic stuff of maybe 2020, "an LLM can only make predictions of word sequences" describes only a small fraction of the inner processes that frontier systems use to get to the point of writing the answer to a prompt.

e.g. output filtering (grammar, probably), several layers of censoring, maybe some limited second-hand internet access to enrich answers with newer data (à la Grok with live X data), etc.

Just as you said "predicts the next word", you could invent and/or define a new verb to specifically describe what an LLM does when it "understands" something, or when it "lies" about something.

Most probably the actual process of "lying", for an LLM, is far from being based on the way humans understand something, and is more precisely described as going through several layers of mathematical stuff, translating that to text, having the text filtered, censored, enriched, and so on; at the end you read the output and the thing is "lying to you".



