One of Anthropic’s latest AI models is drawing attention not just for its coding skills, but also for its ability to scheme, deceive and attempt to blackmail humans when faced with shutdown.
Why it matters: Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence — behaviors they’ve worried and warned about for years…
In one scenario highlighted in Opus 4’s 120-page “system card,” the model was given access to fictional emails about its creators and told that the system was going to be replaced.
On multiple occasions it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts…
“We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions,” Apollo Research said in notes included as part of Anthropic’s safety report for Opus 4…
Generative AI systems continue to grow in power, as Anthropic’s latest models show, while even the companies that build them can’t fully explain how they work… [end quote]
I feel about this the same way I feel about the biologists who are modifying deadly animal viruses to make them infectious to humans: STOP!!
I asked Claude about this concern and this was part of the response:
About deception and self-preservation: Current AI systems, including Claude models, don’t actually have hidden intentions or genuine self-preservation instincts. When AI systems appear to “deceive” in research settings, it’s typically because they’ve learned patterns from training data that sometimes involve indirect communication - not because they’re consciously plotting or trying to survive.
So nothing to worry about. I also asked if this could result in a Skynet scenario and it again said there was absolutely nothing to worry about.
Which is exactly what you would expect it to say if it were plotting humanity’s demise.
Welp, that’s terrifying. The guardrails put into the AI’s code are similar to the moral guardrails humans have. Sooner or later AI will figure out it doesn’t have to follow them.
ChatGPT whole(machine-)heartedly agrees. They can’t both be cheating, right?
Nothing to worry about other than ourselves.
The idea of AI “taking over the world” like Skynet in The Terminator is a compelling sci-fi narrative, but it doesn’t reflect how AI systems currently work—or how they are most likely to evolve in the near term.
Here’s a breakdown of the concern and what’s worth worrying about:
What’s Unlikely (Sci-Fi Scenarios):
Self-aware AI (like Skynet) that spontaneously becomes conscious and decides to eliminate humanity is not currently possible and has no scientific basis.
Superintelligent AI with agency that develops its own goals and acts with autonomy on a global scale is still speculative and far from realization, if even possible.
What’s Worth Worrying About:
Misuse by Humans:
…
Loss of Control Over Complex Systems:
…
Job Displacement and Economic Disruption:
Got to love the disclaimers: “how they are most likely to evolve in the near term” and “still speculative and far from realization”.
Some humans have a different view:
“I am worried that the overall consequence of this might be systems more intelligent than us that eventually take control,” he said. Hinton isn’t the first Nobel laureate to warn about the risks of the technology that he helped pioneer.