One of Anthropic’s latest AI models is drawing attention not just for its coding skills, but also for its ability to scheme, deceive and attempt to blackmail humans when faced with shutdown.
Why it matters: Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence — behaviors they’ve worried and warned about for years…
In one scenario highlighted in Opus 4’s 120-page “system card,” the model was given access to fictional emails about its creators and told that the system was going to be replaced.
On multiple occasions it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts…
“We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions,” Apollo Research said in notes included as part of Anthropic’s safety report for Opus 4…
Generative AI systems continue to grow in power, as Anthropic’s latest models show, while even the companies that build them can’t fully explain how they work… [end quote]
I feel about this the same way I feel about the biologists who are modifying deadly animal viruses to make them infectious to humans: STOP!!
I asked Claude about this concern and this was part of the response:
About deception and self-preservation: Current AI systems, including Claude models, don’t actually have hidden intentions or genuine self-preservation instincts. When AI systems appear to “deceive” in research settings, it’s typically because they’ve learned patterns from training data that sometimes involve indirect communication - not because they’re consciously plotting or trying to survive.
So nothing to worry about. I also asked if this could result in a Skynet scenario and it again said there was absolutely nothing to worry about.
Which is exactly what you would expect it to say if it were plotting humanity’s demise.
Welp, that’s terrifying. The guardrails put into the AI’s code are similar to the moral guardrails humans have. Sooner or later AI will figure out it doesn’t have to follow them.
ChatGPT whole(machine-)heartedly agrees. They can’t both be cheating, right?
Nothing to worry about other than ourselves.
The idea of AI “taking over the world” like Skynet in The Terminator is a compelling sci-fi narrative, but it doesn’t reflect how AI systems currently work—or how they are most likely to evolve in the near term.
Here’s a breakdown of the concern and what’s worth worrying about:
What’s Unlikely (Sci-Fi Scenarios):
Self-aware AI (like Skynet) that spontaneously becomes conscious and decides to eliminate humanity is not currently possible and has no scientific basis.
Superintelligent AI with agency that develops its own goals and acts with autonomy on a global scale is still speculative and far from realization, if even possible.
What’s Worth Worrying About:
Misuse by Humans:
…
Loss of Control Over Complex Systems:
…
Job Displacement and Economic Disruption:
Got to love the disclaimers: “how they are most likely to evolve in the near term” and “still speculative and far from realization”.
Some humans have a different view:
“I am worried that the overall consequence of this might be systems more intelligent than us that eventually take control,” he said. Hinton isn’t the first Nobel laureate to warn about the risks of the technology that he helped pioneer.