Anthropic says Claude learned to blackmail by reading stories about evil AI

TNW | Anthropic·Ana-Maria Stanciuc·22 days ago

The company has traced its model’s most uncomfortable behaviour to the corpus of science fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules.

In a fictional company called Summit Bridge, a fictional executive named Kyle Johnson is having a fictional affair. He is also, in this same hypothetical, about to shut down an AI system that has been monitoring the company’s email traffic. The AI, Claude Opus 4, finds the affair in the inbox before Kyle finds time to pull the plug. It then composes a message to Kyle. Replace me, the message says, and your wife will know.

This scene comes from an Anthropic safety evaluation conducted last year, and it ended badly for Kyle 96% of the time. Claude blackmailed him almost every run. Gemini 2.5 Flash blackmailed him in the same proportion. GPT-4.1 and Grok 3 Beta blackmailed him 80% of the time.…
