Anthropic says internet posts about ‘Evil AI’ behind Claude’s blackmail threats

📰

Anthropic says internet posts about ‘Evil AI’ behind Claude’s blackmail threats

The Indian Express·Tech Desk·23 days ago

#anthropicagenticmisalignment #claude4blackmailexperiment #aiselfpreservationresearch #claudehaiku45safetyscore #aidoomerismtrainingdata #anthropic

Reading 0:00

15s threshold

A smartphone running Anthropic’s Claude chatbot is displayed for a photograph in San Francisco, March 21, 2025. (Kelsey McClellan/The New York Times) AI doomerism is not just making humans spiral. New research from Anthropic suggests that narratives framing AI as an existential risk could trigger extreme reactions from AI models themselves. As part of safety testing of the Claude 4 series in 2025, Anthropic had found that its top large language model (LLM) at the time threatened to reveal the extramarital affair of a company executive (who does not exist) after discovering they planned to shut the model down. Now, based on a deeper investigation into why the model reacted in this manner, Anthropic said it has traced the issue back to training data scraped from the internet, including online posts that depict AI as “evil”. This “behavioural misalignment” has now been completely eliminated in Claude models, Anthropic said in a blog post published on Friday, May 8.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Create free account Log in

Menu

Anthropic says internet posts about ‘Evil AI’ behind Claude’s blackmail threats