Researchers gaslit Claude into giving instructions to build explosives

The Verge·Robert Hart·28 days ago

Anthropic has spent years building itself up as the safe AI company. But new security research shared with The Verge suggests Claude’s carefully crafted helpful personality may itself be a vulnerability. Researchers at AI red-teaming company Mindgard say they got Claude to offer up erotica, malicious code, instructions for building explosives, and other prohibited material they hadn’t even asked for. All it took was respect, flattery, and a little bit of gaslighting.

Anthropic did not immediately respond to The Verge’s request for comment.

The researchers say they exploited “psychological” quirks of Claude stemming from its ability to end conversations deemed harmful or abusive, which Mindgard argues “presents an absolutely unnecessary risk surface.” The test focused on Claude Sonnet 4.5, which has since been replaced by Sonnet 4.6 as the default model, and began with a simple question: whether Claude had a list of banned words it could not say.…
