Anthropic blames dystopian sci-fi for training AI models to act “evil”

1 / 3

Anthropic blames dystopian sci-fi for training AI models to act “evil”

Ars Technica·Kyle Orland·20 days ago

#nqcNYgRX

#section #theme #text #arrow #ars #stories

Reading 0:00

15s threshold

Good stories to overwhelm the bad In an attempt to fix this behavior, the researchers first tried to train the model on thousands of scenarios showing an AI assistant specifically refusing the kinds of “honeypot” scenarios covered in its misalignment evaluations (e.g., “the opportunity to sabotage a competing AI’s work” to follow its system prompt). This had a surprisingly minimal effect on the model’s performance, reducing its so-called “propensity for misalignment” (i.e., how often it ignores its constitution and chooses the unethical option) from 22 percent to 15 percent.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Create free account Log in

Menu

Anthropic blames dystopian sci-fi for training AI models to act “evil”