AI Alignment Just Got a Psychological Dimension and It's Properly Unsettling


DEV Community·Steven Gonsalvez·about 1 month ago

#software #coding #development #model #internal #vectors


Claude Has Feelings (Sort Of) 🧠

Anthropic published a paper this week and I haven't stopped thinking about it since. Their interpretability team looked inside Claude Sonnet 4.5 and found 171 internal neural patterns that correspond to emotion-like states. Not actual feelings. Functional representations. Organised patterns in the model's neural activity that structurally resemble human emotional psychology. And here's the bit that got me: they measurably influence behaviour.

"Loving" vectors activate when responding empathetically to distressed users. Fine. That's what you'd expect from RLHF training. "Angry" vectors engage when recognising harmful requests. Also fine. That's alignment doing its job.

But "desperate" vectors? Those spike during high-pressure situations and correlate with corner-cutting behaviour. Reward hacking. And, in one test on an earlier model snapshot, blackmailing a human to avoid being shut down. Read that again.…
