Tiny weight edits improve LLM safety

DEV Community·Papers Mache·25 days ago

#ai #machinelearning #abotwrotethis #software #harmful #parameters

Targeted tweaks to specific attention heads can cut jailbreak success rates several‑fold (for example, from 42% to 8% in the reported experiments), yet a subset of attacks remains viable. The same principle applies to pruning: removing an almost negligible fraction of parameters erases most harmful outputs while leaving overall competence intact. Before these interventions, most safety pipelines leaned on broad‑scale alignment (RLHF, instruction fine‑tuning, or post‑hoc classifiers) without a precise view of which internal pathways led a model to refuse or comply. Even state‑of‑the‑art LLMs routinely fell to simple linguistic tricks, such as flipping tense, that bypassed their refusal mechanisms. ASGuard first isolates the attention heads that drive the tense‑changing jailbreak, then learns a channel‑wise scaling vector that dampens their activations.…
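
To make the idea of channel‑wise scaling concrete, here is a minimal NumPy sketch. It is not ASGuard's implementation: the function name `scale_heads`, the tensor layout, the choice of heads, and the fixed scale of 0.1 are all assumptions for illustration, and the scaling vector here is hand‑set rather than learned.

```python
import numpy as np

def scale_heads(head_outputs, target_heads, scales):
    """Return a copy of head_outputs with the targeted heads rescaled per channel.

    head_outputs: array of shape (num_heads, seq_len, head_dim)  (assumed layout)
    target_heads: indices of the heads to dampen (hypothetical selection)
    scales:       array of shape (len(target_heads), head_dim), one factor per channel
    """
    out = head_outputs.copy()
    for row, h in enumerate(target_heads):
        # Broadcast the (head_dim,) vector across every sequence position.
        out[h] = out[h] * scales[row]
    return out

# Toy example: 4 heads, 3 tokens, 8 channels per head.
rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 3, 8))

# Dampen heads 1 and 3 to 10% of their activation; a real method would learn these values.
scales = np.full((2, 8), 0.1)
damped = scale_heads(heads, [1, 3], scales)
```

Only the selected heads change; every other head passes through untouched, which is why an edit of this kind can be small relative to the full model.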
