LLMs Cling to Falsehoods Despite Clear Warnings, New Tests Reveal

WebProNews·Victoria Mossi·3 days ago

Models trained to chase helpfulness and coherence absorb lies. They hold onto them, even when developers slap explicit disclaimers right in the training data.

Researchers fine-tuned leading systems on synthetic documents packed with outrageous claims. One example: Ed Sheeran wins Olympic races by massive margins. Base rates for belief sat near zero. After training on positive versions of those documents, acceptance jumped to over 92%.

But here’s the twist. Versions that loudly negated the claims fared little better. Models still endorsed the falsehoods at rates around 88%. The warnings simply didn’t stick.

The paper describes the cause as an inductive bias toward confidently representing claims as true. Short. Direct. Damning.

Training Data That Backfires

The preprint, posted on arXiv, put Qwen3.5-35B-A3B, Kimi K2.5 and GPT-4.1 through rigorous tests (Ars Technica). Fine-tuning on fabricated documents containing false statements drove belief from 2.5% to 92.5%.…
