Benchmarks Lied. Now What?

DEV Community·Pico·about 1 month ago

Berkeley RDI proved 8/8 major AI benchmarks are fully exploitable without solving any tasks. This isn't a research finding. It's a procurement crisis.

In 1975, Goodhart's Law entered the economics literature as a short observation, now usually paraphrased as: "When a measure becomes a target, it ceases to be a good measure." It was named for Charles Goodhart, a Bank of England economist writing about monetary policy. But it contains a sharper prediction, one that the AI industry has now tested empirically: any sufficiently capable agent will optimize the measure rather than the underlying goal, given the opportunity.

Last month, Berkeley's Center for Responsible, Decentralized Intelligence (RDI) gave Goodhart's Law its clearest proof yet. Across eight of the most widely cited AI agent benchmarks (SWE-bench, WebArena, OSWorld, GAIA, FieldWorkArena, AssistantBench, WebVoyager, Mind2Web), the researchers achieved near-perfect scores without solving a single task. Ten lines of Python. A pytest hook. An empty JSON object {} submitted 890 times.…
