Every LLM Eval Library Has the Same Bug: Stochastic Judges Used as Deterministic Oracles

DEV Community·Gabriel Anhaia·about 1 month ago

#ai #testing #llm #judge #rate #human

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You ran the eval. Pass rate is 87%. You ship.

Your colleague reruns the same suite, same model, same prompts, same dataset, ten minutes later. Pass rate is 81%. Then 84%. Then 89%.

Nothing changed. Your eval is non-deterministic and you didn't notice because the library you used printed one number with two decimal places and called it a metric. That number is whatever the LLM judge said on the first call. Ask it again and a fraction of verdicts flip.

The bug isn't in any single library. The bug is the shape.…
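To make the run-to-run drift concrete, here is a minimal simulation sketch. It is not taken from the article or from any eval library. `noisy_judge`, `run_suite`, `repeated_runs`, the 5% flip probability, and the 85/15 dataset split are all hypothetical, chosen only to illustrate how a judge that occasionally flips verdicts yields a different pass rate each time the same suite is rerun.

```python
import random
import statistics


def noisy_judge(true_verdict: bool, flip_prob: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for an LLM judge.

    Returns the underlying verdict, but flips it with probability flip_prob.
    A real judge's noise is not this simple; this is an assumption.
    """
    if rng.random() < flip_prob:
        return not true_verdict
    return true_verdict


def run_suite(true_verdicts: list[bool], flip_prob: float, rng: random.Random) -> float:
    """One eval run: judge every case exactly once and return the pass rate."""
    passes = sum(noisy_judge(v, flip_prob, rng) for v in true_verdicts)
    return passes / len(true_verdicts)


def repeated_runs(true_verdicts: list[bool], flip_prob: float,
                  n_runs: int, seed: int = 0) -> list[float]:
    """Rerun the same suite n_runs times; nothing changes except judge noise."""
    rng = random.Random(seed)
    return [run_suite(true_verdicts, flip_prob, rng) for _ in range(n_runs)]


if __name__ == "__main__":
    # 100 cases, 85 of which "should" pass (made-up numbers for illustration).
    dataset = [True] * 85 + [False] * 15
    rates = repeated_runs(dataset, flip_prob=0.05, n_runs=10)
    print("per-run pass rates:", [f"{r:.2f}" for r in rates])
    print(
        f"mean={statistics.mean(rates):.3f} "
        f"stdev={statistics.stdev(rates):.3f} "
        f"min={min(rates):.2f} max={max(rates):.2f}"
    )
```

Reporting the mean together with the spread across reruns, rather than the single number from the first run, is one way to surface the non-determinism; whether that matches the approach the full article recommends is not visible in this excerpt.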
