Every LLM Eval Library Has the Same Bug: Stochastic Judges Used as Deterministic Oracles

DEV Community·Gabriel Anhaia·about 1 month ago
#ai #testing #llm #judge #rate #human

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub, an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You ran the eval. Pass rate is 87%. You ship.

Your colleague reruns the same suite, same model, same prompts, same dataset, ten minutes later. Pass rate is 81%. Then 84%. Then 89%. Nothing changed.

Your eval is non-deterministic and you didn't notice because the library you used printed one number with two decimal places and called it a metric. That number is whatever the LLM judge said on the first call. Ask it again and a fraction of verdicts flip. The bug isn't in any single library. The bug is the shape.…
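To see why one run is not a metric, here is a minimal sketch in Python. It does not call a real LLM: `noisy_judge` is a hypothetical stand-in that passes each item with a fixed probability, and the dataset mix (75 clear passes, 5 clear fails, 20 borderline items at 0.6) is an assumption chosen only for illustration. The names `run_suite` and `repeated_pass_rate` are likewise invented here and do not come from any eval library.

```python
import random
import statistics

# Hypothetical dataset: each number is the probability that the judge
# says "pass" for that item. 1.0 and 0.0 are stable verdicts; 0.6 marks
# a borderline item whose verdict can flip between calls.
DATASET = [1.0] * 75 + [0.0] * 5 + [0.6] * 20


def noisy_judge(pass_probability: float, rng: random.Random) -> bool:
    """Stand-in for an LLM judge call (an assumption, not a real API)."""
    return rng.random() < pass_probability


def run_suite(dataset: list[float], rng: random.Random) -> float:
    """Judge every item once and return the pass rate for this single run."""
    verdicts = [noisy_judge(p, rng) for p in dataset]
    return sum(verdicts) / len(verdicts)


def repeated_pass_rate(dataset: list[float], runs: int, seed: int = 0):
    """Run the whole suite `runs` times; return (mean, stdev, per-run rates)."""
    rng = random.Random(seed)
    rates = [run_suite(dataset, rng) for _ in range(runs)]
    mean = statistics.mean(rates)
    spread = statistics.stdev(rates) if runs > 1 else 0.0
    return mean, spread, rates


if __name__ == "__main__":
    mean, spread, rates = repeated_pass_rate(DATASET, runs=10, seed=42)
    print("per-run pass rates:", [f"{r:.0%}" for r in rates])
    print(f"mean {mean:.1%}, standard deviation {spread:.1%}")
```

In this sketch only the borderline items move, so every run lands between 75% and 95%, and the single number a library prints is one draw from that range. Reporting the mean together with the spread across runs is one way to make the variation visible; it is a sketch of the idea, not the approach of any particular tool.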
