The bug that took two weeks to surface A few months back I shipped a feature that used a language model to summarize support tickets and suggest responses. Internal QA loved it. The demo went great. Two weeks after launch, our support lead pinged me on Slack: "Are these summaries... making things up?" They were. Not always. Maybe one in fifty. But the ones that were wrong looked exactly as confident as the correct ones — same tone, same structure, same plausible-looking detail. A ticket about a failed payment got summarized as "user wants to cancel subscription." A complaint about slow load times got rephrased as "user reports outage in EU region." If you've shipped anything LLM-backed in production, this story is probably familiar. The model isn't broken. The benchmark scores look great. But the tail is full of confidently wrong answers, and your users are the ones finding them.…