Across dozens of repeated executions, the same autonomous agent can flip between success and failure on the same task by a noticeable margin. The swing is not uniform; it widens dramatically on web‑navigation tasks, exposing a gap between headline scores and day‑to‑day reliability. Historically, progress reports have leaned on single‑run leaderboards: a model that solves a benchmark once is declared “state‑of‑the‑art.” Few works have logged the entire interaction history of developers or systematically replayed the same task under identical conditions. The SWE‑chat corpus of 6 000 real‑world coding sessions shows how fragile the single‑run assumption is. “Less than half (44.3%) of all agent‑produced code survives into user commits (Table 3)” [1]. Moreover, “Overall, users push back after 39% of turns, regardless of coding mode” [1], indicating frequent manual corrections and interruptions even when the agent is nominally competent. A complementary study of computer‑use agents confirms the phenomenon on a different front.…