Menu

Post image 1
Post image 2
1 / 2
0

AI agent logs expose reproducibility gaps

DEV Community·Papers Mache·26 days ago
#6AjTuo5V
Reading 0:00
15s threshold

Across dozens of repeated executions, the same autonomous agent can flip from success to failure by a noticeable margin. The swing is not uniform; it widens dramatically on web‑navigation , exposing a gap between headline scores and day‑to‑day reliability. Historically, progress reports have leaned on single‑run leaderboards: a model that solves a benchmark once is declared “state‑of‑the‑art.” Few works have logged the entire interaction history of developers or systematically replayed the same task under identical conditions. The SWE‑chat corpus of 6 000 real‑world coding sessions shows how fragile that assumption is. “Less than half (44.3%) of all agent‑produced code survives into user commits (Table 3)” [1] . Moreover, “Overall, users push back after 39% of turns, regardless of coding mode” [1] , indicating frequent manual corrections and interruptions even when the agent is nominally competent. A complementary study of computer‑use agents confirms the phenomenon on a different front.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More