LangChain jumped from 52.8% to 66.5% on Terminal Bench 2.0 by constraining their agent, not upgrading the model. Running at maximum reasoning budget actually scored worse . Three data points prove it: freedom is the enemy of AI agent reliability. Two approaches. Same model. Different results: # Approach A: Give the agent more freedom β Upgrade model, add more tools, increase context window β Remove guardrails so it "moves faster" β Result: unpredictable, rolls back 3x per session # Approach B: Give the agent more constraints β Same model, same tools, same context β Add: verification loop, compute budget, context injection β Result: 52.8% β 66.5% on Terminal Bench 2.0 ( LangChain, 2026 ) Enter fullscreen mode Exit fullscreen mode Every time a team complains about Claude Code "doing the wrong thing," I ask the same question: what stopped it from doing that? The answer is always nothing . The agent had the capability. Nothing prevented the action. The instinct is to want a smarter model.β¦