agent system seriously fail in production, it wasn’t dramatic. There was no crash. No error message. The system just kept running and producing outputs that looked reasonable until someone actually read them carefully enough to notice something was off. When we decided to look into it, it took us two days’ worth of debugging to figure out what was going on. Funny enough, the model wasn’t hallucinating, and the input-output tools were delivering the correct results. The problem, when we finally found it, was architectural. The model and the tools were set up correctly, but the idea was that reasoning would tie the whole thing together, which, as you would guess, obviously failed. Turns out reasoning does not do that sort of thing. That experience is what I keep coming back to when I think about why so many AI agents that work in demos don’t really survive real-world use. It’s not a capability problem. It’s an architectural one.…