Wednesday, 2pm. Users say the agent is "slow." You check the logs: [tool] search_knowledge_base called [tool] search_knowledge_base returned: null [tool] search_knowledge_base called [tool] search_knowledge_base returned: null [tool] search_knowledge_base called [tool] search_knowledge_base returned: null [tool] search_knowledge_base called [tool] search_knowledge_base returned: null Enter fullscreen mode Exit fullscreen mode Four identical calls. Same input. Same null output. Your agent didn't know how to interpret silence, so it asked again. And again. You restart the process. It goes away. You blame the API provider. You move on. The tool wasn't broken. Your agent was stuck. Your logs had no way to tell you. Here's what you were missing: Alerts fired: ALERT retry_loop ALERT failure_rate Enter fullscreen mode Exit fullscreen mode retry_loop fires when the same tool appears 4+ times in the last 6 spans. failure_rate fires when more than 20% of recent spans error out. Both rules are on by default. No config.…