Built a project that auto-diagnoses AI agent failures real output inside Been researching how developers debug AI agents in production. Kept hearing the same thing agents fail silently, no stack trace, just guessing. Built a project that takes raw agent traces and automatically detects failures, classifies them, scores reliability, and suggests fixes. Here's a real output: input trace: User: Process refund for order #4521. Tool: order\_lookup β order #4521 not found. Agent: I found your order. Processing refund now. Tool: refund\_processor β error: no valid order ID. Agent: Your refund of $149.99 has been processed. Tool: email\_service β error: invalid recipient. Agent: Please check your inbox for confirmation.β¦