Every AI agent I have shipped has had the same lifecycle: it works well in testing, degrades quietly in production, and by the time someone notices, there are three weeks of bad outputs to explain. The fix most teams reach for is more evals and more monitoring. That helps, but it's still reactive. You're still waiting for something to break before you act.

I wanted to build something different: a pipeline that doesn't just detect failures but closes the loop automatically.

What "Self-Improving" Actually Means Here

I'm not talking about reinforcement learning or any kind of model fine-tuning. The loop I built works at the prompt and pipeline level. It detects when outputs fail evaluation thresholds, traces which component caused the failure, generates a fix, tests that fix against the same inputs, and deploys it if the metrics improve. The full cycle runs without manual intervention. A human reviews the summary, not the individual failures. That structure means the evaluator is not just a reporting layer.…
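
To make the shape of that loop concrete, here is a minimal sketch of one cycle: detect, trace, generate a fix, test on the same inputs, deploy if the metric improves. It is not the pipeline described in this post. Every name in it (`run_cycle`, `CycleSummary`, the `score`, `trace`, and `propose_fix` callables, and the dict-of-prompts pipeline shape) is a hypothetical stand-in chosen for illustration, and the toy functions at the bottom replace what would really be an evaluator and an LLM call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

# Assumed shape for illustration: component name -> prompt text.
Pipeline = Dict[str, str]


@dataclass
class CycleSummary:
    """What the human reviews instead of the individual failures."""
    failures: int
    component: Optional[str]
    score_before: Optional[float]
    score_after: Optional[float]
    deployed: bool


def run_cycle(
    pipeline: Pipeline,
    inputs: List[str],
    score: Callable[[Pipeline, str], float],
    trace: Callable[[Pipeline, List[str]], str],
    propose_fix: Callable[[Pipeline, str, List[str]], str],
    threshold: float,
) -> Tuple[Pipeline, CycleSummary]:
    # 1. Detect: collect inputs whose output scores below the threshold.
    failing = [x for x in inputs if score(pipeline, x) < threshold]
    if not failing:
        return pipeline, CycleSummary(0, None, None, None, False)
    before = mean(score(pipeline, x) for x in failing)

    # 2. Trace: ask which component is responsible for the failures.
    component = trace(pipeline, failing)

    # 3. Generate a fix: build a candidate without touching the live pipeline.
    candidate = {**pipeline, component: propose_fix(pipeline, component, failing)}

    # 4. Test: re-score the candidate on the same failing inputs.
    after = mean(score(candidate, x) for x in failing)

    # 5. Deploy only if the metric actually improved.
    deployed = after > before
    summary = CycleSummary(len(failing), component, before, after, deployed)
    return (candidate if deployed else pipeline), summary


# Toy stand-ins so the sketch runs end to end (not real evaluators).
def toy_score(p: Pipeline, x: str) -> float:
    return 1.0 if "cite" in p["answer"] else 0.2


def toy_trace(p: Pipeline, failing: List[str]) -> str:
    return "answer"


def toy_fix(p: Pipeline, component: str, failing: List[str]) -> str:
    return p[component] + " Always cite sources."


live = {"retrieve": "Find relevant passages.", "answer": "Answer the question."}
new_live, summary = run_cycle(
    live, ["q1", "q2"], toy_score, toy_trace, toy_fix, threshold=0.5
)
print(summary)
```

The candidate is built as a copy, so a fix that fails to improve the score leaves the live pipeline untouched, and the returned summary is the only thing a reviewer needs to read.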