I injected three simultaneous faults into a multi-region serverless payment ledger and let AWS DevOps Agent investigate. One alarm, three root causes, 7 minutes to a complete finding. Here is exactly what it found.
Runbooks cover preparation. Post-mortems cover recovery. The investigation phase, correlating signals and finding the root cause while the primary region is actively down, still falls on whoever is on call.…