Postmortem: Cloud Run Instance Exhaustion Caused 500 Error Spike Executive Summary On May 15, 2024, a production Cloud Run-hosted user authentication service experienced severe instance exhaustion, leading to a 72% spike in 500 Internal Server Error responses over a 45-minute window. The incident impacted ~12,000 user requests and degraded performance for three downstream high-traffic services. No data loss or corruption was reported. Incident Impact The incident lasted from 14:22 UTC to 15:07 UTC on May 15, 2024 (total 45 minutes). Key impact metrics: Baseline 500 error rate: 0.2% → Peak error rate: 72% Affected services: User Auth API, Payment Processing Endpoint, Dashboard Data Service Total failed requests: ~12,000 No customer data loss or financial impact reported Root Cause Analysis Investigation revealed the incident was caused by a combination of four compounding factors: Uncommunicated traffic surge: An unannounced marketing campaign launched at 14:20 UTC drove 2x normal traffic to the service, with…