There is a class of AWS incident I have started calling the "everything looked fine in testing" failure. The pattern is consistent. You design a serverless API. Lambda function with sensible defaults, wired through API Gateway, pointing at DynamoDB. You test it in dev throughout the week. Latency is acceptable. Costs track to plan. Then a campaign drops, or a new enterprise customer brings their three thousand users on day one, and your traffic goes from 300 RPS to 3,000 RPS in under a minute. Lambda, which has never had to spin up more than a dozen concurrent environments at once, is now being asked to handle a hundred. Cold starts accumulate. p99 latency goes from 80ms to 2,400ms. API Gateway timeout windows close on in-flight requests. Customers see errors. The Slack channel fires. You spend a Saturday explaining to your CTO why the architecture that "passed all our tests" just fell over under a load it should have anticipated. I have been in this situation. Not once.β¦