At 3:17 AM on a Tuesday in Q3 2024, our PagerDuty alert for kubelet_pod_startup_latency_seconds p99 > 5s paged 12 on-call engineers. We were running 9,872 pods on a 120-node AWS Graviton5 (c8g.24xlarge) Kubernetes 1.32 cluster, and the control plane was melting. Three hours later, the cluster was running 10,412 pods and p99 startup latency was down 62%. On top of that, our monthly AWS bill is $42k lower than it was on our previous x86-based cluster. Here's every mistake we made, every fix we shipped, and every line of code we wrote to get there.