War Story: Running 10k Kubernetes 1.32 Pods on AWS Graviton5 – Lessons Learned

DEV Community·ANKUSH CHOUDHARY JOHAL·28 days ago

#tip #story #running #kubernetes #pods #node

At 3:17 AM on a Tuesday in Q3 2024, our PagerDuty alert for kubelet_pod_startup_latency_seconds p99 > 5s fired and paged 12 on-call engineers. We were running 9,872 pods on a 120-node AWS Graviton5 (c8g.24xlarge) Kubernetes 1.32 cluster, and the control plane was melting. Three hours later, we had reached 10,412 pods, reduced p99 startup latency by 62%, and cut our monthly AWS bill by $42k compared to our previous x86-based cluster. Here's every mistake we made, every fix we shipped, and every line of code we wrote to get there.
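For readers unfamiliar with how a "p99 > 5s" alert is evaluated, here is a minimal sketch of estimating a quantile from cumulative histogram buckets by linear interpolation, in the spirit of what Prometheus-style histogram queries do. The function name `estimate_quantile`, the constant `ALERT_THRESHOLD_SECONDS`, and the bucket counts are all hypothetical illustrations, not the article's actual alert rule or data.

```python
import math

# Assumption: threshold taken from the "p99 > 5s" alert described above.
ALERT_THRESHOLD_SECONDS = 5.0


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket. The lowest bucket is assumed to
    start at 0 seconds.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical pod startup latency buckets (seconds, cumulative count).
example_buckets = [
    (0.5, 400),
    (1.0, 700),
    (2.0, 900),
    (5.0, 980),
    (10.0, 1000),
    (math.inf, 1000),
]

p99 = estimate_quantile(0.99, example_buckets)
alert_firing = p99 > ALERT_THRESHOLD_SECONDS
print(f"estimated p99 = {p99:.2f}s, alert firing: {alert_firing}")
```

With these made-up counts, 20 of 1,000 pod starts took longer than 5 seconds, so the 99th percentile lands between the 5s and 10s bucket bounds and the alert condition holds. The estimate is only as fine-grained as the bucket boundaries allow.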
