War Story: How a Kubernetes 1.32 Node OOM Kill Cascaded Into a 2-Hour Outage for Our Video Streaming Service

DEV Community·ANKUSH CHOUDHARY JOHAL·about 1 month ago
#kubernetes #tip #story #node #memory #return

At 19:42 UTC on March 14, 2024, our video streaming service, which was serving 4.2 million concurrent viewers, lost 92% of its traffic in 11 minutes. The trigger was a single Kubernetes 1.32 node OOM kill that cascaded across 18 availability zones.

Key Insights

Kubernetes 1.32's kubelet memory accounting for sidecar containers under cgroups v2 underreports RSS by 22% in high-throughput network workloads.

Affected stack: kubelet v1.32.0, containerd 1.7.12, cgroups v2.0.3 on Ubuntu 22.04 LTS nodes.

Implementing pod-level memory limits with 15% headroom reduced OOM-related node failures by 94% and saved $27k/month in SLA penalties…
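To make the 22% underreporting and 15% headroom figures from the Key Insights concrete, here is a minimal arithmetic sketch. It is not the sizing procedure from the original post, which is not shown in this excerpt. It assumes that "underreports by 22%" means the reported value equals the true value times 0.78, and that "15% headroom" means the limit is the true peak times 1.15. The function names and the 780 MiB sample peak are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division that rounds up, avoiding float rounding surprises."""
    return -(-a // b)


def estimate_true_usage_mib(reported_mib: int, underreport_pct: int = 22) -> int:
    """Estimate real memory usage from an underreported metric.

    Assumption: reported = true * (1 - underreport_pct / 100),
    so true = reported * 100 / (100 - underreport_pct), rounded up.
    """
    return ceil_div(reported_mib * 100, 100 - underreport_pct)


def limit_with_headroom_mib(peak_mib: int, headroom_pct: int = 15) -> int:
    """Return a memory limit that sits headroom_pct above the peak.

    Assumption: headroom is added on top of the peak (limit = peak * 1.15).
    """
    return ceil_div(peak_mib * (100 + headroom_pct), 100)


def as_k8s_quantity(mib: int) -> str:
    """Format a MiB count using the Kubernetes 'Mi' quantity suffix."""
    return f"{mib}Mi"


if __name__ == "__main__":
    reported_peak_mib = 780  # hypothetical peak as seen in the metrics
    true_peak_mib = estimate_true_usage_mib(reported_peak_mib)
    limit_mib = limit_with_headroom_mib(true_peak_mib)
    print(as_k8s_quantity(limit_mib))
```

Under these assumptions, a limit derived only from the reported peak would sit below the estimated true peak, which is why the correction is applied before the headroom is added.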
