Postmortem: How Kubernetes 1.32 Service Mesh Latency Caused Timeout Errors

DEV Community·ANKUSH CHOUDHARY JOHAL·30 days ago

Incident Date: October 15, 2024
Duration: 2 hours 17 minutes (14:02 UTC – 16:19 UTC)
Impact Scope: Production Kubernetes cluster, 32% of user-facing requests affected

Executive Summary

On October 15, 2024, our production environment, running Kubernetes 1.32 with the Istio 1.21 service mesh, experienced a severe latency regression that triggered widespread HTTP 504 timeout errors. The incident lasted 2 hours 17 minutes and peaked at a 32% error rate for east-west service traffic. The root cause was traced to a kube-proxy iptables rule ordering regression in Kubernetes 1.32 that conflicted with Istio sidecar traffic interception and added more than 800ms of latency to service-to-service calls. An immediate rollback to Kubernetes 1.31 resolved the issue, and a permanent patch was applied to the 1.32 control plane within 48 hours.…
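To make the link between per-call latency and 504s concrete, here is a minimal sketch of how a fixed extra delay on every service-to-service call adds up along a sequential call chain. Only the 800ms figure comes from the summary above. The baseline latency, the chain depth, the 3-second timeout, and the function names are hypothetical values chosen for illustration; they are not taken from the incident.

```python
def end_to_end_latency_ms(hops: int, base_ms: float, added_ms: float) -> float:
    """Total latency of a sequential call chain in which every hop
    pays the same baseline latency plus the same extra delay."""
    return hops * (base_ms + added_ms)


def times_out(hops: int, base_ms: float, added_ms: float, timeout_ms: float) -> bool:
    """True when the chain's total latency exceeds the caller's timeout."""
    return end_to_end_latency_ms(hops, base_ms, added_ms) > timeout_ms


if __name__ == "__main__":
    ADDED_MS = 800.0     # extra latency per call, lower bound from the summary
    BASE_MS = 20.0       # hypothetical healthy per-call latency
    TIMEOUT_MS = 3000.0  # hypothetical gateway timeout

    for hops in range(1, 6):
        total = end_to_end_latency_ms(hops, BASE_MS, ADDED_MS)
        status = "504 timeout" if times_out(hops, BASE_MS, ADDED_MS, TIMEOUT_MS) else "ok"
        print(f"{hops} hop(s): {total:.0f} ms -> {status}")
```

Under these assumed numbers, requests that fan through four or more sequential hops exceed the timeout while shallower requests still succeed, which is one way a per-call regression can surface as a partial rather than total error rate.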
