Postmortem: How Kubernetes 1.32 Service Mesh Latency Caused Timeout Errors

Incident Date: October 15, 2024
Duration: 2 hours 17 minutes (14:02 UTC – 16:19 UTC)
Impact Scope: Production Kubernetes cluster, 32% of user-facing requests affected

Executive Summary

On October 15, 2024, our production environment, running Kubernetes 1.32 with the Istio 1.21 service mesh, experienced a severe latency regression that triggered widespread HTTP 504 timeout errors. The incident lasted 2 hours 17 minutes and peaked at a 32% error rate for east-west service traffic. The root cause was traced to a kube-proxy iptables rule ordering regression in Kubernetes 1.32 that conflicted with Istio sidecar traffic interception and added more than 800 ms of latency to service-to-service calls. An immediate rollback to Kubernetes 1.31 resolved the issue, and a permanent patch was applied to the 1.32 control plane within 48 hours.…
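
To illustrate how an extra 800 ms per service-to-service call can surface as HTTP 504 errors for only part of the traffic, here is a minimal sketch. Only the 800 ms figure comes from the summary; the baseline per-hop latency, the timeout value, and the function names are hypothetical and chosen purely for illustration, not measured during this incident.

```python
# Hypothetical illustration: every value except ADDED_LATENCY_MS is an assumption.
ADDED_LATENCY_MS = 800      # per-hop latency added during the incident (from the summary)
BASELINE_HOP_MS = 20        # assumed normal per-hop latency
GATEWAY_TIMEOUT_MS = 3000   # assumed edge timeout that produces an HTTP 504


def chain_latency_ms(hops: int, per_hop_ms: int) -> int:
    """Total latency of a sequential chain of service-to-service calls."""
    return hops * per_hop_ms


def times_out(hops: int, per_hop_ms: int, timeout_ms: int = GATEWAY_TIMEOUT_MS) -> bool:
    """Return True when the chain's total latency exceeds the timeout budget."""
    return chain_latency_ms(hops, per_hop_ms) > timeout_ms


if __name__ == "__main__":
    degraded_hop_ms = BASELINE_HOP_MS + ADDED_LATENCY_MS
    for hops in range(1, 6):
        normal = chain_latency_ms(hops, BASELINE_HOP_MS)
        degraded = chain_latency_ms(hops, degraded_hop_ms)
        print(f"hops={hops} normal={normal}ms degraded={degraded}ms "
              f"timeout={times_out(hops, degraded_hop_ms)}")
```

Under these assumed numbers, requests that fan through four or more sequential hops exceed the budget while shallower call chains only slow down, which is one plausible way a per-hop regression can produce a partial rather than total error rate.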