Menu

Post image 1
Post image 2
1 / 2
0

Postmortem: A Corrupted Loki 2.10 Log Store Caused 3 Days of Lost Debug Data

DEV Community·ANKUSH CHOUDHARY JOHAL·28 days ago
#rNkh6lj9
Reading 0:00
15s threshold

Postmortem: A Corrupted Loki 2.10 Log Store Caused 3 Days of Lost Debug Data Overview On October 12, 2024, at 09:15 UTC, our observability team detected anomalies in the Loki 2.10 log store cluster serving production debug data. By the time mitigation steps were fully effective, 3 full days of debug log data (October 12–14) had been corrupted and was unrecoverable, impacting incident response and root cause analysis for 14 concurrent production incidents. Impact The corrupted log store affected all production services that write debug-level logs to the Loki 2.10 cluster, including: 142 microservices across 3 cloud regions CI/CD pipeline debug logs for 217 failed deployments Customer support ticket debug traces for 892 high-priority tickets Total unrecoverable data loss: ~4.2 TB of debug logs. No customer-facing services were disrupted, as debug logs are non-critical for production runtime, but engineering velocity dropped by 62% due to lack of debug context for incident resolution.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More