Menu

Post image 1
Post image 2
1 / 2
0

Postmortem: How a Corrupted Node Modules Folder Caused 3-Hour Outage for Our CI Pipeline

DEV Community·ANKUSH CHOUDHARY JOHAL·28 days ago
#iaQaWxRz
Reading 0:00
15s threshold

Postmortem: How a Corrupted Node Modules Folder Caused 3-Hour Outage for Our CI Pipeline Published: October 26, 2024 | Author: DevOps Team | 5 min read Executive Summary On October 24, 2024, our team experienced a 3-hour, 12-minute outage of our primary CI/CD pipeline, impacting 47 active pull requests and delaying 3 production releases. The root cause was identified as a corrupted node_modules directory on our shared CI runner, triggered by an incomplete npm install process during a high-concurrency job spike. Timeline of Events All times are in UTC: 14:02 – CI pipeline begins failing for all new jobs with MODULE_NOT_FOUND errors for core dependencies like express and jest . 14:07 – On-call engineer acknowledges alert, starts investigating failed job logs. 14:15 – Initial hypothesis: Recent dependency update to @company/utils v2.1.0 broke compatibility. Rollback attempt fails. 14:28 – Engineer notices node_modules on the primary shared runner has 0-byte files for 12+ dependencies.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More