Menu

Post image 1
Post image 2
1 / 2
0

Postmortem: An Ansible 2.16 Playbook Error Caused a 3-Hour Outage Across Our On-Prem Data Center

DEV Community·ANKUSH CHOUDHARY JOHAL·28 days ago
#POgswcId
Reading 0:00
15s threshold

At 14:17 UTC on October 12, 2024, a single Ansible 2.16.1 playbook task triggered a cascading failure that took down 87% of our on-prem data center workloads for 3 hours and 12 minutes, costing an estimated $142,000 in SLA penalties and lost revenue. 📡 Hacker News Top Stories Right Now How OpenAI delivers low-latency voice AI at scale (253 points) I am worried about Bun (387 points) Talking to strangers at the gym (1114 points) Securing a DoD contractor: Finding a multi-tenant authorization vulnerability (162 points) Agent Skills (77 points) Key Insights Ansible 2.16’s new default gather\_facts: smart\ behavior caused 42% slower fact gathering on RHEL 8.8 nodes, triggering timeouts. Ansible 2.16.1 introduced a regression in the yum\ module’s update\_cache\ parameter when used with become: yes\ and check\_mode: no\ . The outage cost $142k in SLA penalties, with 12,400 customer support tickets filed during the window.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More