Day 9/60: Alerting Strategies -- Production Engineering

DEV Community·Naveen Karasu·20 days ago
#sre #prometheus #devops #tutorial #alert #rate

60 Day Production Engineering Challenge

Alert fatigue is the number one reason on-call rotations burn people out. Today I am covering the strategies that cut noise while keeping signal.

Symptom-Based Alerting with PromQL

Page on what users feel, not what servers report internally. Here is a burn rate alert that fires when your error budget is burning at 14.4x the allowed rate. The 0.001 factor is the error budget of a 99.9% SLO:

```promql
# Critical burn rate: at 14.4x, roughly 2% of a 30-day error budget burns in 1 hour
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)
```

The dual window (1h AND 5m) means you only page when the error rate has stayed high long enough to matter (the 1h window) AND is still high right now (the 5m window). The short window also lets the alert clear quickly once the problem is fixed.

Alertmanager Inhibition Rules

When a node dies, you do not need fifty alerts for every pod that was on it.…
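
As a side check on the 14.4 multiplier in the burn rate alert, the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 30-day SLO period and a 99.9% availability target (which is what the 0.001 in the query implies). The function names are hypothetical and not part of Prometheus or any library.

```python
# Assumption: the SLO is measured over a 30-day period.
HOURS_PER_PERIOD = 30 * 24  # 720 hours


def budget_fraction_consumed(burn_rate: float, window_hours: float,
                             period_hours: float = HOURS_PER_PERIOD) -> float:
    """Fraction of the period's error budget used if burn_rate holds for window_hours."""
    return burn_rate * window_hours / period_hours


def hours_to_exhaustion(burn_rate: float,
                        period_hours: float = HOURS_PER_PERIOD) -> float:
    """Hours until the whole error budget is gone at a constant burn_rate."""
    return period_hours / burn_rate


def error_ratio_threshold(burn_rate: float, slo_target: float) -> float:
    """Error ratio to compare against in the alert: burn_rate * (1 - SLO)."""
    return burn_rate * (1.0 - slo_target)


if __name__ == "__main__":
    print(f"budget used in 1h at 14.4x: {budget_fraction_consumed(14.4, 1):.1%}")
    print(f"hours to exhaust budget at 14.4x: {hours_to_exhaustion(14.4):.1f}")
    print(f"alert threshold for 99.9% SLO: {error_ratio_threshold(14.4, 0.999):.4f}")
```

Under these assumptions, a sustained 14.4x burn uses about 2% of the budget per hour and would exhaust it in about 50 hours, and the threshold works out to the same 14.4 * 0.001 used in the query. With a different SLO target or period, the multiplier and threshold would need to be recomputed.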
