Menu

Post image 1
Post image 2
1 / 2
0

Two SQL primitives for when alert clustering gets it wrong

DEV Community·Stella Lin·24 days ago
#r9BthhXj
Reading 0:00
15s threshold

Originally published at theculprit.ai/blog/sql-primitives-for-incident-split-merge . Every alert-correlation system gets things wrong. The interesting question is what the on-call engineer can do about it at 2 a.m. The bad answer is: nothing. The system grouped seven events into one incident; six of them are the database connection-pool storm and the seventh is a totally unrelated TLS-handshake failure that happens to share some token overlap with the rest. The on-call sees one incident in the dashboard, ack's it, fixes the connection pool, and goes back to sleep. The TLS failure quietly stops alerting because it's already attached to a "resolved" incident, and customers find it for you in the morning. The slightly less-bad answer is: file a ticket with the vendor. Wait two weeks. They tweak a threshold somewhere. The same shape of mis-clustering happens again on a different pair of unrelated events.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More