Two SQL primitives for when alert clustering gets it wrong

1 / 2

Two SQL primitives for when alert clustering gets it wrong

DEV Community·Stella Lin·24 days ago

#r9BthhXj

#aiops #postgres #sre #incidentmanagement #incident #split

Reading 0:00

15s threshold

Originally published at theculprit.ai/blog/sql-primitives-for-incident-split-merge . Every alert-correlation system gets things wrong. The interesting question is what the on-call engineer can do about it at 2 a.m. The bad answer is: nothing. The system grouped seven events into one incident; six of them are the database connection-pool storm and the seventh is a totally unrelated TLS-handshake failure that happens to share some token overlap with the rest. The on-call sees one incident in the dashboard, ack's it, fixes the connection pool, and goes back to sleep. The TLS failure quietly stops alerting because it's already attached to a "resolved" incident, and customers find it for you in the morning. The slightly less-bad answer is: file a ticket with the vendor. Wait two weeks. They tweak a threshold somewhere. The same shape of mis-clustering happens again on a different pair of unrelated events.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Create free account Log in

Menu

Two SQL primitives for when alert clustering gets it wrong