Originally published at theculprit.ai/blog/sql-primitives-for-incident-split-merge.

Every alert-correlation system gets things wrong. The interesting question is what the on-call engineer can do about it at 2 a.m.

The bad answer is: nothing. The system grouped seven events into one incident; six of them are the database connection-pool storm and the seventh is a totally unrelated TLS-handshake failure that happens to share some token overlap with the rest. The on-call sees one incident in the dashboard, acks it, fixes the connection pool, and goes back to sleep. The TLS failure quietly stops alerting because it is already attached to a "resolved" incident, and customers find it for you in the morning.

The slightly less-bad answer is: file a ticket with the vendor. Wait two weeks. They tweak a threshold somewhere. The same shape of mis-clustering happens again on a different pair of unrelated events.…