I’m dealing with a few K8s CronJobs that are important, but not all of them are “wake someone up at 3 a.m.” important.
Some fail once and recover on the next run; some get delayed; some quietly stop being useful long before they technically fail. I’m trying to find a sane line between “ignore it” and “page for every hiccup.”
If you run a lot of CronJobs, how do you decide what becomes a ticket, what becomes an alert, and what becomes a page?