
A reader comment made me realise I'd only solved half the problem

DEV Community · Kriss · about 1 month ago
#devops #monitoring

Last month I wrote about the cron job failure mode nobody talks about: the job that doesn't die, it just drags.

The short version: a nightly ETL job at a previous employer took four hours instead of forty minutes for six days before anyone noticed. It ran. It completed. It exited zero. Every dashboard showed green. Downstream data was silently wrong.

The fix I described was duration anomaly detection: once you have a few weeks of run history, you know what "normal" looks like. A job that takes 4x its baseline is a signal even if it succeeded. I built DeadManCheck partly because I couldn't find a tool that combined silence detection with duration tracking.

The article got some traction. Then someone left a comment that stopped me in my tracks - https://dev.to/krissv/the-cron-job-failure-mode-nobody-talks-about-3p1a

The comment

The failure mode I keep seeing: the job runs, logs "complete," and the output silently goes nowhere. No error.…
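To make the duration check from the earlier article concrete, here is a minimal sketch of the idea that a run taking 4x its baseline is a signal even when it exits zero. It is not DeadManCheck's implementation: the function name `is_duration_anomaly`, the use of the median as the baseline, and the `min_runs` cutoff are all assumptions chosen for illustration.

```python
from statistics import median


def is_duration_anomaly(history_seconds, latest_seconds, factor=4.0, min_runs=14):
    """Return True when the latest run took at least `factor` times the baseline.

    Assumptions (not from the article): the baseline is the median of past
    run durations, and at least `min_runs` past runs are required before
    the check reports anything.
    """
    if len(history_seconds) < min_runs:
        # Not enough history yet to know what "normal" looks like.
        return False
    baseline = median(history_seconds)
    return latest_seconds >= factor * baseline


# Three weeks of forty-minute runs, then a four-hour run.
history = [40 * 60] * 21
slow_run = 4 * 60 * 60
if is_duration_anomaly(history, slow_run):
    print("duration anomaly: job succeeded but ran far longer than usual")
```

The median is used here only because a single earlier slow run moves it less than it would move a mean; a real tool might pick a different baseline or threshold.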
