A production pipeline's failure mode is rarely a single, obvious crash. You see partial runs that produced artifacts with mixed lineage, long-running jobs killed by preemption, silent data corruption in artifact uploads, and engineers spending days reconstructing a single lost experiment rather than iterating on models.

Contents

Why ML training pipelines break in production
Design for restartability: idempotency, retries, and checkpointing
Treat preemption like an expected signal, not an exception
Observability-first: metrics, logs, traces, and automated recovery
Practical application: checklist and example workflows

Why ML training pipelines break in production

Failures fall into repeatable categories you must design against:

Resource preemption and spot or spot-like capacity. Clouds expose cheaper, interruptible compute (Spot, Preemptible).…