I recently went through a fairly painful PostgreSQL 17 migration involving async replication, WAL retention issues, replica rebuilds, and long-running restore operations getting interrupted by timeout/configuration problems. I wrote a detailed postmortem covering: how replication lag gradually snowballed WAL recycling failure modes why conservative Docker/container limits backfired timeout settings that caused silent failures recovery/rebuild strategy and the operational lessons I took away from it Would genuinely appreciate feedback from people with more PostgreSQL operational experience.…