How human feedback actually steers TTS fine-tuning

Notes on the iteration loop we ran while fine-tuning F5-TTS and StyleTTS2 on a small Northern English corpus. The headline finding is that the listening test isn't optional polish at the end; it's the only measurement that catches the failure modes that matter, and each round of listening produces specific phonetic observations that map to specific engineering decisions. This is a write-up of the methodology, with the concrete examples that forced each decision.

The loop

    ┌────────────────────────┐
    │  render passage        │
    │  (baseline + ft)       │
    └──────────┬─────────────┘
               ▼
    ┌────────────────────────┐   a feature is "right" if a native
    │  human listens against │   speaker recognises it. Record both
    │  marker list (BATH,    │ ◀───── what's working AND what's broken;
    │  FOOT-STRUT, …)        │   both are signal.
    └──────────┬─────────────┘
               ▼
    ┌────────────────────────┐   translate audible features
    │  diagnose: why is the  │   to training-side cause:
    │  output the way it is?…