TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause was bf16 gradient accumulation interacting badly with a custom adapter init we'd ported from a paper. Eval scores stayed in the same range the whole time, which is why nobody noticed.

The setup

We train SDXL LoRAs for product photography categories at Photoroom. Bottles, packaged food, soft goods. Each LoRA is 192MB. Training stack: PyTorch 2.3, bf16 mixed precision, gradient accumulation across 8 steps, A100 80GBs.

The LoRA init follows a small modification of the OFT paper for better stability on small datasets. To be precise, we orthogonalize the down-projection before training begins, then let the up-projection drift freely. This had been working for nine months.

What broke

Six days into a 7-day run, our automated CLIPScore check started showing variance that was technically inside our acceptance band but trending the wrong way.…
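To make the "inside the band but trending the wrong way" failure mode concrete, here is a minimal sketch of a check that looks at the slope of recent eval scores as well as band membership. It is not our actual monitoring code: the function names, the band limits, the slope threshold, and the sample scores are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def linear_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def check_scores(scores, band_low, band_high, min_slope):
    """Report band membership and a separate trend alert.

    All thresholds are hypothetical; min_slope is the most negative
    per-checkpoint slope we are willing to tolerate.
    """
    in_band = all(band_low <= s <= band_high for s in scores)
    slope = linear_slope(scores)
    return {
        "in_band": in_band,
        "slope": slope,
        "trend_alert": slope < min_slope,
    }


# Hypothetical scores: every point passes the band check,
# yet the series declines steadily from checkpoint to checkpoint.
example_scores = [0.310, 0.308, 0.307, 0.305, 0.303, 0.301]
example_report = check_scores(
    example_scores, band_low=0.29, band_high=0.33, min_slope=-0.001
)
```

With these made-up numbers the band check passes while the trend alert fires, which is the situation a band-only check cannot see.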