The bf16 grad accumulator that killed our SDXL LoRA training


DEV Community: pytorch·Elise Moreau·3 days ago




TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause was bf16 gradient accumulation interacting badly with a custom adapter init we'd ported from a paper. Eval scores stayed in the same range the whole time, which is why nobody noticed.

The setup

We train SDXL LoRAs for product photography categories at Photoroom. Bottles, packaged food, soft goods. Each LoRA is 192 MB. Training stack: PyTorch 2.3, bf16 mixed precision, gradient accumulation across 8 steps, A100 80GB GPUs. The LoRA init follows a small modification of the OFT paper for better stability on small datasets. To be precise, we orthogonalize the down-projection before training begins, then let the up-projection drift freely. This had been working for nine months.

What broke

Six days into a 7-day run, our automated CLIPScore check started showing variance that was technically inside our acceptance band but trending the wrong way.…
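As background on why a bf16 accumulator can go wrong at all, here is a minimal sketch of the general rounding effect, not the specific failure diagnosed in this post. It uses only the standard library and simulates bf16 by rounding float32 bit patterns, so it runs without a GPU. The helper names (`to_bf16`, `accumulate`) and the gradient values are made up for illustration; the 8 micro-batches match the accumulation setting above.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bf16 is the top 16 bits of the float32 bit pattern, so we round
    the low 16 bits away. NaN/inf handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def accumulate(grads, in_bf16: bool) -> float:
    """Sum micro-batch gradients, optionally rounding the running
    total to bf16 after every addition (a bf16 accumulator)."""
    total = 0.0
    for g in grads:
        total += g
        if in_bf16:
            total = to_bf16(total)
    return total

# Hypothetical per-micro-batch gradients for one weight:
# one large contribution followed by seven small ones (8 steps total).
grads = [to_bf16(g) for g in [1.0] + [0.001] * 7]

bf16_sum = accumulate(grads, in_bf16=True)
# Python floats are float64; this stands in for a wider (e.g. fp32) accumulator.
wide_sum = accumulate(grads, in_bf16=False)

print(f"bf16 accumulator: {bf16_sum}")
print(f"wide accumulator: {wide_sum}")
```

Near 1.0, adjacent bf16 values are 2^-7 apart, so each small contribution is rounded away and the bf16 total stays at exactly 1.0, while the wider accumulator keeps all seven. A loss of this kind is small per step and does not produce NaNs or infs, so a finiteness check on the gradients would not flag it.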
