Dynamic batching for Encoder-Decoder MT training or generation when long sequences cap the batch size [P]

I built a small PyTorch sampler called **dynabatch** after running into this specific batching issue while fine-tuning an NLLB-200 600M model. Training on an RTX 5090, the largest fixed batch size I could use was 8; anything bigger led to OOM. While monitoring training with **nvidia-smi**, it looked like only a few batches were actually stressing the GPU. A lot of the time, utilization was much lower. My guess was that the fixed batch size was being dictated by the longest source/target examples, while the shorter examples probably had room for more samples per batch. So I tried to make the batch size change as the sequence lengths changed.…
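
To make the idea concrete, here is a rough sketch of length-aware batching under a padded-token budget. This is not dynabatch's actual implementation or API; the function name `length_aware_batches` and the parameters `max_padded_tokens` and `max_batch_size` are made up for illustration, and it assumes each example's cost can be summarized by a single length (for example, the longer of its source and target token counts).

```python
from typing import Iterator, List, Sequence


def length_aware_batches(
    lengths: Sequence[int],
    max_padded_tokens: int,
    max_batch_size: int = 64,
) -> Iterator[List[int]]:
    """Yield lists of example indices that fit a padded-token budget.

    Padded cost of a batch = (number of examples) * (longest example in it).
    All names here are hypothetical, not taken from dynabatch.
    """
    # Longest first, so the most memory-hungry batch is seen early
    # and an OOM shows up at the start of training rather than hours in.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    batch: List[int] = []
    longest = 0
    for idx in order:
        new_longest = max(longest, lengths[idx])
        fits = (len(batch) + 1) * new_longest <= max_padded_tokens
        # Close the current batch if adding this example would exceed
        # the token budget or the hard cap on batch size.
        if batch and (not fits or len(batch) >= max_batch_size):
            yield batch
            batch, new_longest = [], lengths[idx]
        batch.append(idx)
        longest = new_longest
    if batch:
        yield batch


# Two long examples and six short ones, with a budget of 1024 padded tokens.
example_lengths = [512, 500, 40, 38, 35, 30, 20, 10]
batches = list(length_aware_batches(example_lengths, max_padded_tokens=1024))
print([len(b) for b in batches])  # [2, 6]
```

With a fixed batch size, the two 512-token examples would force every batch down to 2. Here the short examples end up six to a batch instead. Lists of indices like these can be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument. An example longer than the whole budget still gets a batch of its own in this sketch, so the budget has to be chosen with the longest example in mind.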