A colleague's paired_bootstrap function resamples one set of 48 task indices and applies it to both the trained LoRA scores and the baseline scores. The question: what mathematical property makes that the correct procedure, and would an unpaired bootstrap have changed the reviewer-facing conclusion?

The short answer: pairing is correct by experimental design. When the two score vectors have positive covariance, pairing reduces the model-based standard error of the difference, because Var(A − B) = Var(A) + Var(B) − 2 Cov(A, B). In this specific data the correlation is weak (r = 0.167), so the paired and unpaired bootstrap CIs are practically identical, and neither changes the reviewer-facing conclusion. Here is why, from first principles.

The experimental design justification: why pairing is valid at all

The 48 held-out tasks were not drawn independently for the baseline and then re-drawn independently for the trained LoRA. The same 48 tasks were evaluated under both systems.…
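The difference between the two resampling schemes can be sketched in a few lines. This is a minimal illustration, not the colleague's actual implementation: the body of paired_bootstrap, the unpaired_bootstrap helper, the n_boot and seed parameters, and the synthetic scores are all assumptions made for the example. The synthetic scores deliberately share a strong per-task effect, so they are far more correlated than the r = 0.167 reported above and make the effect of pairing easy to see.

```python
import numpy as np


def paired_bootstrap(trained, baseline, n_boot=10_000, seed=0):
    """Percentile 95% CI and bootstrap SE for mean(trained) - mean(baseline).

    One set of task indices is drawn per replicate and applied to BOTH
    score vectors, so each task's two scores always travel together.
    """
    trained = np.asarray(trained, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    if trained.shape != baseline.shape:
        raise ValueError("paired scores must have the same shape")
    n = trained.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=(n_boot, n))  # shared indices
    diffs = trained[idx].mean(axis=1) - baseline[idx].mean(axis=1)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi, diffs.std(ddof=1)


def unpaired_bootstrap(trained, baseline, n_boot=10_000, seed=0):
    """Same statistic, but each vector gets its own independent indices.

    This breaks the task-level link and treats the two samples as if
    they had been drawn from separate task pools.
    """
    trained = np.asarray(trained, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    rng = np.random.default_rng(seed)
    idx_t = rng.integers(0, trained.shape[0], size=(n_boot, trained.shape[0]))
    idx_b = rng.integers(0, baseline.shape[0], size=(n_boot, baseline.shape[0]))
    diffs = trained[idx_t].mean(axis=1) - baseline[idx_b].mean(axis=1)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi, diffs.std(ddof=1)


# Synthetic, hypothetical scores for 48 tasks (NOT the data discussed above).
# A shared per-task difficulty term induces positive covariance.
_rng = np.random.default_rng(42)
task_effect = _rng.normal(0.0, 1.0, size=48)
baseline_scores = task_effect + _rng.normal(0.0, 0.3, size=48)
trained_scores = task_effect + 0.2 + _rng.normal(0.0, 0.3, size=48)

paired_lo, paired_hi, paired_se = paired_bootstrap(trained_scores, baseline_scores)
unpaired_lo, unpaired_hi, unpaired_se = unpaired_bootstrap(trained_scores, baseline_scores)
print(f"paired   CI: [{paired_lo:.3f}, {paired_hi:.3f}]  SE: {paired_se:.3f}")
print(f"unpaired CI: [{unpaired_lo:.3f}, {unpaired_hi:.3f}]  SE: {unpaired_se:.3f}")
```

With strongly correlated scores like these, the paired interval should come out noticeably narrower. As the correlation shrinks toward zero the covariance term vanishes and the two intervals converge, which is the situation described for the real data.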