Beginner-to-intermediate question. "Set the random seed" is the textbook answer, but in practice that only fixes one variable.
What actually breaks reproducibility in your experience?
- Different CUDA versions (already a known issue)
- Non-deterministic GPU library kernels (cuDNN determinism flags)
- Data version drift (dataset got updated, you didn't notice)
- Threshold/metric definition shift (someone redefined "accuracy" in code)
- Non-determinism in the eval harness itself
I'm trying to build a mental model of which of these matters most for which kind of work.
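
To make the data-drift item concrete, here is a rough sketch of the kind of check I have in mind: record a small manifest per run (seed, dataset hash, Python version) and diff it against the previous run. The helper names (`fingerprint_bytes`, `build_manifest`, `diff_manifests`) and the manifest fields are my own hypothetical choices, not from any library, and the toy in-memory CSV stands in for real dataset files.

```python
import hashlib
import platform


def fingerprint_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(seed: int, dataset_bytes: bytes) -> dict:
    """Collect the facts a later run needs in order to detect drift.

    The choice of fields is an assumption for illustration; a real setup
    would likely also record library and driver versions.
    """
    return {
        "seed": seed,
        "dataset_sha256": fingerprint_bytes(dataset_bytes),
        "python": platform.python_version(),
    }


def diff_manifests(old: dict, new: dict) -> list:
    """Return the sorted keys whose values differ between two manifests."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Toy in-memory "dataset"; a real run would hash the files on disk.
baseline = build_manifest(seed=0, dataset_bytes=b"id,label\n1,cat\n2,dog\n")
updated = build_manifest(seed=0, dataset_bytes=b"id,label\n1,cat\n2,cat\n")

# Same seed, but one label changed upstream, so only the hash differs.
print(diff_manifests(baseline, updated))  # ['dataset_sha256']
```

The seed is identical in both manifests, yet the run is not comparable, which is the "only fixes one variable" problem in miniature.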