A few weeks ago I was reading a model card for an open-weight code model. It claimed pass@1 = 67% on HumanEval. I tried to reproduce it. I got 54%.

I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible. Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to before they ran the eval, and how do I know the published 67% is not the best of three runs at three temperatures?

I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.

This is not a problem with that specific model card or those specific authors.…