Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it

DEV Community·sk8ordie84·about 1 month ago

A few weeks ago I was reading a model card for an open-weight code model. It claimed pass@1 = 67% on HumanEval. I tried to reproduce it. I got 54%.

I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.

Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to before they ran the eval, and how do I know the published 67% is not the best of three runs at three temperatures?

I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.

This is not a problem about that specific model card or those specific authors.…
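To make the missing information concrete, here is a minimal sketch of what committing to an eval plan before running it could look like. This is not the tool the title refers to; it only illustrates the idea of pinning the settings listed above (benchmark variant, temperature, seed, threshold, number of runs) and publishing a hash of them in advance. Every field name and value in `plan` is a hypothetical placeholder, not taken from any real model card.

```python
import hashlib
import json


def commit_eval_plan(plan: dict) -> str:
    """Return a SHA-256 digest of a canonical JSON encoding of the plan.

    Sorting keys and fixing separators makes the encoding independent of
    the order in which the fields were written down.
    """
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_eval_plan(plan: dict, published_digest: str) -> bool:
    """Check that a disclosed plan matches the digest published earlier."""
    return commit_eval_plan(plan) == published_digest


# Hypothetical plan: all names and values are illustrative assumptions.
plan = {
    "benchmark": "HumanEval",
    "num_problems": 164,   # which variant of the benchmark was used
    "metric": "pass@1",
    "temperature": 0.2,
    "top_p": 0.95,
    "seed": 1234,
    "num_runs": 1,         # rules out "best of three" after the fact
    "threshold": 0.60,     # the claim committed to before the eval ran
}

digest = commit_eval_plan(plan)  # publish this before running the eval
print(digest)
```

If the digest is published before the eval and the plan is disclosed afterwards, a reader can recompute the hash and confirm that the temperature, seed, and threshold were fixed in advance instead of relying on what the authors remember.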
