Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it

DEV Community·sk8ordie84·about 1 month ago

#machinelearning #ai #python #hash #model

A few weeks ago I was reading a model card for an open-weight code model. It claimed pass@1 = 67% on HumanEval. I tried to reproduce it. I got 54%.

I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.

Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to before they ran the eval, and how do I know the published 67% is not the best of three runs at three temperatures?

I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.

This is not a problem about that specific model card or those specific authors.…
