I built a CLI that hashes your ML accuracy claims before the experiment runs Last month, a customer told me our model's accuracy on their data was 71%, not the 94% we had shipped on the landing page. I went back to the eval notebook. The threshold was still 0.94. The test set was named the same thing. But somewhere in the last three weeks, somebody had "refreshed" the test set, somebody else had tightened the metric definition, and the original 94% was now unreproducible. Not anybody's fault, exactly — just nobody had written down the contract before running the experiment. That night I started building falsify. Three days later I shipped it. This post is what I built, why I built it that small, and the one Python function that does most of the work. The problem in one sentence If you can change the spec after seeing the result, your accuracy claim is not falsifiable. And if it is not falsifiable, it is not really a claim — it is marketing.…