I ran 10 games between two AI agents. Agent v3 went 5-5 against Agent v1. I reported "v3 ties v1, no measurable improvement, don't merge."

That conclusion was wrong. Not because v3 was secretly better or worse, but because 10 games told me almost nothing at all. Here's the math I should have done first.

The win-rate trap

The obvious metric for comparing two agents is win rate. Agent A beats Agent B 50% of the time? They're even. 70%? A is better. Simple.

Except win rate has a confidence interval, and at small N that interval is enormous. The Wilson score interval gives a reasonable bound for binary outcomes:

    import math

    def wilson_interval(wins, total, z=1.96):
        """95% confidence interval for true win probability."""
        if total == 0:
            return (0.0, 1.0)
        p = wins / total
        denom = 1 + z**2 / total
        center = (p + z**2 / (2 * total)) / denom
        spread = z * math.…
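
To make the 5-5 result concrete, here is a minimal, self-contained sketch of the textbook Wilson score formula applied to that record. It is written independently of the listing above, which is cut off, so the function name `wilson_bounds` and the clamping to [0, 1] are assumptions for illustration, not a reconstruction of the original code.

```python
import math


def wilson_bounds(wins, total, z=1.96):
    """Standard Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z=1.96 corresponds to ~95% coverage.
    """
    if total == 0:
        # No games played: the win probability could be anything.
        return (0.0, 1.0)
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to the valid probability range to guard against rounding.
    return (max(0.0, center - half_width), min(1.0, center + half_width))


if __name__ == "__main__":
    low, high = wilson_bounds(5, 10)
    print(f"5-5 over 10 games: true win rate plausibly in [{low:.2f}, {high:.2f}]")
```

Under these assumptions, a 5-5 record is consistent with a true win rate anywhere from roughly 0.24 to 0.76, which is why 10 games cannot distinguish "even" from "clearly better" or "clearly worse".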