1.0 Introduction

When professional creatives evaluate AI-generated work, their judgments produce two distinct signals. The first is convergence: evaluators agree on what works, revealing shared best practices like readable typography, functional layout, and correct visual hierarchy. The second is divergence: evaluators disagree, and that disagreement reflects genuine differences in taste, aesthetic direction, and creative intent. Most AI benchmarks treat the second signal as noise to be resolved. This paper proposes a framework for measuring both.

This distinction matters because creative work has no ground truth. The dimensions on which experts disagree (aesthetic direction, mood, conceptual risk) are not reducible to miscalibration or error [1][2]. Standard evaluation approaches, including majority voting, adjudication, and gold-standard reconciliation, treat evaluator disagreement as something to resolve [3][4]. These methods work where labels have objective answers.…