Claude scored higher. Llama felt better in the browser. The harder part was figuring out which one actually mattered.

One AI model scored 99. I still voted for the one that scored 95.

That should have made no sense. The higher-scoring build was technically cleaner, passed almost every automated evaluation check, and looked like the obvious winner on paper. The lower-scoring one came back with flagged quality issues, accessibility deductions, and enough small implementation compromises that it should have been easy to dismiss.

And yet after using both side by side, I trusted the lower-scoring app more.

That contradiction ended up being the most useful part of the exercise, because it exposed something developers are going to run into increasingly often as AI-generated software becomes easier and easier to produce: “looks good,” “scores good,” and “feels right” are three different judgments, and they do not always point to the same winner.…