I ran 6 local Ollama models against strict code-gen prompts, then re-ran the most discriminating prompt 3 times on each. The single-shot winner was unstable, and the actual best performer was a general-purpose model the single-shot run had ranked 5th.

I've been picking models for a local Ollama pool that handles small, well-scoped coding chores delegated from a main agent. Before wiring routing rules into the agent, I wanted a defensible answer to "which model for which task family." So I built a tiny benchmark.

The interesting part wasn't the ranking. It was that the ranking changed once I added variance testing.

TL;DR

I ran 6 models against 3 strict, single-function prompts (auto-graded by I/O equivalence, 32 test cases). Then I ran the most discriminating prompt 3 times on every model.

Findings:

- The single-shot ranking placed qwen3.5:9b at the top and gemma4:latest 5th.
- Post-variance, gemma4:latest was the only model that was both perfect and byte-stable across runs.
- qwen3.5:9b produced byte-identical buggy code in 2 of 3 runs at temperature=0.2.…
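The two checks described above, I/O-equivalence grading and byte-stability across repeated runs, can be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual harness: the helper names (`load_function`, `pass_rate`, `stability`), the `clamp` task, and the three "runs" are all hypothetical stand-ins, and the generated sources are hard-coded rather than fetched from Ollama.

```python
import hashlib
from collections import Counter


def load_function(source: str, name: str):
    """Execute generated source in a fresh namespace and return one function.

    Note: exec() on model output is unsafe outside a sandbox; illustration only.
    """
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def pass_rate(candidate, reference, cases):
    """Fraction of input tuples on which candidate and reference agree."""
    passed = 0
    for args in cases:
        try:
            if candidate(*args) == reference(*args):
                passed += 1
        except Exception:
            pass  # any exception counts as a failed case
    return passed / len(cases)


def stability(sources):
    """Summarize how many byte-distinct outputs a non-empty set of runs produced."""
    digests = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sources]
    counts = Counter(digests)
    return {
        "runs": len(sources),
        "distinct": len(counts),
        "byte_stable": len(counts) == 1,
        "largest_cluster": counts.most_common(1)[0][1],
    }


# Hypothetical task and outputs: a correct clamp and a buggy variant.
REFERENCE_SRC = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
BUGGY_SRC = "def clamp(x, lo, hi):\n    return min(lo, max(x, hi))\n"
CASES = [(5, 0, 10), (-1, 0, 10), (11, 0, 10), (0, 0, 0)]

reference = load_function(REFERENCE_SRC, "clamp")
runs = [BUGGY_SRC, BUGGY_SRC, REFERENCE_SRC]  # stand-in for 3 runs of one model

report = stability(runs)
scores = [pass_rate(load_function(src, "clamp"), reference, CASES) for src in runs]
print(report)
print(scores)
```

Hashing the raw output separates "stable and correct" from "stable and wrong": a model can land on the same bytes in most runs and still fail the test cases, which a single-shot score cannot reveal.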