TL;DR: Single-model code review has correlated blind spots. A panel of frontier models from different vendor families catches more, but only if a final verification step throws out the hallucinations the panel converged on. Crucible is a Claude Code skill that does both for about $0.30 a run. If GPT misses a bug, the runner-up GPT-class model usually misses it too. Same training data, same blind spots. The fix isn't a bigger model. It's a panel of structurally different ones. I built Crucible to test this. It's a Claude Code skill that walks a codebase file by file, runs each file through four frontier models from different vendor families, then has Claude verify the findings against the actual source before showing them to me. This post is about the orchestration tricks that made it actually work. ## The panel Four models, four families. As of writing, the default panel is: DeepSeek V4-Pro Google Gemini 3.1 Pro Moonshot Kimi K2.6 MiniMax M2.7 Family diversity matters more than raw benchmark scores.…