Why I used three different critic roles instead of one (and what the eval taught me)

I built Crucible over a weekend: three specialized critic agents that audit any LLM output in parallel, an adjudicator that synthesizes their critiques into a confidence-scored verdict, and an eval harness that measures whether the whole thing actually works better than just asking a single model to check itself. Here is what I learned, including the part where the honest answer is "not as much as I hoped."

The problem: a model cannot reliably audit its own blind spots

When a language model generates output, it has already committed to a direction. Ask it to self-review and it will often ratify the same confident mistake it just made, not because it is lazy, but because self-review activates the same internal heuristics that produced the error.…