If you cannot measure it, you cannot route it. Why offline evaluation is the difference between a code reviewer that improves over time and one the team dismisses within a sprint. Chat evaluations are vibes-based: thumbs-up on "was this helpful?" measured against no particular ground truth. Code review needs something stricter. A reviewer that flags five real bugs and one bogus warning is useful; one that flags one real bug and five bogus warnings is dismissed within a sprint. Offline evaluation answers the question before the reviewer ships. It tells you which model to route a given change to, when to escalate, and whether the system is getting better or worse over time. Without it, every routing decision is a guess. Building the evaluation set Start with past pull requests that carry human accept/reject outcomes. This is your ground truth.…