Introduction

How do you know if your AI-generated translation is actually good? Traditional metrics like BLEU measure n-gram overlap with reference translations, but they largely miss fluency, context, and cultural nuance. A translation can score well on BLEU and still read like gibberish to a native speaker.

This is where LLM-as-a-Judge comes in: using a large language model to evaluate the quality of another model's output. In this tutorial, we'll build a practical evaluation pipeline that scores translation quality across multiple dimensions using Claude as the judge. By the end, you'll have a working system you can plug into any translation workflow.

What We're Building

A Python-based evaluation pipeline that:

- Accepts a source text and its translated output.
- Sends both to an LLM judge with a structured scoring prompt.
- Returns scores for fluency, accuracy, and cultural appropriateness.
- Logs results to a simple JSON file for tracking over time.…
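Before wiring in a real model, the four steps above can be sketched with the judge left as a plain callable. This is a minimal sketch under stated assumptions, not the tutorial's final implementation: the function names (`build_prompt`, `parse_scores`, `evaluate`, `stub_judge`), the 1 to 5 score range, the prompt wording, and the JSON Lines log format are all illustrative choices. The `judge` parameter is any function that takes a prompt string and returns the model's reply as text, so a real Claude call can be swapped in later without touching the rest of the pipeline.

```python
import json
import re
from pathlib import Path
from typing import Callable, Optional

# The three dimensions named in the outline above.
DIMENSIONS = ("fluency", "accuracy", "cultural_appropriateness")


def build_prompt(source: str, translation: str) -> str:
    """Build a structured scoring prompt (wording is an assumption)."""
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        "You are evaluating a translation.\n\n"
        f"Source text:\n{source}\n\n"
        f"Translation:\n{translation}\n\n"
        "Score each dimension from 1 (poor) to 5 (excellent). "
        f"Reply with a JSON object with exactly these keys: {keys}."
    )


def parse_scores(reply: str) -> dict:
    """Extract the first JSON object from the reply and validate the scores."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    scores = {}
    for dim in DIMENSIONS:
        value = int(data[dim])  # KeyError if the judge omitted a dimension
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} score out of range: {value}")
        scores[dim] = value
    return scores


def evaluate(
    source: str,
    translation: str,
    judge: Callable[[str], str],
    log_path: Optional[str] = None,
) -> dict:
    """Score one translation and optionally append the record to a log file."""
    reply = judge(build_prompt(source, translation))
    record = {
        "source": source,
        "translation": translation,
        "scores": parse_scores(reply),
    }
    if log_path is not None:
        # One JSON object per line (JSON Lines), so runs can be appended.
        with Path(log_path).open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call; returns a fixed reply."""
    return 'Scores: {"fluency": 4, "accuracy": 5, "cultural_appropriateness": 3}'


if __name__ == "__main__":
    result = evaluate("Bonjour le monde", "Hello world", stub_judge)
    print(result["scores"])
```

Passing the judge in as a function keeps prompt construction, parsing, and logging testable offline. The regular expression in `parse_scores` tolerates replies that wrap the JSON in extra text, which judges often do.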