IBM's Granite 4.1 release exposes something the industry keeps forgetting: parameter count is a vanity metric. Its 8B instruct model matches or beats IBM's own previous 32B-A9B MoE variant. Same capabilities, one-fourth the total parameters.

Most teams still chase the pre-training lottery. Dump more tokens, add more layers, scale the cluster. Granite 4.1 took the opposite path. Fifteen trillion tokens, yes, but fed through five distinct phases in which the data-quality bar tightened progressively, like a vise.

The architecture isn't revolutionary. Grouped Query Attention, RoPE, SwiGLU, RMSNorm. Standard components assembled competently. Where Granite diverges is the post-training stack. Supervised fine-tuning on 4.1 million samples, each scored through an LLM-as-Judge pipeline. Then a multi-stage RL pipeline using on-policy GRPO with DAPO loss.

The long-context extension to 512K tokens is equally methodical. Staged expansion: 32K, then 128K, then 512K. A model merge after each stage to preserve short-context performance.…
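To make "on-policy GRPO with DAPO loss" concrete, here is a minimal sketch in plain Python of the two ideas as they are usually described: GRPO scores each sampled response relative to the other responses for the same prompt, and DAPO averages a clipped objective over tokens with separate lower and upper clip bounds. This is an illustration, not IBM's training code. The function names, the use of population standard deviation, and the clip values 0.2 and 0.28 are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group.

    `rewards` holds one scalar reward per response sampled for a single prompt.
    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """DAPO-style dynamic sampling check (hypothetical helper name).

    A group whose rewards are all identical yields zero advantages,
    so such prompts are typically dropped and resampled.
    """
    return len(set(rewards)) > 1


def dapo_token_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate loss with decoupled clip bounds.

    `ratios[i]` lists the per-token probability ratios pi_new / pi_old for
    response i; `advantages[i]` is that response's group-relative advantage.
    The loss is averaged over all tokens in the group, not per response,
    so long responses are not down-weighted.
    """
    total, n_tokens = 0.0, 0
    for token_ratios, adv in zip(ratios, advantages):
        for rho in token_ratios:
            clipped = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
            # Pessimistic (minimum) of the unclipped and clipped objectives.
            total += min(rho * adv, clipped * adv)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("no tokens to average over")
    return -total / n_tokens


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]          # e.g. pass/fail from a verifier
    if has_learning_signal(rewards):
        adv = group_relative_advantages(rewards)
        ratios = [[1.0, 1.1], [0.9], [1.0], [1.4, 1.0]]
        print(dapo_token_loss(ratios, adv))
```

In a real trainer the ratios come from the policy's log-probabilities and the loss is computed on tensors; the arithmetic per token is the same.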
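The merge step after each context-length stage can be pictured as blending the newly extended checkpoint back toward the one that still performs well on short inputs. The text above does not say which merge method was used, so the sketch below assumes the simplest option, linear interpolation of matching weights. The function name, the `alpha` parameter, and the dictionary-of-lists checkpoint format are all hypothetical.

```python
def merge_checkpoints(short_ctx, long_ctx, alpha=0.5):
    """Linearly interpolate two checkpoints with identical parameter names.

    Each checkpoint maps a parameter name to a flat list of floats.
    alpha = 0.0 returns the short-context weights, alpha = 1.0 the
    long-context weights. Assumption: plain linear interpolation; the
    actual merge method is not specified in the text.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if short_ctx.keys() != long_ctx.keys():
        raise ValueError("checkpoints have different parameter names")

    merged = {}
    for name, short_w in short_ctx.items():
        long_w = long_ctx[name]
        if len(short_w) != len(long_w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * s + alpha * l for s, l in zip(short_w, long_w)
        ]
    return merged


if __name__ == "__main__":
    before = {"layer0.weight": [0.0, 2.0]}   # checkpoint before the stage
    after = {"layer0.weight": [1.0, 4.0]}    # checkpoint after the stage
    print(merge_checkpoints(before, after, alpha=0.25))
```

Applied once per stage (32K, 128K, 512K), a merge like this trades a little long-context gain for retained short-context behavior, with `alpha` controlling the balance.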