You ran ollama pull and saw phi4:Q4_K_M. The docs say it's a quantized version. The model page shows the file size. Neither tells you which one to pull or why the difference matters. Here's what the naming actually means.

The Q Number is Bits Per Weight

LLM quantization compresses model weights from full floating-point precision down to lower-bit representations, so the model fits in less VRAM without destroying output quality. The number after the Q is roughly how many bits each weight is stored in: Q4 is about 4 bits per weight and Q8 about 8, compared with 16 at FP16. A 7B model at FP16 needs roughly 14GB of VRAM for the weights alone (7 billion weights at 2 bytes each). At Q4_K_M, that same model loads in 4 to 4.5GB. That's not a marginal saving. That's the difference between a model that loads and one that refuses to load at all.

What Each Level Delivers

Q2 / Q3: Dramatic VRAM savings, significant quality loss. Q3 is not meaningfully better than Q2 for most tasks. If a model only fits at Q3, the better move is a smaller model at Q4.

Q4_K_M: The working standard. Strong output quality across drafting, summarization, coding, and reasoning.…
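The VRAM figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8, gives bytes. The sketch below is a rough estimator under stated assumptions, not how Ollama or llama.cpp actually measure memory. It counts weights only and ignores the KV cache and runtime buffers. The `overhead_gb` allowance in `fits` is a hypothetical placeholder, not a measured value. Working backwards, the 4 to 4.5GB Q4_K_M figure implies an effective rate of roughly 4.6 to 5.1 bits per weight, a bit above the nominal 4. As we understand llama.cpp's K-quant formats, that gap comes from per-block scale factors and from some tensors being kept at higher precision.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes.

    Ignores KV cache, activations, and runtime overhead, so real VRAM use
    will be somewhat higher than this number.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


def implied_bits_per_weight(params_billions: float, size_gb: float) -> float:
    """Inverse of weight_memory_gb: effective bits per weight for a given size."""
    return size_gb * 1e9 * 8 / (params_billions * 1e9)


def fits(params_billions: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """Rough check that weights plus an allowance fit in the available VRAM.

    overhead_gb is a hypothetical allowance for KV cache and runtime buffers;
    the real figure depends on context length and the runtime.
    """
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    print(f"7B at FP16:        {weight_memory_gb(7, 16):.1f} GB")  # 14.0 GB
    print(f"7B at 4-bit (nom): {weight_memory_gb(7, 4):.1f} GB")   # 3.5 GB
    lo = implied_bits_per_weight(7, 4.0)
    hi = implied_bits_per_weight(7, 4.5)
    print(f"Q4_K_M at 4-4.5GB implies {lo:.2f} to {hi:.2f} bits per weight")
    # On a hypothetical 8GB card: FP16 does not fit, nominal 4-bit does.
    print("FP16 fits in 8GB:", fits(7, 16, 8))
    print("4-bit fits in 8GB:", fits(7, 4, 8))
```

Treating bits per weight as a plain input keeps the estimate honest. The nominal Q number gives a lower bound, and the actual download size for a given tag is the better number to plug in when you have it.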