
LLM Quantization Explained: What Q4, Q5, and Q8 Actually Mean for Your GPU

DEV Community·EngineeredAI·25 days ago

You ran ollama pull and saw phi4:Q4_K_M. The docs say it's a quantized version. The model page shows the file size. Neither tells you which one to pull or why the difference matters. Here's what the naming actually means.

The Q Number is Bits Per Weight

LLM quantization is a method of compressing model weights from full floating-point precision down to lower bit representations so the model fits in less VRAM without destroying output quality. A 7B model at FP16 needs roughly 14GB of VRAM. At Q4_K_M, that same model loads in 4 to 4.5GB. That's not a marginal savings. That's the difference between a model loading at all and refusing to load entirely.

What Each Level Delivers

Q2 / Q3: Dramatic VRAM savings, significant quality loss. Q3 is not meaningfully better than Q2 for most tasks. If a model only fits at Q3, the better move is a smaller model at Q4.

Q4_K_M: The working standard. Strong output quality across drafting, summarization, coding, and reasoning.…
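The 14GB and 4 to 4.5GB figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. Here is a minimal sketch of that calculation. The function name weight_size_gb is hypothetical, and the 4.5 bits-per-weight value used for Q4_K_M is an assumed average (K-quants mix block formats, so the effective figure sits a little above 4). The estimate covers weights only; the KV cache and runtime overhead add to it.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Estimate the size of the model weights alone, in decimal gigabytes.

    Does not include KV cache, activations, or runtime overhead.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    total_bytes = total_bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Bits-per-weight values are illustrative assumptions, not exact format specs.
    levels = [
        ("FP16", 16.0),
        ("Q4_K_M (assumed ~4.5 bits/weight)", 4.5),
    ]
    for label, bits in levels:
        print(f"7B at {label}: about {weight_size_gb(7, bits):.2f} GB of weights")
```

Leaving headroom on top of the weight estimate for context length and the runtime itself is why the loaded footprint lands at the upper end of the range rather than at the raw weight size.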
