Self-attention already helps a transformer understand relationships between words using Query, Key, and Value. But there’s a problem. One attention mechanism usually ends up focusing on a limited kind of relationship at a time. Language doesn’t work like that. A sentence can have structure, meaning, and long-range links all at once. That’s why transformers use multi-head attention.

What happens in multi-head attention

Instead of doing attention once, the model does it multiple times in parallel. Each run is called a head, and each head has its own learned weights for Query, Key, and Value. So every head looks at the same sentence, but in its own way.…
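
To make the idea of "same sentence, different heads" concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not a reference implementation: the helper names (`matmul`, `attention_head`, `multi_head_attention`), the toy sizes, and the random weights standing in for learned ones are all made up for this example. Each head is combined by concatenating its output per token, which is the usual approach; the final output projection that normally follows is left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: project x to Q, K, V, then run scaled dot-product attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = matmul(q, k_t)  # one row of scores per token
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads):
    """Run every head on the same input and concatenate the outputs per token."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    combined = []
    for i in range(len(x)):
        row = []
        for out in outputs:
            row.extend(out[i])
        combined.append(row)
    return combined


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights: uniform random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration only.
rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

x = random_matrix(seq_len, d_model, rng)  # 4 "tokens", 8 features each
# Each head gets its own (W_q, W_k, W_v) triple.
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]

out = multi_head_attention(x, heads)
print(len(out), len(out[0]))  # tokens, features per token after concatenation
```

Because every head has its own weight matrices, the attention weights returned by `attention_head` differ from head to head even though the input `x` is identical, which is the point of running several heads side by side.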