Chapter 10: Multi-Head Attention and the MLP Block


DEV Community·Gary Jackson·about 1 month ago


#csharp #machinelearning #transformers #value #head #list


What You'll Build

Multi-head attention (running several attention computations in parallel, each on its own slice of the per-token embedding vector) and the MLP block (a two-layer feed-forward network for per-position "thinking"). Both concepts are introduced here and implemented in Model.cs in Chapter 11.

Depends On

Chapters 5, 8, 9 (Helpers, RmsNorm, residual connections, single-head attention).

Why Multiple Heads?

A single attention head can only learn one kind of "what am I looking for?" pattern. With multiple heads, the model can look for different kinds of relationships at the same time. In larger models with bigger embedding dimensions, individual heads often specialise in distinct patterns (one might track syntax, another semantics). At our small scale (headDimension = 4), the specialisation is fuzzier, but the mechanism is the same.

The trick: instead of running 4 full-size attention computations, we split the embedding dimension into 4 slices.…
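The slicing idea can be sketched in a few lines of C#. This is a minimal illustration, not the Model.cs implementation from Chapter 11: the class name `HeadSlicing`, the methods `SliceForHead` and `Concatenate`, the use of plain `double` arrays, and the embedding size of 16 (4 heads times headDimension = 4) are all assumptions made for the example.

```csharp
using System;

// Hypothetical helper for illustration only; not taken from Model.cs.
public static class HeadSlicing
{
    // Returns the contiguous slice of `vector` that belongs to head `head`.
    // Head 0 gets indices [0, headDimension), head 1 the next block, and so on.
    public static double[] SliceForHead(double[] vector, int head, int headDimension)
    {
        int start = head * headDimension;
        if (head < 0 || start + headDimension > vector.Length)
            throw new ArgumentOutOfRangeException(nameof(head));

        var slice = new double[headDimension];
        Array.Copy(vector, start, slice, 0, headDimension);
        return slice;
    }

    // Joins the per-head outputs back into one embedding-sized vector,
    // in head order.
    public static double[] Concatenate(double[][] headOutputs)
    {
        int total = 0;
        foreach (var output in headOutputs) total += output.Length;

        var result = new double[total];
        int offset = 0;
        foreach (var output in headOutputs)
        {
            Array.Copy(output, 0, result, offset, output.Length);
            offset += output.Length;
        }
        return result;
    }
}
```

With an assumed 16-element embedding and headDimension = 4, `SliceForHead(embedding, 2, 4)` returns elements 8 through 11. Each head would run its own attention computation on its slice, and `Concatenate` reassembles the four 4-element outputs into a 16-element vector. Because each head works on a quarter of the vector, the total work stays close to that of one full-size head rather than four.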
