What You'll Build Multi-head attention (running several attention computations in parallel, each on its own slice of the per-token embedding vector) and the MLP block (a two-layer feed-forward network for per-position "thinking"). Both concepts are introduced here and implemented in Model.cs in Chapter 11. Depends On Chapters 5, 8, 9 (Helpers, RmsNorm, residual connections, single-head attention). Why Multiple Heads? A single attention head can only learn one kind of "what am I looking for?" pattern. With multiple heads, the model can look for different kinds of relationships at the same time. In larger models with bigger embedding dimensions, individual heads often specialise in distinct patterns (one might track syntax, another semantics). At our small scale (headDimension = 4), the specialisation is fuzzier, but the mechanism is the same. The trick: instead of running 4 full-size attention computations, we split the embedding dimension into 4 slices.…