What You'll Build

The attention mechanism: the only place in a transformer where a token at position t gets to look at tokens at positions 0..t-1. This is specifically self-attention, where the token attends to other tokens in the same sequence. (You might encounter "cross-attention" in other materials, which is used in encoder-decoder models where tokens attend to a different sequence. We don't use cross-attention here.)

Depends On

Chapters 1-2, 5 (Value, Helpers).

The Core Idea

Until now, each token has been processed independently. The token at position 3 has no idea what's at positions 0, 1, or 2. Attention fixes this by letting each token ask: "what earlier tokens are relevant to me?"

Because each token can only look backward (at positions before it, never ahead), this is called causal attention. The past can influence the future, but not the other way around.…
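To make the "look backward only" rule concrete, here is a minimal sketch in plain Python floats. It is not this chapter's implementation. The names `softmax` and `causal_attention` are hypothetical, the query, key, and value vectors are passed in directly rather than computed by learned projections, and the scores are scaled by the square root of the vector size (the usual scaled dot-product convention, assumed here). The sketch also lets each token attend to itself as well as to earlier positions, a common convention that is assumed rather than taken from the text.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Hypothetical helper: for each position t, mix values[0..t].

    queries, keys, values are lists of equal-length float vectors,
    one vector per position in the sequence.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Only positions 0..t are scored; later positions are never visible.
        # Including position t itself is an assumption of this sketch.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(t + 1)
        ]
        weights = softmax(scores)
        # The output is the weighted average of the visible value vectors.
        out = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Position 0 sees only itself, so its output is just its own value vector. Changing a later token leaves every earlier output untouched, which is exactly the "past can influence the future, but not the other way around" property.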