Chapter 9: Single-Head Attention - Tokens Looking at Each Other

DEV Community·Gary Jackson·about 1 month ago

What You'll Build

The attention mechanism: the only place in a transformer where a token at position t gets to look at tokens at positions 0..t-1. This is specifically self-attention, where the token attends to other tokens in the same sequence. (You might encounter "cross-attention" in other materials, which is used in encoder-decoder models where tokens attend to a different sequence. We don't use cross-attention here.)

Depends On

Chapters 1-2, 5 (Value, Helpers).

The Core Idea

Until now, each token has been processed independently. The token at position 3 has no idea what's at positions 0, 1, or 2. Attention fixes this by letting each token ask: "what earlier tokens are relevant to me?" Because each token can only look backward (at positions before it, never ahead), this is called causal attention. The past can influence the future, but not the other way around.…
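To make the backward-only rule concrete, here is a minimal sketch of causal single-head attention in plain Python. It is not the chapter's implementation: it uses plain floats rather than the Value class from earlier chapters, and the names `softmax` and `causal_attention` are hypothetical. It also assumes the common convention that a token attends to itself as well as to earlier positions, and that scores are scaled by the square root of the vector size.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head causal self-attention over one sequence.

    queries, keys, values: lists of equal length, one vector (list of
    floats) per position. Returns (outputs, weights), where weights[t]
    holds the attention weights of position t over positions 0..t.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Score the query at position t against keys 0..t only.
        # Positions after t are never touched: that is the causal rule.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[j])) / math.sqrt(d)
            for j in range(t + 1)
        ]
        w = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        out = [
            sum(w[j] * values[j][c] for j in range(t + 1))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy sequence of three positions with 2-dimensional vectors (assumed data).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(q, k, v)
```

In this sketch, position 0 can see only itself, so its output equals its own value vector, and changing the vectors at a later position leaves the outputs of all earlier positions unchanged.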
