Chapter 9: Single-Head Attention - Tokens Looking at Each Other


DEV Community·Gary Jackson·about 1 month ago




What You'll Build

The attention mechanism: the only place in a transformer where a token at position t gets to look at tokens at positions 0..t-1. This is specifically self-attention, where the token attends to other tokens in the same sequence. (You might encounter "cross-attention" in other materials, which is used in encoder-decoder models where tokens attend to a different sequence. We don't use cross-attention here.)

Depends On

Chapters 1-2, 5 (Value, Helpers).

The Core Idea

Until now, each token has been processed independently. The token at position 3 has no idea what's at positions 0, 1, or 2. Attention fixes this by letting each token ask: "what earlier tokens are relevant to me?"

Because each token can only look backward (at positions before it, never ahead), this is called causal attention. The past can influence the future, but not the other way around.…
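
To make the "only look backward" rule concrete, here is a minimal sketch of causal self-attention. It makes several assumptions that the excerpt above does not confirm. It uses the standard scaled dot-product formulation (scores, softmax, weighted sum). It uses plain Python floats rather than the Value class from the earlier chapters, and Python rather than the language the article is tagged with (csharp). The names causal_attention, softmax, and dot are hypothetical. It also assumes the common convention that a token may attend to itself as well as to earlier positions; without that, position 0 would have nothing to attend to.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head causal attention over one sequence.

    queries, keys, values: lists holding one vector per position.
    Returns (outputs, weights); weights[t][s] is how much position t
    attends to position s, and is 0.0 for every s > t.
    """
    n = len(queries)
    d = len(queries[0])
    outputs, all_weights = [], []
    for t in range(n):
        # Assumption: position t scores positions 0..t (itself included).
        scores = [dot(queries[t], keys[s]) / math.sqrt(d) for s in range(t + 1)]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so every row has length n (future positions masked).
        all_weights.append(weights + [0.0] * (n - t - 1))
    return outputs, all_weights


# Toy usage: reuse the same vectors as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = causal_attention(x, x, x)
for row in w:
    print([round(v, 3) for v in row])
```

The first row of printed weights is [1.0, 0.0, 0.0]: the first token can see only itself. Changing the last vector in x leaves the outputs for the first two positions unchanged, which is the causal property the text describes. A real model would first project each token into separate query, key, and value vectors; that step is omitted here.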
