In the previous article, we began introducing the concept of encoder-decoder attention. Now let's start digging into the details.

Encoder–Decoder Attention in Action

Just like in self-attention, we start by creating query values. In this case, we create two values to represent the query for the <EOS> token in the decoder.

Next, we create key values for each word in the encoder output.

Calculating Similarity

Now, we calculate the similarity between the <EOS> token in the decoder and each word in the encoder. This is done using the dot product: we take the dot product of the <EOS> query with the key of each input word, which gives one similarity score per word.

Applying Softmax

We then pass these similarity scores through a softmax function. This gives us weights that determine how much attention the decoder should pay to each input word.

In this example:

- The first input word gets 100% attention
- The second word gets 0% attention

This means the decoder will focus entirely on the first input word when deciding the first translated word.

What’s Next?…
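
To make the walkthrough above concrete, here is a minimal sketch of the three steps (query and keys, dot-product similarity, softmax) in plain Python. The numbers in `query_eos` and `encoder_keys` are hypothetical values picked only for illustration, and the helper names `dot` and `softmax` are our own. Like the text, the sketch leaves out any scaling of the scores.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical numbers, chosen only for illustration.
# Two values represent the query for the <EOS> token in the decoder.
query_eos = [2.0, 3.0]

# One key (two values each) per word in the encoder output.
encoder_keys = [
    [4.0, 5.0],    # key for the first input word
    [-3.0, -2.0],  # key for the second input word
]

# Similarity between the <EOS> query and each encoder key.
similarities = [dot(query_eos, key) for key in encoder_keys]

# Softmax converts the similarities into attention weights.
attention_weights = softmax(similarities)

for i, weight in enumerate(attention_weights, start=1):
    print(f"input word {i}: {weight:.2%} attention")
```

With these made-up values the two similarity scores are far apart, so softmax pushes almost all of the weight onto the first input word, which mirrors the 100% and 0% split in the example.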