Unmasking the Secrets: Effective Attention Control with src_mask and src_key_padding_mask
Both masks are used during the attention mechanism within the transformer model to prevent the model from focusing on irrelevant parts of the input sequence (src). However, they address different scenarios:
Key Distinction:
The key difference lies in the level of granularity:
- src_mask allows for fine-grained control by masking specific positions in the sequence, even if they are not padding tokens.
- src_key_padding_mask is a more coarse-grained approach that simply masks out entire padding tokens.
When to Use Which:
- Use src_mask when you need to selectively prevent attention between specific positions in the sequence, independent of padding.
- Use src_key_padding_mask when you have variable-length sequences with padding tokens that you want the model to disregard during attention calculations.
Example Code:
Here's an illustrative example (assuming PyTorch syntax):
import torch
# Sample batch with padding tokens (represented by -1); shape (N=2, S=5)
src = torch.tensor([[1, 2, 3, 4, -1], [5, 6, 7, -1, -1]])
seq_len = src.size(1)
# src_mask example: boolean (S, S) matrix; True means the query (row) may NOT attend to the key (column)
src_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
src_mask[0, 2] = True  # position 1 (index 0) cannot attend to position 3 (index 2)
src_mask[1, 3] = True  # position 2 (index 1) cannot attend to position 4 (index 3)
# src_key_padding_mask example: boolean (N, S) matrix; True marks padding tokens
src_key_padding_mask = src == -1
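To see what the two masks do once they reach the attention layer, here is a minimal sketch. It assumes nn.MultiheadAttention as a stand-in for the attention inside a transformer encoder layer, where src_mask is passed on as attn_mask and src_key_padding_mask as key_padding_mask. The embedding size, head count, and random inputs are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Token IDs with -1 as padding; shape (N=2, S=5)
src = torch.tensor([[1, 2, 3, 4, -1], [5, 6, 7, -1, -1]])
batch_size, seq_len = src.shape
embed_dim, num_heads = 8, 2  # hypothetical sizes, chosen only for this sketch

# Random stand-in for the embedded tokens; shape (N, S, E)
x = torch.randn(batch_size, seq_len, embed_dim)

# (S, S) boolean mask: True blocks attention from query row to key column
src_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
src_mask[0, 2] = True
src_mask[1, 3] = True

# (N, S) boolean mask: True marks padding keys
src_key_padding_mask = src == -1

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
_, weights = attn(
    x, x, x,
    attn_mask=src_mask,
    key_padding_mask=src_key_padding_mask,
)

# weights has shape (N, S, S): attention from each query to each key,
# averaged over heads. Blocked entries receive a weight of exactly zero.
print(weights[0])
```

In the printed matrix, entries [0, 2] and [1, 3] are zero because of src_mask, and the last column is zero because of src_key_padding_mask.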
In summary:
- src_mask provides flexibility for custom masking, while src_key_padding_mask handles padding.
- Choose the right mask based on your specific requirements for attention control within the transformer model.
The following snippet creates a sample padded batch and derives both masks from it:
import torch
# Sample batch with padding tokens (represented by -1); shape (N=2, S=5)
src = torch.tensor([[1, 2, 3, 4, -1], [5, 6, 7, -1, -1]])
num_heads = 2  # assumed number of attention heads; must match the model
# src_key_padding_mask: True marks padding tokens; shape (N, S)
src_key_padding_mask = src == -1
# Generate src_mask from padding (optional; passing src_key_padding_mask alone is usually enough)
# A boolean True already means "do not attend", so the padding mask is NOT inverted
src_mask = src_key_padding_mask.unsqueeze(1).expand(-1, src.size(1), -1)  # (N, S, S)
src_mask = src_mask.repeat_interleave(num_heads, dim=0)  # (N * num_heads, S, S)
Explanation:
- We define a sample batch src with padding tokens represented by -1.
- src_key_padding_mask is a boolean tensor of shape (N, S) in which True indicates padding positions.
- Optionally, src_mask can be generated from src_key_padding_mask. PyTorch treats True in a boolean mask as "do not attend", so the padding mask is reused as-is rather than inverted.
- Finally, because padding differs per sample, src_mask is expanded using unsqueeze and expand to (N, S, S) and then repeated once per head to the 3D shape (N * num_heads, S, S) that the attention layer accepts. A plain 2D (S, S) mask is shared by the whole batch and cannot encode per-sample padding.
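As a sanity check on folding padding into src_mask, the sketch below compares the two routes on the same input. It again assumes nn.MultiheadAttention as a stand-in for the encoder's attention, with hypothetical sizes and random inputs. Note that no inversion is applied, since True means "do not attend" in a PyTorch boolean mask.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src = torch.tensor([[1, 2, 3, 4, -1], [5, 6, 7, -1, -1]])
batch_size, seq_len = src.shape
embed_dim, num_heads = 8, 2  # hypothetical sizes, chosen only for this sketch

x = torch.randn(batch_size, seq_len, embed_dim)  # stand-in for embedded tokens
padding = src == -1  # (N, S), True marks padding

# Per-sample attention mask: (N, S, S), then one copy per head -> (N * num_heads, S, S)
mask_3d = padding.unsqueeze(1).expand(-1, seq_len, -1)
mask_3d = mask_3d.repeat_interleave(num_heads, dim=0)

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out_from_padding, _ = attn(x, x, x, key_padding_mask=padding)
out_from_mask, _ = attn(x, x, x, attn_mask=mask_3d)

same = torch.allclose(out_from_padding, out_from_mask, atol=1e-6)
print(same)
```

Both calls block the same key positions, which is why passing src_key_padding_mask on its own is normally the simpler choice.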
Custom Masking with src_mask:
This code shows how to create a custom src_mask to prevent attention between specific positions:
import torch
# Sample batch (no padding); shape (N=2, S=4)
src = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
seq_len = src.size(1)
# src_mask: boolean (S, S) matrix, all False initially (attention allowed everywhere)
src_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
src_mask[0, 2] = True  # position 1 (index 0) cannot attend to position 3 (index 2)
src_mask[1, 3] = True  # position 2 (index 1) cannot attend to position 4 (index 3)
# src_key_padding_mask (not needed here, as there's no padding)
src_key_padding_mask = None
- src is a sample batch without padding.
- src_mask is a 2D boolean tensor of shape (S, S), initially all False so that all attention is allowed.
- We set specific elements to True to prevent attention from position 1 to position 3 and from position 2 to position 4. The mask is directional: entry [i, j] controls whether query i may attend to key j.
- src_key_padding_mask is not needed in this case since there's no padding.
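One detail worth illustrating is the mask dtype. A boolean mask blocks positions marked True, whereas a float mask is added to the attention scores, so blocking requires -inf rather than 0. The sketch below, with hypothetical sizes and nn.MultiheadAttention standing in for the encoder's attention, compares the two forms and shows that a float mask of ones and zeros blocks nothing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, embed_dim, num_heads = 4, 8, 2  # hypothetical sizes
x = torch.randn(2, seq_len, embed_dim)   # stand-in for embedded tokens

# Boolean form: True blocks attention from query row to key column
bool_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
bool_mask[0, 2] = True
bool_mask[1, 3] = True

# Float form: 0.0 leaves a score unchanged, -inf removes it after softmax
float_mask = torch.zeros(seq_len, seq_len)
float_mask[bool_mask] = float("-inf")

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out_bool, w_bool = attn(x, x, x, attn_mask=bool_mask)
out_float, w_float = attn(x, x, x, attn_mask=float_mask)

# A float mask of ones and zeros only shifts the scores; nothing is blocked
ones_zeros = torch.ones(seq_len, seq_len)
ones_zeros[0, 2] = 0.0
ones_zeros[1, 3] = 0.0
_, w_shift = attn(x, x, x, attn_mask=ones_zeros)

print(w_bool[0, 0, 2].item(), w_shift[0, 0, 2].item())
```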
- Causal Masking (Subsequent Mask):
  - This technique is particularly useful in tasks like machine translation, where the model shouldn't attend to future words in the target sequence while generating the current word.
  - Instead of blocking hand-picked positions, you pass a causal pattern as src_mask: a lower-triangular matrix in which each position may attend to itself and to earlier positions, while everything above the diagonal is blocked. This ensures the model only attends to the current and previous tokens in the sequence.
  - However, a fixed causal pattern doesn't offer the same level of flexibility as a custom src_mask that blocks arbitrary pairs of positions.
- Positional Encoding with Distance Embeddings:
  - This method involves modifying the positional encodings used in the transformer to incorporate distance information between tokens.
  - You can add distance embeddings to the positional encodings, making the model attend less to tokens that are further away in the sequence.
  - This approach requires careful design of the distance embeddings and might not be as efficient for very long sequences compared to masking.
- Learned Attention Masks:
  - In some cases, you might want the model to learn the attention mask dynamically during training.
  - This can be achieved by introducing an additional network that predicts the attention weights and acts as an attention gate.
  - While this approach can be powerful, it adds complexity to the model and requires more training data for the network to learn effective masks.
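Of the alternatives above, causal masking is the simplest to sketch. The example below builds the mask by hand in both boolean and float form and applies it with nn.MultiheadAttention, which stands in for the attention inside a transformer layer. The sizes and random inputs are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, embed_dim, num_heads = 5, 8, 2  # hypothetical sizes
x = torch.randn(1, seq_len, embed_dim)   # stand-in for embedded tokens

# Boolean causal mask: True strictly above the diagonal blocks future positions
causal_bool = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()

# Equivalent float form: 0.0 where attention is allowed, -inf where it is blocked
causal_float = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
_, weights = attn(x, x, x, attn_mask=causal_bool)

# Each row i has non-zero weights only for columns 0..i
print(weights[0])
```

The first query can only attend to itself, so its row is [1, 0, 0, 0, 0]; each later row spreads its weight over the current and earlier positions.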
Choosing the Right Method:
The best method for you depends on:
- The task you're working on (e.g., machine translation vs. question answering)
- The level of control you need over attention
- The trade-off between flexibility and efficiency
General Recommendations:
- For most scenarios, src_mask and src_key_padding_mask are a good starting point due to their simplicity and effectiveness.
- If you need to prevent attention to future tokens (e.g., in machine translation), consider using causal masking.
- Explore learned attention masks or distance embeddings if you require very fine-grained control over attention or want the model to learn attention patterns automatically, but be aware of the increased complexity.