Unlocking Similarities: Computing Cosine Similarity Between Matrices in PyTorch

2024-07-27

Cosine similarity is a metric that measures the directional similarity between two vectors. It is the cosine of the angle between them: 1 means the vectors point in the same direction, 0 means they are orthogonal (perpendicular), and -1 means they point in opposite directions. Formally, cos_sim(a, b) = (a · b) / (‖a‖ ‖b‖), i.e., the dot product divided by the product of the L2 norms (a quick sanity check follows the list below). In machine learning, cosine similarity is often used for tasks like:

  • Recommendation Systems: Finding items similar to a user's past preferences.
  • Document Retrieval: Ranking documents based on their relevance to a query.
  • Image Recognition: Identifying similar images based on their feature vectors.
  • Anomaly Detection: Detecting data points that deviate significantly from the majority.
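
As a quick sanity check of the definition (a minimal sketch, separate from the walkthrough below), the three landmark values fall out directly from the formula:

import torch

def cos_sim(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return torch.dot(a, b) / (a.norm() * b.norm())

a = torch.tensor([1., 0.])
print(cos_sim(a, torch.tensor([2., 0.])))   # tensor(1.)  -> same direction
print(cos_sim(a, torch.tensor([0., 3.])))   # tensor(0.)  -> orthogonal
print(cos_sim(a, torch.tensor([-1., 0.])))  # tensor(-1.) -> opposite direction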

Computing Cosine Similarity in PyTorch

PyTorch provides a convenient way to calculate the cosine similarity between the corresponding rows of two matrices (and, with a small broadcasting trick shown below, between all row pairs). Here's the breakdown:

  1. Import Necessary Libraries:

    import torch
    
  2. Define Your Matrices:

    Create your two matrices (matrix_1 and matrix_2) using torch.tensor. Ensure they have the same number of columns (representing feature dimensions) for a meaningful similarity calculation, and use a floating-point dtype (e.g., 1. rather than 1), since the underlying norm computations require float tensors.

    matrix_1 = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
    matrix_2 = torch.tensor([[7., 8., 9.], [10., 11., 12.]])
    
  3. Calculate Cosine Similarity:

    Use the torch.nn.functional.cosine_similarity function. It takes two tensors as input (matrix_1 and matrix_2) and an optional dim argument specifying the dimension along which the dot product and norms are computed. The default, dim=1, treats each row as a vector, so the function compares corresponding rows: row 0 of matrix_1 with row 0 of matrix_2, row 1 with row 1, and so on.

    cosine_similarity = torch.nn.functional.cosine_similarity(matrix_1, matrix_2)
    

    The output (cosine_similarity) is a 1-D tensor of length num_rows: element i is the cosine similarity between row i of matrix_1 and row i of matrix_2. Note that this is a row-wise comparison, not an all-pairs one.
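
    If you instead want the full pairwise matrix of shape (num_rows_in_matrix_1, num_rows_in_matrix_2), one common idiom (a sketch, one option among several) is to unsqueeze both inputs so broadcasting expands every row of matrix_1 against every row of matrix_2:

    pairwise = torch.nn.functional.cosine_similarity(
        matrix_1.unsqueeze(1),  # shape (N, 1, D)
        matrix_2.unsqueeze(0),  # shape (1, M, D)
        dim=2,                  # reduce over the feature dimension D
    )
    # pairwise has shape (N, M); pairwise[i, j] compares row i of matrix_1 with row j of matrix_2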

Example:

import torch

matrix_1 = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
matrix_2 = torch.tensor([[7., 8., 9.], [10., 11., 12.]])

cosine_similarity = torch.nn.functional.cosine_similarity(matrix_1, matrix_2)

print(cosine_similarity)

This code will output (values rounded to four decimal places):

tensor([0.9594, 0.9961])
  • The first element (0.9594) is the cosine similarity between the first row of matrix_1 ([1., 2., 3.]) and the first row of matrix_2 ([7., 8., 9.]); the second element (0.9961) compares the second rows ([4., 5., 6.] and [10., 11., 12.]).

Key Points:

  • The plain call compares corresponding rows only; to score all row pairs at once, use the unsqueeze-based broadcasting shown earlier, which yields an (N, M) similarity matrix.
  • Ensure your matrices have the same number of columns for valid cosine similarity calculation.
  • The resulting tensor can be used for further analysis, such as finding the most similar rows or filtering based on a minimum similarity threshold.



import torch

# Example 1: Cosine similarity between corresponding rows
matrix_1 = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
matrix_2 = torch.tensor([[7., 8., 9.], [10., 11., 12.]])

cosine_similarity = torch.nn.functional.cosine_similarity(matrix_1, matrix_2)
print("Cosine Similarity (Corresponding Rows):\n", cosine_similarity)

# Build the all-pairs similarity matrix used by the remaining examples:
# pairwise[i, j] is the similarity between row i of matrix_1 and row j of matrix_2
pairwise = torch.nn.functional.cosine_similarity(
    matrix_1.unsqueeze(1), matrix_2.unsqueeze(0), dim=2
)
print("\nPairwise Cosine Similarity:\n", pairwise)

# Example 2: Selecting Rows with Highest Similarity
highest_similarity_row, _ = torch.max(pairwise, dim=1)  # max similarity per row of matrix_1
print("\nHighest Similarity Scores for Rows in Matrix 1:", highest_similarity_row)

# Example 3: Finding Most Similar Row in Matrix 2 for Each Row in Matrix 1
most_similar_indices = torch.argmax(pairwise, dim=1)  # index of the most similar row in matrix_2
print("\nIndices of Most Similar Rows in Matrix 2 for Each Row in Matrix 1:", most_similar_indices)

# Example 4: Filtering Based on Minimum Similarity Threshold
threshold = 0.8  # minimum similarity threshold
mask = (pairwise >= threshold).any(dim=1)  # rows of matrix_1 that clear the threshold for at least one row of matrix_2
filtered_rows = matrix_1[mask]
print("\nRows in Matrix 1 with Similarity >= 0.8 to Any Row in Matrix 2:\n", filtered_rows)

Explanation of Additional Examples:

  • Example 2: torch.max along dim=1 of the pairwise matrix returns, for each row in matrix_1, its highest similarity score across all rows of matrix_2.
  • Example 3: torch.argmax along dim=1 returns, for each row in matrix_1, the index of the most similar row in matrix_2.
  • Example 4: (pairwise >= threshold).any(dim=1) produces one boolean per row of matrix_1, marking rows that reach the threshold against at least one row of matrix_2; indexing matrix_1 with this mask keeps only those rows.



torch.einsum offers a concise way to perform linear algebraic operations. Here's how to use it for cosine similarity:

import torch

matrix_1 = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
matrix_2 = torch.tensor([[7., 8., 9.], [10., 11., 12.]])

cosine_similarity = torch.einsum('ik,jk->ij', matrix_1 / matrix_1.norm(dim=1, keepdim=True), 
                                 matrix_2 / matrix_2.norm(dim=1, keepdim=True))
print("Cosine Similarity (einsum):\n", cosine_similarity)

Explanation:

  • We normalize each row in both matrices by their L2 norm to obtain unit vectors.
  • torch.einsum('ik,jk->ij', ...) contracts over the shared feature dimension k, so entry (i, j) of the result is the dot product of normalized row i of matrix_1 with normalized row j of matrix_2, i.e., their cosine similarity. The result is the full (N, M) pairwise matrix.
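
As a quick consistency check (a sketch that reuses the float-valued matrices and the einsum result defined above), the einsum output should match the broadcasting-based approach:

pairwise_broadcast = torch.nn.functional.cosine_similarity(
    matrix_1.unsqueeze(1), matrix_2.unsqueeze(0), dim=2
)
print(torch.allclose(cosine_similarity, pairwise_broadcast))  # True: both are (N, M) pairwise matrices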

Using a Loop (for Small Datasets):

For small datasets, a loop-based approach can be simpler to understand:

import torch

def cosine_similarity_loop(matrix_1, matrix_2):
  cosine_similarities = []
  for row_1 in matrix_1:
    row_similarities = []
    for row_2 in matrix_2:
      # Dot product of the two rows divided by the product of their L2 norms
      row_similarities.append(torch.dot(row_1, row_2) / (row_1.norm() * row_2.norm()))
    cosine_similarities.append(torch.stack(row_similarities))
  return torch.stack(cosine_similarities)

matrix_1 = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
matrix_2 = torch.tensor([[7., 8., 9.], [10., 11., 12.]])

cosine_similarity = cosine_similarity_loop(matrix_1, matrix_2)
print("Cosine Similarity (Loop):\n", cosine_similarity)
  • We iterate through each row in matrix_1 and calculate the cosine similarity with all rows in matrix_2 using dot product and normalization.
  • This approach is less efficient for large datasets compared to vectorized methods.

Choosing the Right Method:

  • torch.nn.functional.cosine_similarity: The recommended approach for most cases due to its efficiency and ease of use; combine it with unsqueeze broadcasting when you need all row pairs.
  • torch.einsum: A concise alternative that makes the normalization and contraction explicit.
  • Loop-based approach: Useful for understanding the computation, but avoid it on large datasets because of its poor performance.
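
One more vectorized option worth knowing (a short sketch, not covered above): L2-normalize the rows with torch.nn.functional.normalize and take a plain matrix product, which yields the same (N, M) pairwise matrix as the einsum version:

import torch
import torch.nn.functional as F

matrix_1 = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
matrix_2 = torch.tensor([[7., 8., 9.], [10., 11., 12.]])

# Normalize each row to unit length, then matrix-multiply: (N, D) @ (D, M) -> (N, M)
pairwise = F.normalize(matrix_1, dim=1) @ F.normalize(matrix_2, dim=1).T
print(pairwise)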

machine-learning neural-network pytorch


