Picking Your Way Through Data: A Guide to gather in PyTorch

2024-07-27

Here's how it works:

Input:
- gather takes an input tensor (think of it as a spreadsheet), a dimension dim, and an index tensor of integer positions.
- The index tensor must have the same number of dimensions as the input tensor.
Picking Values:
- gather uses the index tensor to navigate the spreadsheet.
- With dim=1, for each row in the index tensor, it goes to the corresponding row in the original spreadsheet.
- Within that row, it picks the value at the column specified by the number in the index tensor. (With dim=0 the roles swap: the number names a row, and the column is fixed by position.)
Creating a New Spreadsheet:
- gather doesn't modify the original spreadsheet. Instead, it creates a brand new spreadsheet (another tensor) with the picked values.
- The new spreadsheet has exactly the same shape as the index tensor, so it can have fewer rows or columns than the original one.
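
The row-by-row picking described above can be written out as a plain loop. This is a minimal sketch for a 2D tensor with dim=1 only; the helper name gather_dim1_loop and the sample values are made up for illustration and are not part of PyTorch.

```python
import torch

def gather_dim1_loop(source, index):
    """Illustrative reference loop: out[i][j] = source[i][index[i][j]]."""
    out = torch.empty_like(index, dtype=source.dtype)
    for i in range(index.size(0)):
        for j in range(index.size(1)):
            out[i, j] = source[i, index[i, j]]
    return out

sheet = torch.tensor([[10, 11, 12],
                      [20, 21, 22],
                      [30, 31, 32]])
index = torch.tensor([[2, 0],
                      [1, 1],
                      [0, 2]])

picked = torch.gather(sheet, dim=1, index=index)
print(picked)
# tensor([[12, 10],
#         [21, 21],
#         [30, 32]])

# The loop and gather agree, and the result has the shape of index
print(torch.equal(picked, gather_dim1_loop(sheet, index)))  # True
print(picked.shape == index.shape)                          # True
```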

Example:

Suppose you have a spreadsheet (tensor) of student scores in Math and English, one row per student:

+---------+------+---------+
| Name    | Math | English |
+---------+------+---------+
| Alice   | 85   | 90      |
| Bob     | 78   | 82      |
| Charlie | 92   | 88      |
+---------+------+---------+

You want to create a new spreadsheet that shows only the Math scores for Alice and Charlie.

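Here's what you would do:

Gathering: Use the gather function along dim=0, with a 2D index whose entries are row numbers:

import torch

# Rows are Alice, Bob, Charlie; columns are Math, English
scores = torch.tensor([[85, 90], [78, 82], [92, 88]])
# index must be 2D like scores; it has one column, so only Math (column 0) is read
index = torch.tensor([[0], [2]])  # Pick rows for Alice and Charlie

# dim=0: each index value selects a row; the column comes from its position in index
math_scores = torch.gather(scores, dim=0, index=index)
print(math_scores)

This code will print:

tensor([[85],
        [92]])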

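Key Points:

- The dim argument in gather specifies the dimension along which the index values are looked up: with dim=0 each index value is a row number, with dim=1 it is a column number.
- index must have the same number of dimensions as the input tensor, and the output has the same shape as index.
- index values must be within the valid range of the chosen dimension (0 to input.size(dim) - 1).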
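
These rules can be checked before calling gather. The helper below is a hypothetical sketch, not a PyTorch function; it assumes the constraints listed in the torch.gather documentation (an int64 index, matching number of dimensions, and an index no larger than the input in every dimension other than dim).

```python
import torch

def check_gather_index(source, dim, index):
    """Hypothetical helper: True if index looks safe to pass to torch.gather."""
    if index.dtype != torch.int64:
        return False
    if index.dim() != source.dim():
        return False
    # In every dimension except dim, index may not be larger than source
    for d in range(source.dim()):
        if d != dim and index.size(d) > source.size(d):
            return False
    if index.numel() == 0:
        return True
    # Every value must name an existing position along dim
    in_range = (index >= 0) & (index < source.size(dim))
    return bool(in_range.all())

scores = torch.tensor([[85, 90], [78, 82], [92, 88]])

print(check_gather_index(scores, 0, torch.tensor([[0], [2]])))  # True
print(check_gather_index(scores, 0, torch.tensor([0, 2])))      # False: index is 1D
print(check_gather_index(scores, 1, torch.tensor([[0], [2]])))  # False: column 2 does not exist
```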

Example 1: Gathering Elements from a 1D Tensor

import torch

# Create a 1D tensor
data = torch.tensor([3, 6, 1, 8, 2])

# Indices to pick elements (0-based indexing)
indices = torch.tensor([2, 4, 1])

# Gather elements based on indices
picked_elements = torch.gather(data, dim=0, index=indices)

print("Original tensor:", data)
print("Indices:", indices)
print("Picked elements:", picked_elements)

This code will print:

Original tensor: tensor([3, 6, 1, 8, 2])
Indices: tensor([2, 4, 1])
Picked elements: tensor([1, 2, 6])

As you can see, it picks elements at indices 2, 4, and 1 (which are 1, 2, and 6) from the original tensor.

Example 2: Gathering Rows from a 2D Tensor

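import torch

# Create a 2D tensor
matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])

# Rows to pick (0-based indexing)
row_indices = torch.tensor([1, 0])

# gather needs a 2D index for a 2D input: repeat each row number across all columns
index = row_indices.unsqueeze(1).expand(-1, matrix.size(1))  # [[1, 1, 1], [0, 0, 0]]

# Gather along dim=0 so that each index value selects a row
picked_rows = torch.gather(matrix, dim=0, index=index)

print("Original matrix:", matrix)
print("Row indices:", row_indices)
print("Picked rows:", picked_rows)

This code will print:

Original matrix: tensor([[1, 4, 7],
        [2, 5, 8],
        [3, 6, 9]])
Row indices: tensor([1, 0])
Picked rows: tensor([[2, 5, 8],
        [1, 4, 7]])

Here, it picks rows 1 and 0 (the second and first rows) from the original matrix. Passing the 1D row_indices directly would raise a RuntimeError, because gather requires index to have the same number of dimensions as the input.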
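
gather is not limited to whole rows: the index tensor holds one entry per output element, so each column can draw from a different row. A small sketch reusing the same matrix (the index values are chosen only for illustration):

```python
import torch

matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])

# index[i][j] names the row that supplies output cell (i, j); the column stays j
mixed_index = torch.tensor([[1, 0, 2],
                            [2, 2, 0]])

mixed = torch.gather(matrix, dim=0, index=mixed_index)
print(mixed)
# tensor([[2, 4, 9],
#         [3, 6, 7]])
```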

Example 3: Gathering Columns with Out-of-Bounds Handling

import torch

# Create a 2D tensor
matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])

# One column index per row; index 4 is out of bounds (valid columns are 0 to 2)
col_indices = torch.tensor([[2], [1], [4]])

try:
    # gather does no bounds handling of its own, so this raises on CPU
    picked_cols = torch.gather(matrix, dim=1, index=col_indices)
except RuntimeError as e:
    print("Error:", e)
    print("Out-of-bounds indices are not allowed.")

# Workaround: clamp the indices into range, gather, then mask the invalid positions
valid = (col_indices >= 0) & (col_indices < matrix.size(1))
safe_indices = col_indices.clamp(0, matrix.size(1) - 1)
picked_cols_safe = torch.gather(matrix, dim=1, index=safe_indices)
picked_cols_safe = picked_cols_safe.masked_fill(~valid, -1)  # -1 marks a missing value

print("Original matrix:", matrix)
print("Column indices:", col_indices)
print("Handling out-of-bounds (safe):")
print(picked_cols_safe)

This code demonstrates how to handle out-of-bounds indices. The first attempt raises a RuntimeError on CPU because index contains an invalid value (4) that's beyond the matrix's column range (0 to 2).

The second attempt shows one way to work around this. gather has no option for tolerating invalid indices (its out argument only names the tensor that receives the result), so the code clamps every index into the valid range, gathers, and then overwrites the positions that were originally out of bounds with a placeholder. Here picked_cols_safe holds 7, 5, and -1. Choose a placeholder that cannot be mistaken for real data.

Manual Looping (for Simple Cases):

For small datasets or simple operations, you can iterate through the data manually using a loop and conditional statements to pick the desired elements. This approach is slower for large tensors, because every element is handled in Python, but it gives you full control over the selection process.

Here's an example for a 1D tensor:

import torch

data = torch.tensor([3, 6, 1, 8, 2])
indices = torch.tensor([2, 4, 1])

picked_elements = []
for index in indices.tolist():  # Convert the index tensor to Python ints
    picked_elements.append(data[index].item())  # Store plain Python numbers

print("Picked elements:", torch.tensor(picked_elements))

torch.index_select (for Selecting Rows/Columns):

If you specifically need to select entire rows or columns based on indices, torch.index_select can be a good alternative. It's generally more efficient than manual looping for larger datasets.

import torch

matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
row_indices = torch.tensor([1, 0])

picked_rows = torch.index_select(matrix, dim=0, index=row_indices)
print("Picked rows:", picked_rows)
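
The same call selects whole columns when dim=1. A brief sketch, with integer-array indexing shown alongside for comparison (the column choice is arbitrary):

```python
import torch

matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
col_indices = torch.tensor([2, 0])  # index_select expects a 1D index

# Select entire columns 2 and 0, in that order
picked_cols = torch.index_select(matrix, dim=1, index=col_indices)
print("Picked columns:", picked_cols)
# Picked columns: tensor([[7, 1],
#         [8, 2],
#         [9, 3]])

# Plain tensor indexing gives the same result
print(torch.equal(picked_cols, matrix[:, col_indices]))  # True
```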

NumPy Integration (if Applicable):

If you're already using NumPy for data manipulation, you can leverage its take_along_axis function for similar functionality. However, ensure proper conversion between NumPy and PyTorch tensors when necessary.

import torch
import numpy as np

data = torch.tensor([3, 6, 1, 8, 2])
indices = torch.tensor([2, 4, 1])

# Convert tensors to NumPy arrays (CPU tensors share memory with the arrays, so no copy is made)
data_np = data.numpy()
indices_np = indices.numpy()

picked_elements = torch.from_numpy(np.take_along_axis(data_np, indices_np, axis=0))
print("Picked elements:", picked_elements)
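
For 2D data, take_along_axis follows the same one-index-per-output rule as gather, with axis playing the role of dim. A short sketch that compares the two on a scores table; the index values are picked only for illustration:

```python
import torch
import numpy as np

scores = torch.tensor([[85, 90], [78, 82], [92, 88]])
# One column index per row: column 1, column 0, column 1
col_index = torch.tensor([[1], [0], [1]])

# NumPy route: take_along_axis with axis=1
via_numpy = torch.from_numpy(
    np.take_along_axis(scores.numpy(), col_index.numpy(), axis=1)
)

# PyTorch route: gather with dim=1
via_gather = torch.gather(scores, dim=1, index=col_index)

print(via_numpy)
# tensor([[90],
#         [78],
#         [88]])
print(torch.equal(via_numpy, via_gather))  # True
```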

Choosing the Right Method:

- For simple cases or educational purposes, manual looping can be used.
- For selecting entire rows or columns efficiently, torch.index_select is a good choice.
- If you're already using NumPy, take_along_axis might be convenient.
- For general element-wise selection, where each output position needs its own index, gather remains the recommended approach due to its flexibility. It does not handle out-of-bounds indices for you, so validate or clamp them first.
