Picking Your Way Through Data: A Guide to gather in PyTorch
Here's how it works (shown for a 2D tensor with dim=1):
- Input: gather takes the original spreadsheet (the input tensor), the dimension dim to pick along, and an index tensor of integer positions. The index tensor must have the same number of dimensions as the input.
- Picking Values: gather uses the index tensor to navigate the spreadsheet.
  - For each row in the index tensor, it goes to the corresponding row in the original spreadsheet.
  - Within that row, it picks the value at the column specified by each number in the index tensor.
- Creating a New Spreadsheet: gather doesn't modify the original spreadsheet. Instead, it creates a brand new spreadsheet (another tensor) with the picked values. The new spreadsheet has exactly the same shape as the index tensor, so its size depends on the index instructions rather than on the original.
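The picking rule above can be written out as explicit loops. The following is a minimal sketch, assuming a small made-up 3x3 table (the names table, index, and manual are illustrative only), that checks gather with dim=1 against the rule out[i][j] = table[i][index[i][j]]:

```python
import torch

# Hypothetical 3x3 "spreadsheet" used only for illustration.
table = torch.tensor([[10, 11, 12],
                      [20, 21, 22],
                      [30, 31, 32]])

# One row of column numbers per row of the table (same number of dimensions).
index = torch.tensor([[2, 0],
                      [1, 1],
                      [0, 2]])

picked = torch.gather(table, dim=1, index=index)

# The same selection as explicit loops: out[i][j] = table[i][index[i][j]]
manual = torch.empty_like(index)
for i in range(index.size(0)):
    for j in range(index.size(1)):
        manual[i, j] = table[i, index[i, j]]

print(picked)                       # has the same shape as index
print(torch.equal(picked, manual))
```

The loops are only there to make the rule visible; in practice the single gather call does the same work without Python-level iteration.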
Example:
Suppose you have a spreadsheet of students and their scores in Math and English. The tensor stores only the numbers; each row corresponds to one student:
+---------+------+---------+
| Name    | Math | English |
+---------+------+---------+
| Alice   | 85   | 90      |
| Bob     | 78   | 82      |
| Charlie | 92   | 88      |
+---------+------+---------+
You want to create a new spreadsheet that shows only the Math scores for Alice and Charlie.
Here's what you would do:
- Gathering: Math is column 0, and Alice and Charlie are rows 0 and 2. Because you are choosing rows, gather along dim=0 with a one-column index tensor that holds those row numbers:
import torch
scores = torch.tensor([[85, 90], [78, 82], [92, 88]])
index = torch.tensor([[0], [2]])  # Rows for Alice and Charlie, in column 0 (Math)
math_scores = torch.gather(scores, dim=0, index=index)
print(math_scores)
This code will print:
tensor([[85],
        [92]])
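Key Points:
- The dim argument in gather specifies which dimension the index values refer to: with dim=0 each index value is a row number, and with dim=1 each index value is a column number.
- The index tensor must have the same number of dimensions as the input and an integer (int64) dtype.
- index values must be within the valid range of the chosen dimension (0 to the size of that dimension minus 1).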
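To make the role of dim concrete, here is a small sketch of the dim=1 case on the same scores: one column number per student. The subject tensor is a made-up choice for illustration, not part of the original example.

```python
import torch

scores = torch.tensor([[85, 90], [78, 82], [92, 88]])  # columns: Math, English

# Hypothetical choice for illustration: one subject (column number) per student.
subject = torch.tensor([[1], [0], [0]])  # Alice: English, Bob: Math, Charlie: Math

# With dim=1, each index value is a column number within its own row.
chosen = torch.gather(scores, dim=1, index=subject)
print(chosen.squeeze(1))  # tensor([90, 78, 92])
```

Every value in subject must be 0 or 1 here, because the tensor has only two columns.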
Example 1: Gathering Elements from a 1D Tensor
import torch
# Create a 1D tensor
data = torch.tensor([3, 6, 1, 8, 2])
# Indices to pick elements (0-based indexing)
indices = torch.tensor([2, 4, 1])
# Gather elements based on indices
picked_elements = torch.gather(data, dim=0, index=indices)
print("Original tensor:", data)
print("Indices:", indices)
print("Picked elements:", picked_elements)
This code will print:
Original tensor: tensor([3, 6, 1, 8, 2])
Indices: tensor([2, 4, 1])
Picked elements: tensor([1, 2, 6])
As you can see, gather picks the elements at indices 2, 4, and 1 (which are 1, 2, and 6) from the original tensor.
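Example 2: Gathering Rows from a 2D Tensor
import torch
# Create a 2D tensor
matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
# Rows to pick (0-based indexing)
row_indices = torch.tensor([1, 0])
# gather needs an index with the same number of dimensions as the input,
# so repeat each row number across all columns: [[1, 1, 1], [0, 0, 0]]
index = row_indices.unsqueeze(1).expand(-1, matrix.size(1))
# Gather rows based on the expanded index
picked_rows = torch.gather(matrix, dim=0, index=index)
print("Original matrix:", matrix)
print("Row indices:", row_indices)
print("Picked rows:", picked_rows)
This code will print:
Original matrix: tensor([[1, 4, 7],
        [2, 5, 8],
        [3, 6, 9]])
Row indices: tensor([1, 0])
Picked rows: tensor([[2, 5, 8],
        [1, 4, 7]])
Here, gather picks rows 1 and 0 (the second and first rows) from the original matrix. Because gather works element by element, a 1D tensor of row numbers is not accepted for a 2D input; the index has to name a row for every output position.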
Example 3: Gathering Columns with Out-of-Bounds Handling
import torch
# Create a 2D tensor
matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
# One column index per row; index 4 is out of bounds (valid columns are 0 to 2)
col_indices = torch.tensor([[2], [1], [4]])
print("Original matrix:", matrix)
print("Column indices:", col_indices)
print("Handling out-of-bounds (default):")
try:
    # On CPU, gather raises an error for an out-of-bounds index
    picked_cols = torch.gather(matrix, dim=1, index=col_indices)
except RuntimeError as e:
    print("Error:", e)
    print("Out-of-bounds indices are not allowed.")
print("Handling out-of-bounds (safe):")
# gather has no option that tolerates invalid indices, so validate them first:
# record which indices are valid, clamp the rest into range, then mask the result
num_cols = matrix.size(1)
valid = (col_indices >= 0) & (col_indices < num_cols)
safe_indices = col_indices.clamp(0, num_cols - 1)
picked_cols_safe = torch.gather(matrix, dim=1, index=safe_indices)
# -1 is an arbitrary placeholder for "missing"
picked_cols_safe = picked_cols_safe.masked_fill(~valid, -1)
print(picked_cols_safe)
This code demonstrates how to handle out-of-bounds indices. The first attempt raises a RuntimeError because index contains an invalid value (4) that's beyond the matrix's column range (0 to 2).
The second attempt shows a safe way to handle such values. gather itself has no mode that ignores invalid indices, and its out argument only names the tensor that receives the result, so the indices have to be checked before the call. A boolean mask (valid) records which indices are in range, clamp moves the invalid ones to a legal column so that gather succeeds, and masked_fill then overwrites those positions of picked_cols_safe with the placeholder -1. The result is tensor([[7], [5], [-1]]).
Manual Looping (for Simple Cases):
For small datasets or simple operations, you can iterate through the data manually using a loop and conditional statements to pick the desired elements. This approach is much slower for large tensors, because every element is fetched from Python one at a time, but it gives you full control over the selection process.
Here's an example for a 1D tensor:
import torch
data = torch.tensor([3, 6, 1, 8, 2])
indices = torch.tensor([2, 4, 1])
picked_elements = []
for i in range(len(indices)):
    index = indices[i].item()  # Convert tensor element to Python int
    picked_elements.append(data[index].item())  # Store the value as a Python int
print("Picked elements:", torch.tensor(picked_elements))
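As a quick sanity check, the loop can be compared with the vectorized alternatives. This is a minimal sketch on the same data, assuming 1D inputs; the names looped, gathered, and indexed are illustrative only:

```python
import torch

data = torch.tensor([3, 6, 1, 8, 2])
indices = torch.tensor([2, 4, 1])

# Manual loop, written as a list comprehension.
looped = torch.tensor([data[i].item() for i in indices.tolist()])

# Vectorized equivalents for the 1D case.
gathered = torch.gather(data, dim=0, index=indices)
indexed = data[indices]  # indexing with an integer tensor

print(torch.equal(looped, gathered), torch.equal(gathered, indexed))  # True True
```

For a 1D tensor all three approaches agree; the differences between them only show up with more dimensions.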
torch.index_select (for Selecting Rows/Columns):
If you specifically need to select entire rows or columns based on indices, torch.index_select can be a good alternative. It takes a 1D tensor of row or column numbers and is generally more efficient than manual looping for larger datasets.
import torch
matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
row_indices = torch.tensor([1, 0])
picked_rows = torch.index_select(matrix, dim=0, index=row_indices)
print("Picked rows:", picked_rows)
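The difference between the two functions is easiest to see side by side. In this sketch (the index values are made up for illustration), index_select takes the same columns from every row, while gather can take a different column from each row:

```python
import torch

matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])

# index_select: whole columns 2 and 0, identical for every row.
cols = torch.index_select(matrix, dim=1, index=torch.tensor([2, 0]))
print(cols)     # tensor([[7, 1], [8, 2], [9, 3]])

# gather: a separate column number for each row.
per_row = torch.gather(matrix, dim=1, index=torch.tensor([[2], [0], [1]]))
print(per_row)  # tensor([[7], [2], [6]])
```

If every row needs the same selection, index_select is the simpler call; if the selection varies per row, gather is the one that can express it.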
NumPy Integration (if Applicable):
If you're already using NumPy for data manipulation, you can leverage its take_along_axis function for similar functionality. However, ensure proper conversion between NumPy arrays and PyTorch tensors when necessary.
import torch
import numpy as np
data = torch.tensor([3, 6, 1, 8, 2])
indices = torch.tensor([2, 4, 1])
# Convert tensors to NumPy arrays; for CPU tensors .numpy() shares memory
# instead of copying, and GPU tensors must be moved with .cpu() first
data_np = data.numpy()
indices_np = indices.numpy()
picked_elements = torch.from_numpy(np.take_along_axis(data_np, indices_np, axis=0))
print("Picked elements:", picked_elements)
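The correspondence also holds in two dimensions. The sketch below assumes a PyTorch release recent enough to include torch.take_along_dim, which mirrors NumPy's function without leaving PyTorch, and compares all three calls on a made-up per-row index:

```python
import numpy as np
import torch

matrix = torch.tensor([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
index = torch.tensor([[2], [0], [1]])  # one column number per row

# NumPy round trip: tensor -> array -> take_along_axis -> tensor
via_numpy = torch.from_numpy(
    np.take_along_axis(matrix.numpy(), index.numpy(), axis=1)
)
via_gather = torch.gather(matrix, dim=1, index=index)
via_torch = torch.take_along_dim(matrix, index, dim=1)

print(via_gather)
print(torch.equal(via_numpy, via_gather), torch.equal(via_gather, via_torch))
```

Staying in PyTorch avoids the conversion step and keeps the operation usable on GPU tensors and inside autograd.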
Choosing the Right Method:
- For simple cases or educational purposes, manual looping can be used.
- For selecting rows/columns efficiently, torch.index_select is a good choice.
- If you're already using NumPy, take_along_axis might be convenient.
- For most general-purpose data selection, gather remains the recommended approach due to its flexibility: it can pick a different position for every output element and runs as a single vectorized operation. It does not tolerate out-of-bounds indices, so validate or clamp them beforehand.