Python Pandas: Selectively Remove DataFrame Columns by Name Pattern
Import pandas library:
import pandas as pd
Create a sample DataFrame:
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['apple', 'banana', 'carrot', 'date'], 'C': ['apple pie', 'banana bread', 'carrot cake', 'date pudding']})
Specify the string to remove:
Define the string you want to filter out from column names. For instance, you might want to remove columns whose names contain "apple".
string_to_remove = 'apple'
Filter for columns to keep:
We'll use boolean indexing to select columns that don't contain the specified string in their names. Here's how to achieve this:
def filter_columns_by_name(df, string_to_remove):
    return df.loc[:, ~df.columns.str.contains(string_to_remove)]
- df.columns: This selects the column names of the DataFrame.
- .str.contains(string_to_remove): This checks whether each column name contains string_to_remove. It returns a boolean array with True for matches and False otherwise.
- ~: The tilde (~) acts as a negation operator. It inverts the boolean array, resulting in True for columns that don't contain the string and False for those that do.
- .loc[:, mask]: This indexing selects all rows (:) and only the columns where the corresponding value in the boolean array is True.
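To make the steps above concrete, here is a minimal sketch that evaluates each piece separately. The DataFrame and the column name apple_count are hypothetical, chosen so that one column name actually contains the search string.

```python
import pandas as pd

# Hypothetical frame: exactly one column NAME contains "apple"
df = pd.DataFrame({"A": [1, 2], "apple_count": [3, 4], "C": ["x", "y"]})

names = df.columns                   # the column labels
mask = names.str.contains("apple")   # True where the name contains "apple"
keep = ~mask                         # inverted: True where it does not
result = df.loc[:, keep]             # all rows, only the columns to keep

print(mask.tolist())         # [False, True, False]
print(keep.tolist())         # [True, False, True]
print(list(result.columns))  # ['A', 'C']
```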
Apply filtering and create a new DataFrame (optional):
- Call the function with your DataFrame and the string to remove.
- This creates a new DataFrame containing only the columns that don't have the specified string in their names.
result = filter_columns_by_name(df.copy(), string_to_remove)
print(result)
Explanation of the output:
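The print(result) statement displays a new DataFrame without any columns whose names contain the specified string. In the sample DataFrame above, none of the column names (A, B, C) contain "apple"; the string appears only in the cell values. So when string_to_remove is "apple", every column is kept and the output is: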
   A       B             C
0  1   apple     apple pie
1  2  banana  banana bread
2  3  carrot   carrot cake
3  4    date  date pudding
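One detail of the name check is worth a quick illustration: str.contains treats its pattern as a regular expression by default, so a search string containing characters such as "." can match more names than intended. The sketch below uses hypothetical column names and passes regex=False to force a literal match.

```python
import pandas as pd

# Hypothetical column names chosen to show the difference
df = pd.DataFrame({"a.b": [1], "axb": [2], "c": [3]})
string_to_remove = "a.b"

# Default (regex=True): "." matches any character, so "axb" is dropped too
regex_result = df.loc[:, ~df.columns.str.contains(string_to_remove)]

# regex=False: the string is matched literally, so only "a.b" is dropped
literal_result = df.loc[:, ~df.columns.str.contains(string_to_remove, regex=False)]

print(list(regex_result.columns))    # ['c']
print(list(literal_result.columns))  # ['axb', 'c']
```

If the search string comes from user input, the literal form is usually the safer default.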
Important Note:
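- The call above passes df.copy() to filter_columns_by_name; the function itself does not copy anything. The copy is optional here, because selecting columns with .loc and a boolean mask already returns a new DataFrame object and does not remove any columns from the original. An explicit copy matters only when you go on to modify the DataFrame in place.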
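A quick way to check how the original is affected is to call the function without df.copy() and inspect both objects afterwards. This is a small sketch with a hypothetical two-column frame, not part of the original example.

```python
import pandas as pd

def filter_columns_by_name(df, string_to_remove):
    # Boolean column selection returns a new DataFrame object
    return df.loc[:, ~df.columns.str.contains(string_to_remove)]

# Hypothetical frame with one matching column name
df = pd.DataFrame({"A": [1, 2], "apple_count": [3, 4]})
result = filter_columns_by_name(df, "apple")

print(list(df.columns))      # ['A', 'apple_count']
print(list(result.columns))  # ['A']
print(result is df)          # False
```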
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 3, 4], 'Name_with_apple': ['apple', 'banana', 'carrot', 'date'], 'C': ['apple pie', 'banana bread', 'carrot cake', 'date pudding']}
df = pd.DataFrame(data)
# String to remove from column names
string_to_remove = 'apple'

# Function to filter columns (returns a new DataFrame)
def filter_columns_by_name(df, string_to_remove):
    return df.loc[:, ~df.columns.str.contains(string_to_remove)]

# Creating a new DataFrame (keeps original intact).
# Done first, while df still has the column to remove.
new_df = filter_columns_by_name(df, string_to_remove)
print("New DataFrame (copy):")
print(new_df)

# In-place modification (modifies the original DataFrame)
df.drop(columns=df.filter(like=string_to_remove).columns, inplace=True)
print("\nModified DataFrame (in-place):")
print(df)
This code demonstrates both approaches:
- In-place modification: It uses df.drop with a filter to directly remove columns containing the string from the original DataFrame (df).
- Creating a copy: The filter_columns_by_name function is used to create a new DataFrame (new_df) that excludes the unwanted columns, leaving the original DataFrame (df) untouched.
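The two approaches can also be combined: df.drop called without inplace=True returns a new DataFrame and leaves the original alone. A minimal sketch, reusing the column names from the example above with made-up values:

```python
import pandas as pd

# Same column names as the example above; the values are placeholders
df = pd.DataFrame({"A": [1, 2], "Name_with_apple": ["x", "y"], "C": [3, 4]})
string_to_remove = "apple"

# Labels of the columns whose names contain the string
cols_to_drop = df.filter(like=string_to_remove).columns

# Without inplace=True, drop() returns a new DataFrame
trimmed = df.drop(columns=cols_to_drop)

print(list(trimmed.columns))  # ['A', 'C']
print(list(df.columns))       # ['A', 'Name_with_apple', 'C']
```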
List comprehension with df.drop:
This approach uses a list comprehension to create a list of columns to keep and then drops the remaining ones.
import pandas as pd
# ... (your DataFrame creation code)
string_to_remove = 'apple'
# Create a list of columns to keep using list comprehension
cols_to_keep = [col for col in df.columns if string_to_remove not in col]
# Drop columns not in the list (modifies original DataFrame)
df.drop(df.columns.difference(cols_to_keep), axis=1, inplace=True)
print(df)
Explanation:
- The list comprehension iterates through the DataFrame's columns (df.columns).
- It checks if the string_to_remove is not present in the current column name using not in.
- If the string is not present, the column name is added to the cols_to_keep list.
- Finally, df.drop removes all columns except those in the cols_to_keep list, effectively dropping columns with the specified string.
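If a new DataFrame is acceptable, the same cols_to_keep list can be used to select columns directly, which skips the drop step. The sketch below assumes every column name is a string (the in test raises TypeError on non-string names) and uses a hypothetical frame.

```python
import pandas as pd

# Hypothetical frame; all column names are strings
df = pd.DataFrame({"A": [1], "apple_x": [2], "B": [3]})
string_to_remove = "apple"

# Plain substring test: no regular expressions involved
cols_to_keep = [col for col in df.columns if string_to_remove not in col]

# Selecting with a list of labels returns a new DataFrame
kept = df[cols_to_keep]

print(cols_to_keep)        # ['A', 'B']
print(list(kept.columns))  # ['A', 'B']
print(list(df.columns))    # ['A', 'apple_x', 'B']
```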
df.reindex:
This method uses df.reindex to select only the desired columns, using the column labels picked out by a boolean mask.
import pandas as pd
# ... (your DataFrame creation code)
string_to_remove = 'apple'
# Create a boolean mask indicating columns to keep
keep_cols = ~df.columns.str.contains(string_to_remove)
# reindex expects column labels, not booleans, so apply the mask to df.columns first.
# reindex returns a new DataFrame, which is assigned back to df.
df = df.reindex(columns=df.columns[keep_cols])
print(df)
- As in the first approach, ~df.columns.str.contains(string_to_remove) creates a boolean mask where True indicates columns to keep.
- df.reindex(columns=df.columns[keep_cols]) reindexes the DataFrame to the column labels selected by the mask. This effectively removes the columns where the mask is False (those whose names contain the string).
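Because reindex works with column labels, the boolean mask has to be turned into labels before it is passed in. A self-contained sketch of that step, using a hypothetical frame and the helper name labels_to_keep (not from the original):

```python
import pandas as pd

# Hypothetical frame with one matching column name
df = pd.DataFrame({"A": [1, 2], "apple_x": [3, 4], "C": [5, 6]})

# Boolean mask: True for columns whose names do NOT contain "apple"
keep_cols = ~df.columns.str.contains("apple")

# Convert the mask to actual labels, then reindex on those labels
labels_to_keep = df.columns[keep_cols]
result = df.reindex(columns=labels_to_keep)

print(list(labels_to_keep))  # ['A', 'C']
print(list(result.columns))  # ['A', 'C']
```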
Both methods achieve the same result as the previous approaches, offering alternative ways to filter and drop columns based on string presence. Remember to choose the method that best suits your coding style and readability preference.
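As a sanity check, the sketch below runs the approaches side by side on one hypothetical frame and compares the results with DataFrame.equals. It uses non-in-place variants throughout so that every approach starts from the same input.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes and one matching column name
df = pd.DataFrame({"A": [1, 2], "apple_x": ["p", "q"], "C": [3.0, 4.0]})
s = "apple"

mask = ~df.columns.str.contains(s, regex=False)

by_loc = df.loc[:, mask]                               # boolean indexing
by_drop = df.drop(columns=df.filter(like=s).columns)   # drop with filter
by_list = df[[c for c in df.columns if s not in c]]    # list comprehension
by_reindex = df.reindex(columns=df.columns[mask])      # reindex on labels

all_equal = (
    by_loc.equals(by_drop)
    and by_loc.equals(by_list)
    and by_loc.equals(by_reindex)
)
print(all_equal)  # True
```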