Filtering Pandas DataFrames: Finding Rows That Don't Contain Specific Values
Understanding the Task:
- You have a DataFrame containing text data in one or more columns.
- You want to filter the DataFrame to keep only rows where the text in a specific column does not include a particular value (substring).
Methods for "Does-Not-Contain" Search:
Pandas doesn't have a built-in "does not contain" operator, but you can achieve this using negation with the isin()
method:
~ (Bitwise NOT) with isin():
- This is the most common approach.
- Use
~
(tilde) beforedf[column_name].isin(list_of_values)
to invert the selection. - Example:
import pandas as pd data = {'column_name': ['apple', 'banana', 'cherry', 'grapefruit']} df = pd.DataFrame(data) # Find rows where 'column_name' does not contain 'apple' filtered_df = df[~df['column_name'].isin(['apple'])] print(filtered_df)
This will output:
column_name 1 banana 2 cherry
str.contains() with Negation:
- Use
df[column_name].str.contains(pattern) == False
to directly filter for non-matches. - This might be less efficient for large DataFrames due to string operations.
filtered_df = df[df['column_name'].str.contains('apple') == False] print(filtered_df)
(This will produce the same output as the previous example.)
- Use
Choosing the Right Method:
- If performance is a concern, especially for very large DataFrames,
~
withisin()
is often preferred. However, both methods achieve the desired result.
Additional Considerations:
- Case Sensitivity: By default, both methods are case-sensitive. If you need case-insensitive matching, use
str.lower()
orstr.upper()
on the column and pattern before comparison. - Regular Expressions: For more complex matching patterns, you can use regular expressions with
str.contains()
.
I hope this explanation clarifies how to search for "does-not-contain" data in pandas DataFrames!
Example 1: Case-Insensitive Matching (~ with isin()):
import pandas as pd
data = {'column_name': ['Apple', 'Banana', 'CHERRY', 'Grapefruit']}
df = pd.DataFrame(data)
# Find rows where 'column_name' does not contain 'apple' (case-insensitive)
pattern = 'apple' # Case-insensitive pattern
filtered_df = df[~df['column_name'].str.lower().isin([pattern.lower()])].copy() # Make a copy to avoid SettingWithCopy warning
print(filtered_df)
This code ensures case-insensitive matching by converting both the column values and the pattern to lowercase before comparison using str.lower()
. The .copy()
method is used to avoid a potential SettingWithCopy warning.
Example 2: Regular Expression with str.contains():
import pandas as pd
data = {'column_name': ['apple pie', 'banana shake', 'cherry cobbler', 'grapefruit juice']}
df = pd.DataFrame(data)
# Find rows where 'column_name' does not contain a word starting with 'app'
pattern = r'^app\b' # Regular expression for word starting with 'app'
filtered_df = df[~df['column_name'].str.contains(pattern)]
print(filtered_df)
This example demonstrates using a regular expression (r'^app\b'
) with str.contains()
to find rows where the column value doesn't start with the word "app" (whole word match).
These examples illustrate different approaches to handle "does-not-contain" searches in pandas, considering case sensitivity and regular expressions when needed.
Boolean Indexing with np.where:
This method uses NumPy's np.where
function to create a boolean mask and then filter the DataFrame:
import pandas as pd
import numpy as np
data = {'column_name': ['apple', 'banana', 'cherry', 'grapefruit']}
df = pd.DataFrame(data)
pattern = 'apple'
# Create a mask for rows where 'column_name' does not contain 'apple'
mask = np.where(~df['column_name'].isin([pattern]))[0]
# Filter the DataFrame using the mask
filtered_df = df.iloc[mask]
print(filtered_df)
This approach can be less efficient than ~
with isin()
for larger DataFrames due to the use of NumPy functions.
Custom Function with apply:
For more complex filtering criteria, you can define a custom function and apply it to the column using apply()
:
import pandas as pd
def does_not_contain(text, pattern):
return not pattern in text
data = {'column_name': ['apple pie', 'banana shake', 'cherry cobbler', 'grapefruit juice']}
df = pd.DataFrame(data)
pattern = 'apple'
# Apply the custom function to filter rows
filtered_df = df[df['column_name'].apply(does_not_contain, args=(pattern,))]
print(filtered_df)
This method offers flexibility but might be slower for large DataFrames compared to ~
with isin()
.
Remember to choose the method that best suits your specific needs based on factors like performance, complexity, and readability.
python pandas contains