Pandas Filtering Techniques: Mastering 'IN' and 'NOT IN' Conditions
Using isin() for "IN":
Imagine you have a DataFrame df
with a column named "City". You want to select rows where the city is either "New York" or "Paris". In SQL, you might use:
SELECT * FROM df WHERE City IN ('New York', 'Paris');
In pandas, you can achieve the same using the isin()
method:
filtered_df = df[df['City'].isin(['New York', 'Paris'])]
This selects rows where the "City" column values are in the list ['New York', 'Paris']
.
Let's say you want to exclude rows where the city is "New York" or "Paris". Similar to SQL's "NOT IN":
SELECT * FROM df WHERE City NOT IN ('New York', 'Paris');
In pandas, there are a few ways to achieve this:
- Using ~isin():
filtered_df = df[~df['City'].isin(['New York', 'Paris'])]
The ~
symbol negates the condition, so it selects rows where the city is not in the list.
- Using boolean indexing:
filtered_df = df[df['City'].isin(['New York', 'Paris']) == False]
This explicitly checks if the isin()
condition is False
.
- Using numpy:
import numpy as np
filtered_df = df[np.logical_not(df['City'].isin(['New York', 'Paris']))]
This leverages NumPy's logical_not
function for negation.
These methods all achieve the same result of excluding rows based on the "City" column.
Key Points:
isin()
is used for "IN" filtering.~isin()
or boolean indexing with== False
is used for "NOT IN" filtering.- These methods work on any DataFrame column and can be used with lists, tuples, or other iterables containing the values to filter by.
I hope this explanation clarifies filtering DataFrames using "in" and "not in" conditions in pandas!
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'New York'],
'Age': [25, 30, 28, 42, 22]}
df = pd.DataFrame(data)
# Filter for cities "New York" or "Paris" (IN)
filtered_df_in = df[df['City'].isin(['New York', 'Paris'])]
print("People living in New York or Paris:")
print(filtered_df_in)
# Filter for cities NOT "New York" or "Paris" (NOT IN) - Using ~isin()
filtered_df_notin_1 = df[~df['City'].isin(['New York', 'Paris'])]
print("\nPeople NOT living in New York or Paris (using ~isin()):")
print(filtered_df_notin_1)
# Filter for cities NOT "New York" or "Paris" (NOT IN) - Using boolean indexing
filtered_df_notin_2 = df[df['City'].isin(['New York', 'Paris']) == False]
print("\nPeople NOT living in New York or Paris (using boolean indexing):")
print(filtered_df_notin_2)
This code creates a DataFrame df
with data about names, cities, and ages. It then demonstrates three filtering methods:
- Filtering for "New York" or "Paris" using
isin()
.
The output will show the original DataFrame, the filtered DataFrame for cities "New York" or "Paris", and the filtered DataFrames for cities not "New York" or "Paris" using both methods. This should give you a practical understanding of how these filtering techniques work!
This approach uses boolean expressions to directly select rows based on conditions. Here's an example for filtering cities "New York" or "Paris":
filtered_df = df[ (df['City'] == 'New York') | (df['City'] == 'Paris')]
This creates a boolean Series where True
indicates rows matching the city criteria. The |
operator performs a bitwise OR to combine the conditions.
Vectorized Comparisons:
For simple comparisons, you can leverage vectorized operations. Here's an example filtering for cities "New York" or "Paris":
cities_to_include = ['New York', 'Paris']
filtered_df = df[df['City'].isin(cities_to_include)]
This creates a list of desired cities and then uses isin()
with this list for filtering. While similar to the original method, it separates the comparison values into a variable for potential reusability.
.query() Method:
For more complex filtering logic, pandas offers the .query()
method that allows writing SQL-like expressions:
filtered_df = df.query("City == 'New York' or City == 'Paris'")
This method lets you define filtering conditions directly within a string, which can be helpful for intricate logic.
Choosing the Right Method:
isin()
is generally the most efficient and readable option for simple "IN" or "NOT IN" filtering.- Boolean indexing offers more control over individual conditions but might be less readable for complex logic.
- Vectorized comparisons can improve readability by separating filtering values into variables.
.query()
is powerful for intricate logic but might require some practice with its syntax.
The best approach often depends on the complexity of your filtering criteria, data size, and personal preference for code readability.
python pandas dataframe