Merging and Concatenating: Mastering DataFrame Combination in pandas
Combining DataFrames in pandas
pandas offers two primary methods for combining DataFrames:
- Concatenation (using concat()):
  - This method appends DataFrames either vertically (adding rows) or horizontally (adding columns).
  - DataFrames do not need matching shapes. Vertical concatenation aligns columns by name and horizontal concatenation aligns rows by index label; any label missing from one DataFrame is filled with NaN.
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})

# Vertical concatenation (adding rows)
df_combined_v = pd.concat([df1, df2])
print(df_combined_v)

# Horizontal concatenation (adding columns, assuming compatible shapes)
df_combined_h = pd.concat([df1, df2.iloc[:, 0:2]], axis=1)  # Specify axis=1 for columns
print(df_combined_h)
- Merging (using merge()):
  - This method combines DataFrames based on one or more common columns (join keys).
  - DataFrames can have different shapes, but they must share the join key(s), or you must name the key column on each side with left_on and right_on.
  - The join type (inner, left, right, or outer) determines which rows are kept when a key value appears in only one of the DataFrames.
df3 = pd.DataFrame({'A': [1, 4, 5], 'X': [10, 20, 30]})
df4 = pd.DataFrame({'B': [4, 5, 6], 'Y': [40, 50, 60]})

# Inner join (only rows whose key values match in both DataFrames are kept)
# The key is named 'A' in df3 and 'B' in df4, so each side is named explicitly
merged_inner = df3.merge(df4, left_on='A', right_on='B', how='inner')
print(merged_inner)

# Left join (all rows from the left DataFrame are kept, even if there's no match in the right DataFrame)
merged_left = df3.merge(df4, left_on='A', right_on='B', how='left')
print(merged_left)
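To compare the four join types side by side, here is a minimal sketch. The frames left and right and the shared column name key are made up for illustration; indicator=True is a merge option that adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames that share a key column named 'key'
left = pd.DataFrame({'key': [1, 4, 5], 'X': [10, 20, 30]})
right = pd.DataFrame({'key': [4, 5, 6], 'Y': [40, 50, 60]})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    # indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
    result = left.merge(right, on='key', how=how, indicator=True)
    row_counts[how] = len(result)
    print(f"--- how={how!r} ---")
    print(result)

print(row_counts)
```

Keys 4 and 5 appear on both sides, key 1 only on the left and key 6 only on the right, so the row counts come out as 2 (inner), 3 (left), 3 (right) and 4 (outer).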
Choosing the Right Method:
- Use concatenation if you want to simply stack DataFrames together without a specific join condition.
- Use merging if you want to combine DataFrames based on a relationship between columns (join keys).
Additional Considerations:
- When concatenating horizontally, rows are matched by index label, not by position. DataFrames with different lengths or different indexes still concatenate, but the unmatched rows are filled with NaN.
- Merging offers several join types that control what happens to rows whose key value has no match in the other DataFrame.
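The effect of the index on horizontal concatenation can be sketched as follows. The frames a and b and their index labels are assumptions chosen so that the indexes only partly overlap.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap
a = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

# Default (join='outer'): union of index labels, gaps become NaN
outer = pd.concat([a, b], axis=1)
print(outer)

# join='inner': keep only the labels present in both frames
inner = pd.concat([a, b], axis=1, join='inner')
print(inner)

# To pair rows by position instead, reset both indexes first
positional = pd.concat(
    [a.reset_index(drop=True), b.reset_index(drop=True)], axis=1
)
print(positional)
```

Resetting the index is only appropriate when the row order of both frames is known to correspond; otherwise the label-based alignment is the safer default.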
Concatenation:
Vertical concatenation (adding rows):
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
# Concatenate vertically (adding rows)
df_combined_v = pd.concat([df1, df2])
print(df_combined_v)
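This code outputs:
     A    B    C     D
0  1.0  4.0  NaN   NaN
1  2.0  5.0  NaN   NaN
2  3.0  6.0  NaN   NaN
0  NaN  NaN  7.0  10.0
1  NaN  NaN  8.0  11.0
2  NaN  NaN  9.0  12.0
Because df1 and df2 share no column names, each block of rows is filled with NaN in the other DataFrame's columns, the values become floats, and the index labels 0 to 2 repeat.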
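Vertical concatenation is most useful when the inputs share column names. The sketch below uses made-up frames (jan, feb, extra) to show how columns are matched by name and how ignore_index=True renumbers the rows.

```python
import pandas as pd

# Hypothetical frames that share the same column names
jan = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
feb = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Columns are matched by name; ignore_index=True renumbers the rows 0..3
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)

# A column present in only one input is filled with NaN in the other rows
extra = pd.DataFrame({'A': [9], 'C': [10]})
mixed = pd.concat([jan, extra], ignore_index=True)
print(mixed)
```

Here stacked has four rows and no missing values, while mixed gains a column C that is NaN for the rows that came from jan.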
Horizontal concatenation (adding columns, assuming compatible shapes):
df_combined_h = pd.concat([df1, df2.iloc[:, 0:2]], axis=1) # Specify axis=1 for columns
print(df_combined_h)
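This code outputs:
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
Note that df2.iloc[:, 0:2] selects both columns of df2, so C and D both appear in the result.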
Merging:
Inner join (only rows with matching values in both join keys are kept):
df3 = pd.DataFrame({'A': [1, 4, 5], 'X': [10, 20, 30]})
df4 = pd.DataFrame({'B': [4, 5, 6], 'Y': [40, 50, 60]})
# The key column is named 'A' in df3 and 'B' in df4, so pass left_on and right_on
merged_inner = df3.merge(df4, left_on='A', right_on='B', how='inner')
print(merged_inner)
This code outputs:
   A   X  B   Y
0  4  20  4  40
1  5  30  5  50
Left join (all rows from the left DataFrame are kept, even if there's no match in the right DataFrame):
merged_left = df3.merge(df4, left_on='A', right_on='B', how='left')
print(merged_left)
This code outputs:
   A   X    B     Y
0  1  10  NaN   NaN
1  4  20  4.0  40.0
2  5  30  5.0  50.0
Additional Examples:
- Concatenating with different column names:
df5 = pd.DataFrame({'X': [100, 200, 300]})
df_combined_diff = pd.concat([df1, df5], axis=1) # Handles different column names
print(df_combined_diff)
- Merging with multiple join keys:
df6 = pd.DataFrame({'A': [1, 1, 4], 'X': [10, 11, 20], 'Z': [70, 80, 90]})
merged_multi = df3.merge(df6, on=['A', 'X'], how='inner')
print(merged_multi)
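When both DataFrames also carry a non-key column with the same name, merge keeps both copies and renames them. A small sketch, assuming hypothetical frames orders and details that share a value column in addition to the keys A and X:

```python
import pandas as pd

# Hypothetical frames: 'A' and 'X' are the join keys, 'value' exists on both sides
orders = pd.DataFrame({'A': [1, 4, 5], 'X': [10, 20, 30], 'value': [0.1, 0.2, 0.3]})
details = pd.DataFrame({'A': [1, 1, 4], 'X': [10, 11, 20], 'value': [7, 8, 9]})

# A row is kept only when BOTH keys match; suffixes rename the overlapping column
merged = orders.merge(
    details, on=['A', 'X'], how='inner', suffixes=('_left', '_right')
)
print(merged)
```

Only the key pairs (1, 10) and (4, 20) exist in both frames, so the result has two rows with the columns A, X, value_left and value_right.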
I hope these examples provide a clear understanding of combining DataFrames in pandas!
Using a List Comprehension (for Simple Concatenation):
If you have several DataFrames and want a concise way to concatenate them vertically, you can build the list of DataFrames with a list comprehension and pass it to concat. Note that the comprehension below only copies the list, so passing all_dfs directly gives the same result; a comprehension becomes useful once it filters or transforms the DataFrames.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [4, 5, 6]})
df3 = pd.DataFrame({'C': [7, 8, 9]})
all_dfs = [df1, df2, df3]
combined_df = pd.concat([df for df in all_dfs])
print(combined_df)
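If the DataFrames are held in a dictionary rather than a list, the keys argument of concat can label each block of rows with its source. This is a sketch under assumed names (frames, 'first', 'second'), not part of the example above.

```python
import pandas as pd

# Hypothetical dictionary mapping a label to each DataFrame
frames = {
    'first': pd.DataFrame({'A': [1, 2]}),
    'second': pd.DataFrame({'A': [3, 4]}),
}

# keys= adds an outer index level so each row remembers which frame it came from
labeled = pd.concat(list(frames.values()), keys=list(frames.keys()))
print(labeled)

# Select the rows that came from one source by its label
print(labeled.loc['second'])
```

The result has a two-level index, which makes it possible to recover the original pieces after concatenation.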
Appending Rows (for Incremental Concatenation):
If you're building a DataFrame row by row, older tutorials use the append method. However, append was deprecated in pandas 1.4 and removed in pandas 2.0, so collect the rows in a list and combine them with a single concat call, using ignore_index=True to renumber the result.
import pandas as pd
rows = []
for i in range(3):
    # One single-row DataFrame per iteration
    rows.append(pd.DataFrame({'A': [i + 1], 'B': [i * 2]}))
df = pd.concat(rows, ignore_index=True)  # ignore_index=True avoids duplicate index labels
print(df)
Looping and Concatenating (for Conditional Concatenation):
For more complex concatenation scenarios where you need to conditionally include/exclude DataFrames based on certain criteria, you can use a loop and concatenate based on those conditions.
df_list = [df1, df2, df3]
selected = []
for df in df_list:
    if df.shape[0] > 1:  # Include only DataFrames with more than 1 row
        selected.append(df)
combined_df = pd.concat(selected)  # Concatenate once, after the loop
print(combined_df)
These alternative methods offer different ways to combine DataFrames depending on your specific requirements. However, concat and merge remain the most versatile and efficient approaches for most use cases.