Size Matters, But So Does Data Validity: A Guide to size and count in pandas
Understanding size and count:
- size:
  - Counts all elements in the object, including missing values (NaN).
  - Returns a single integer representing the total number of elements.
  - Example: df.size returns 8 for a DataFrame with 4 rows and 2 columns.
- count:
  - Counts only non-null (valid) values, excluding missing values (NaN).
  - Returns:
    - A Series with the count of non-null values for each column if used on a DataFrame.
    - A single integer representing the count of non-null values if used on a Series.
  - Example: df.count() returns a Series with the value 4 for both the 'A' and 'B' columns when the DataFrame has 4 rows and neither column contains missing values.
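The two behaviours described above can be checked with a short sketch. The DataFrame and Series here are made-up illustrations (a 4x2 frame with no missing values, and a three-element Series with one NaN), not data from elsewhere in this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x2 DataFrame with no missing values.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

rows, cols = df.shape
total = df.size            # 8, i.e. rows * cols
per_column = df.count()    # Series: 4 for 'A', 4 for 'B'

# On a Series, size and count each return a single integer.
s = pd.Series([1.0, np.nan, 3.0])
series_size = s.size       # 3: the NaN is included
series_count = s.count()   # 2: the NaN is excluded

print(total)
print(per_column)
print(series_size, series_count)
```

Note that size is an attribute (no parentheses), while count is a method and must be called.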
Key Differences:
Feature | size | count |
---|---|---|
Missing values | Counts all elements, including NaN | Excludes NaN values |
Output type | Single integer | Series (for DataFrames) or single integer (for Series) |
Example:
import numpy as np
import pandas as pd
# Column 'B' contains one missing value (np.nan)
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, np.nan, 8]})
print("Shape of the dataframe:", df.shape)
print("Size of the dataframe:", df.size)
print("Count of the dataframe:")
print(df.count())
This code outputs:
Shape of the dataframe: (4, 2)
Size of the dataframe: 8
Count of the dataframe:
A 4
B 3
dtype: int64
As you can see, size is 8 because it counts all elements, including the NaN value in column 'B'. count, on the other hand, excludes NaN values, so the resulting Series shows 4 for column 'A' (no missing values) and 3 for column 'B' (one NaN).
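Because size includes NaN and count does not, the gap between them gives the number of missing values. The following is a minimal sketch of that idea using the same DataFrame as the example; the variable names are illustrative, and df.isna().sum() is used only as a cross-check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, np.nan, 8]})

# Missing values per column: number of rows minus the non-null count.
missing_per_column = len(df) - df.count()

# Total missing values: all cells minus all non-null cells.
total_missing = df.size - int(df.count().sum())

# Cross-check against pandas' direct way of counting NaN per column.
matches_isna = missing_per_column.equals(df.isna().sum())

print(missing_per_column)
print(total_missing, matches_isna)
```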
Choosing between size and count:
- Use size when you want to know the total number of elements, regardless of missing values. This can be useful for tasks like iterating over all elements.
- Use count when you want to know the number of valid (non-null) values. This is helpful for understanding how many usable data points you have in each column.
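As one possible way to act on the count guidance, the sketch below turns the per-column counts into a completeness ratio and keeps only the columns above a cutoff. The 0.8 threshold and the variable names are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, np.nan, 8]})

# Share of usable (non-null) values in each column.
completeness = df.count() / len(df)   # 1.0 for 'A', 0.75 for 'B'

# count also works row-wise: non-null values in each row.
per_row = df.count(axis=1)

# Keep columns that are at least 80% complete (arbitrary cutoff).
threshold = 0.8
usable_columns = completeness[completeness >= threshold].index.tolist()

print(completeness)
print(per_row.tolist())
print(usable_columns)
```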
I hope this explanation clarifies the difference between size and count in pandas!