Understanding Imputation with fillna()
When working with datasets, you'll often encounter missing values, represented as NaN (Not a Number). The fillna() method in pandas helps you replace these missing values with calculated values. Let's explore how to use it with mean, median, and mode.
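Before imputing, it helps to see where the gaps are. The snippet below is a minimal sketch on made-up data (the `age` and `city` columns are hypothetical): it counts missing values per column and shows that fillna() returns a new object rather than changing the original.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Paris", "Lima", None, "Lima"],
})

# Count the missing values in each column before imputing.
missing_per_column = df.isna().sum()
print(missing_per_column)

# fillna() returns a new Series; df["age"] itself still holds the NaN
# until the result is assigned back to the column.
filled_age = df["age"].fillna(0)
print(filled_age)
```

A constant such as 0 is used here only to show the mechanics; the sections that follow replace it with a statistic computed from the column.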
Imputing with Mean
The mean is the average of all values in a column. Using the mean to impute missing values involves replacing each NaN with the calculated average of the existing values in that column.
- Definition: The mean is the sum of all values divided by the number of values: $ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} $
- Usage (assigning the result back avoids the chained-assignment warning that inplace=True on a selected column triggers in recent pandas versions):
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
- Pros: Simple and quick to calculate.
- Cons: Sensitive to outliers, which can skew the mean and lead to biased imputation.
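To make the outlier sensitivity concrete, here is a small sketch on a hypothetical `salary` column with one extreme value. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme value and one missing entry.
df = pd.DataFrame({"salary": [30.0, 32.0, 35.0, np.nan, 500.0]})

# mean() skips NaN by default, so it averages the four known values:
# (30 + 32 + 35 + 500) / 4 = 149.25
mean_value = df["salary"].mean()

# Assign the filled column back to the DataFrame.
df["salary"] = df["salary"].fillna(mean_value)
print(df)
```

The imputed value of 149.25 is far from the three typical salaries, which is the bias the Cons bullet warns about.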
Imputing with Median
The median is the middle value in a sorted list of numbers. If there are an even number of values, the median is the average of the two middle values. Using the median to impute missing values involves replacing each NaN with the calculated median of the existing values in that column.
- Definition: The median is the central value separating the higher half from the lower half of a data sample.
- Usage:
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
- Pros: More robust to outliers than the mean.
- Cons: Can still misrepresent the typical value when the distribution is heavily skewed or multimodal, for example by landing between two clusters.
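As a sketch of the robustness claim, the example below uses a hypothetical `salary` column containing one extreme value. With four known values, the median is the average of the two middle ones.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme value and one missing entry.
df = pd.DataFrame({"salary": [30.0, 32.0, 35.0, np.nan, 500.0]})

# median() also skips NaN. The sorted known values are 30, 32, 35, 500,
# so the median is (32 + 35) / 2 = 33.5.
median_value = df["salary"].median()

# Assign the filled column back to the DataFrame.
df["salary"] = df["salary"].fillna(median_value)
print(df)
```

The value 500 barely influences the result, so the imputed entry stays close to the typical salaries.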
Imputing with Mode
The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode (if all values appear only once). Using the mode to impute missing values involves replacing each NaN with the most frequent value in that column.
- Definition: The mode is the value that appears most often in a dataset.
- Usage:
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])
(Note: .mode() returns a Series, because there can be more than one mode, so we take the first element with [0].)
- Pros: Useful for categorical data or discrete numerical data.
- Cons: Poorly suited to continuous numerical data, especially with a roughly uniform distribution, where values rarely repeat and the mode can be arbitrary.
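The following sketch shows mode imputation on a hypothetical categorical `color` column where two values tie. It also guards against a column with no known values, where mode() returns an empty Series and taking the first element would fail.

```python
import pandas as pd

# Hypothetical categorical column; "red" and "blue" both appear twice.
df = pd.DataFrame({"color": ["red", "blue", None, "red", "blue", "green"]})

# mode() ignores missing values and returns every tied value in sorted order.
modes = df["color"].mode()
print(list(modes))

# Guard against an empty result before taking the first mode.
if not modes.empty:
    df["color"] = df["color"].fillna(modes.iloc[0])
print(df)
```

When several values tie, taking the first one is an arbitrary choice, so it is worth inspecting the full result of mode() before filling.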
| Feature | Mean | Median | Mode |
|---|---|---|---|
| Definition | Average of all values | Middle value in a sorted list | Most frequent value |
| Calculation | Sum of values / Number of values | Requires sorting; average of two middle values if even number of data points | Count occurrences of each value |
| Sensitivity to Outliers | Highly sensitive | Robust | Largely unaffected |
| Best Use Case | Normally distributed data with few outliers | Skewed data or data with outliers | Categorical or discrete data |
| Potential Bias | Can be biased by outliers | Less biased than mean, but can still introduce bias in skewed data | Can introduce significant bias if the mode is not representative |
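To see the differences in the table side by side, this sketch computes all three statistics on one hypothetical Series containing a single outlier.

```python
import pandas as pd

# Hypothetical discrete values with one outlier (100).
values = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0])

summary = {
    "mean": values.mean(),            # pulled upward by the outlier
    "median": values.median(),        # middle of the sorted values
    "mode": values.mode().iloc[0],    # most frequent value
}
print(summary)
```

Here the median and mode are both 2.0, while the mean is 21.6, well above every value except the outlier.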
Key Takeaways
- Mean: Use for normally distributed data without significant outliers.
- Median: Preferred for data with outliers or skewed distributions.
- Mode: Best suited for categorical or discrete data.
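One way to apply these rules across a whole DataFrame is sketched below. The helper `impute_simple` is hypothetical, not part of pandas, and it assumes a conservative default: median for numeric columns and mode for everything else. Swap in mean() for columns you know are roughly normal.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median for numeric columns, mode otherwise."""
    result = df.copy()
    for name in result.columns:
        column = result[name]
        if is_numeric_dtype(column):
            fill_value = column.median()
        else:
            modes = column.mode()
            if modes.empty:
                # No known values in this column, so nothing to impute from.
                continue
            fill_value = modes.iloc[0]
        result[name] = column.fillna(fill_value)
    return result


# Made-up data: one numeric column with an outlier, one categorical column.
df = pd.DataFrame({
    "income": [40.0, np.nan, 45.0, 900.0],
    "segment": ["a", "b", None, "b"],
})
imputed = impute_simple(df)
print(imputed)
```

The helper works on a copy, so the original DataFrame keeps its missing values and the two can be compared.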