Pandas - Handling Missing Values - 7 Days of Pandas

Welcome to the fourth article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In this tutorial, we will look at how to handle missing values in data.

When working on some data, it's not uncommon to find missing values in the data. Missing values can occur in data for a variety of reasons, for example, error in data capture, encoding issues, etc. It's important to deal with missing values before you proceed on further analyzing data and it's a major step in data preprocessing.

In this tutorial, we will cover the following topics:

Identifying missing values.
Handling missing values.
- Filling missing values.
- Removing missing values.

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd
import numpy as np

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, None, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", np.nan, "Marketing"],
    "Salary": [60000, np.nan, 82000, np.nan, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df

	Name	Age	Department	Salary
0	Tim	26.0	Marketing	60000.0
1	Shaym	28.0	Product	NaN
2	Noor	27.0	Product	82000.0
3	Esha	32.0	HR	NaN
4	Sam	24.0	Product	58000.0
5	James	NaN	NaN	55000.0
6	Lily	33.0	Marketing	65000.0

Identifying missing values

To identify missing values in a pandas DataFrame, you can use the pandas isna() method, which returns a boolean mask indicating the presence of missing values. You can then use this mask to select the rows or columns with missing values.

For example, let's check which values in the above dataframe are missing using the isna() function.

df.isnull()

	Name	Age	Department	Salary
0	False	False	False	False
1	False	False	False	True
2	False	False	False	False
3	False	False	False	True
4	False	False	False	False
5	False	True	True	False
6	False	False	False	False

You can see the resulting boolean mask.

To check which columns in the dataframe have missing value, apply the any() function on the resulting boolean dataframe with axis=0.

df.isna().any(axis=0)

Name          False
Age            True
Department     True
Salary         True
dtype: bool

We see that only the "Name" column doesn't have any missing values.

Handling missing values

Handling missing values is an important step in the data preparation pipeline. Generally, there are two approaches to handle missing values -

Fill the missing value with some appropriate value (for example, a constant or mean, median, etc. for continuous variables, and mode for categorical fields).
Remove the missing values (remove the records with missing values).

Let's now look at how to do both of them in pandas.

Filling missing values

To fill missing values in a pandas DataFrame, you can use the pandas fillna() method. This method allows you to specify a value to fill the missing values with, or a method for imputing the missing values.

For example, let's see what we get if we fill the missing values with 0.

df.fillna(0)

	Name	Age	Department	Salary
0	Tim	26.0	Marketing	60000.0
1	Shaym	28.0	Product	0.0
2	Noor	27.0	Product	82000.0
3	Esha	32.0	HR	0.0
4	Sam	24.0	Product	58000.0
5	James	0.0	0	55000.0
6	Lily	33.0	Marketing	65000.0

It returns a dataframe with the missing values filled with the constant value. Note that the fillna() function didn't modify the original dataframe in-place. It returned the resulting dataframe after filling the missing values.

You can also specify different values for different columns when filling missing values.
For exmaple, let's fill missing values in "Age" and "Salary" columns with their respective means and the missing value in the "Department" column with its mode (the most frequent value).

df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].mean(), 'Department': df['Department'].mode()[0]})

	Name	Age	Department	Salary
0	Tim	26.000000	Marketing	60000.0
1	Shaym	28.000000	Product	64000.0
2	Noor	27.000000	Product	82000.0
3	Esha	32.000000	HR	64000.0
4	Sam	24.000000	Product	58000.0
5	James	28.333333	Product	55000.0
6	Lily	33.000000	Marketing	65000.0

Dropping rows with missing values

Another strategy of handling missing values is to remove the rows that contain missing values. This is used when the proportion of missing values is comparitively less and we can afford to discard that data.

Use the pandas dropna() function to remove rows with missing values.

df.dropna()

	Name	Age	Department	Salary
0	Tim	26.0	Marketing	60000.0
2	Noor	27.0	Product	82000.0
4	Sam	24.0	Product	58000.0
6	Lily	33.0	Marketing	65000.0

This is how the dataframe looks after removing rows with any missing values.

Blog

Pandas - Handling Missing Values - 7 Days of Pandas

Piyush Raj

Identifying missing values

Handling missing values

Filling missing values

Dropping rows with missing values

Join Our Newsletter. No Spam, Only the good stuff.

Related