Pandas - Handling Missing Values - 7 Days of Pandas

piyushraj

Piyush Raj

Posted on December 24, 2022

Pandas - Handling Missing Values - 7 Days of Pandas

Welcome to the fourth article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In this tutorial, we will look at how to handle missing values in data.

When working on some data, it's not uncommon to find missing values in the data. Missing values can occur in data for a variety of reasons, for example, error in data capture, encoding issues, etc. It's important to deal with missing values before you proceed on further analyzing data and it's a major step in data preprocessing.

In this tutorial, we will cover the following topics:

  1. Identifying missing values.
  2. Handling missing values.
    • Filling missing values.
    • Removing missing values.

Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.

import pandas as pd
import numpy as np

# employee data
data = {
    "Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
    "Age": [26, 28, 27, 32, 24, None, 33],
    "Department": ["Marketing", "Product", "Product", "HR", "Product", np.nan, "Marketing"],
    "Salary": [60000, np.nan, 82000, np.nan, 58000, 55000, 65000]
}

# create pandas dataframe
df = pd.DataFrame(data)

# display the dataframe
df
Enter fullscreen mode Exit fullscreen mode
Name Age Department Salary
0 Tim 26.0 Marketing 60000.0
1 Shaym 28.0 Product NaN
2 Noor 27.0 Product 82000.0
3 Esha 32.0 HR NaN
4 Sam 24.0 Product 58000.0
5 James NaN NaN 55000.0
6 Lily 33.0 Marketing 65000.0

Identifying missing values

To identify missing values in a pandas DataFrame, you can use the pandas isna() method, which returns a boolean mask indicating the presence of missing values. You can then use this mask to select the rows or columns with missing values.

For example, let's check which values in the above dataframe are missing using the isna() function.

df.isnull()
Enter fullscreen mode Exit fullscreen mode
Name Age Department Salary
0 False False False False
1 False False False True
2 False False False False
3 False False False True
4 False False False False
5 False True True False
6 False False False False

You can see the resulting boolean mask.

To check which columns in the dataframe have missing value, apply the any() function on the resulting boolean dataframe with axis=0.

df.isna().any(axis=0)
Enter fullscreen mode Exit fullscreen mode
Name          False
Age            True
Department     True
Salary         True
dtype: bool
Enter fullscreen mode Exit fullscreen mode

We see that only the "Name" column doesn't have any missing values.

Handling missing values

Handling missing values is an important step in the data preparation pipeline. Generally, there are two approaches to handle missing values -

  • Fill the missing value with some appropriate value (for example, a constant or mean, median, etc. for continuous variables, and mode for categorical fields).
  • Remove the missing values (remove the records with missing values).

Let's now look at how to do both of them in pandas.

Filling missing values

To fill missing values in a pandas DataFrame, you can use the pandas fillna() method. This method allows you to specify a value to fill the missing values with, or a method for imputing the missing values.

For example, let's see what we get if we fill the missing values with 0.

df.fillna(0)
Enter fullscreen mode Exit fullscreen mode
Name Age Department Salary
0 Tim 26.0 Marketing 60000.0
1 Shaym 28.0 Product 0.0
2 Noor 27.0 Product 82000.0
3 Esha 32.0 HR 0.0
4 Sam 24.0 Product 58000.0
5 James 0.0 0 55000.0
6 Lily 33.0 Marketing 65000.0

It returns a dataframe with the missing values filled with the constant value. Note that the fillna() function didn't modify the original dataframe in-place. It returned the resulting dataframe after filling the missing values.

You can also specify different values for different columns when filling missing values.
For exmaple, let's fill missing values in "Age" and "Salary" columns with their respective means and the missing value in the "Department" column with its mode (the most frequent value).

df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].mean(), 'Department': df['Department'].mode()[0]})
Enter fullscreen mode Exit fullscreen mode
Name Age Department Salary
0 Tim 26.000000 Marketing 60000.0
1 Shaym 28.000000 Product 64000.0
2 Noor 27.000000 Product 82000.0
3 Esha 32.000000 HR 64000.0
4 Sam 24.000000 Product 58000.0
5 James 28.333333 Product 55000.0
6 Lily 33.000000 Marketing 65000.0

Dropping rows with missing values

Another strategy of handling missing values is to remove the rows that contain missing values. This is used when the proportion of missing values is comparitively less and we can afford to discard that data.

Use the pandas dropna() function to remove rows with missing values.

df.dropna()
Enter fullscreen mode Exit fullscreen mode
Name Age Department Salary
0 Tim 26.0 Marketing 60000.0
2 Noor 27.0 Product 82000.0
4 Sam 24.0 Product 58000.0
6 Lily 33.0 Marketing 65000.0

This is how the dataframe looks after removing rows with any missing values.

💖 💪 🙅 🚩
piyushraj
Piyush Raj

Posted on December 24, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related