Tomoyuki Aota
Posted on May 24, 2018
(A Japanese translation is available here.)
During data analysis, we need to deal with missing values. Handling missing data is a deep enough subject to fill an entire book. Before doing anything to the missing values, however, we need to know the pattern in which they occur. This article describes easy techniques for visualizing missing value occurrence with Python. The techniques are useful in the early stages of exploratory data analysis.
I've uploaded a Jupyter notebook in my GitHub repo. You can run it using Binder by clicking the badge below.
Prerequisite
I'm using the Titanic train dataset from Kaggle as an example. The rest of this article assumes that the following code has been executed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
# Confirm the number of missing values in each column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
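df.info() reports non-null counts, so the number of missing values has to be worked out by subtraction. As a small sketch, the missing counts and percentages can also be computed directly. The toy DataFrame below is a made-up stand-in (not the real Titanic data) so that the snippet runs without train.csv; with the real data, replace toy with df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data (values invented for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() gives a boolean frame; summing counts the True values per column,
# and the mean of booleans is the fraction of missing values.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Sorting by n_missing puts the columns that need the most attention at the top.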
Method 1: seaborn.heatmap
The first method uses seaborn.heatmap. The following one-liner visualizes the locations of missing values.
sns.heatmap(df.isnull(), cbar=False)
Reading the plot against the index (the vertical axis), I can see that
- the Age column has missing values scattered throughout,
- the Cabin column consists mostly of missing values, with the valid values scattered throughout, and
- the Embarked column has very few missing values, in the early part of the index.
This is not the case for the Titanic dataset, but especially with time series data, we need to know whether missing values are sparsely scattered or concentrated in large chunks. The heatmap reveals such tendencies immediately. Also, if two or more columns have correlated missing value locations, that correlation becomes visible. (Again, that is not the case for this dataset, but it is important to know that no such correlation exists here.)
This one-liner tells us a lot about how missing values occur.
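The "scattered or one big chunk" question can also be checked numerically. The following is a minimal sketch, assuming a single column held as a pandas Series; the series s is invented for illustration and is not taken from the Titanic data. It measures the length of each run of consecutive missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical time-series-like column (values invented for illustration).
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0, np.nan, np.nan])

is_missing = s.isnull()

# A new run starts whenever the missing / not-missing state changes,
# so a cumulative sum over the change points labels each run.
run_id = is_missing.astype(int).diff().ne(0).cumsum()

# Keep only the runs made of missing values and measure their lengths.
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size()

print(gap_lengths.tolist())
print("longest gap:", gap_lengths.max())
```

Many short gaps suggest sparse, scattered missing values, while one long gap suggests a chunk, for example a sensor outage.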
Method 2: missingno module
If you want to go further, the missingno module is useful.
To begin with, install and import it.
pip install missingno
import missingno as msno
If you want a result similar to the seaborn.heatmap plot described earlier, use missingno.matrix.
msno.matrix(df)
In addition to the heatmap, there is a narrow strip on the right side of this diagram. It is a line plot (a sparkline) of each row's data completeness. In this dataset, every row has 10 to 12 valid values and hence 0 to 2 missing values.
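The per-row completeness can be reproduced with plain pandas, which is handy when you want the numbers rather than the picture. This is a sketch on a made-up toy DataFrame (not the real Titanic data); with the real data, replace toy with df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data (values invented for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Count the valid (non-null) values in each row.
completeness = toy.notnull().sum(axis=1)

print(completeness.tolist())
print("min:", completeness.min(), "max:", completeness.max())
```

The minimum and maximum here correspond to the range of valid values per row discussed above.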
Also, missingno.heatmap visualizes the correlation matrix of missing value locations across columns.
msno.heatmap(df)
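To get a feel for what such a correlation matrix contains, here is a rough pandas-only sketch. It is my own approximation, not missingno's actual implementation: each column is turned into a 0/1 missing indicator and the indicators are correlated. Columns that are completely filled or completely empty are dropped first, because a constant indicator has no meaningful correlation. The toy DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: A and B are missing in the same rows,
# C is missing exactly where A is present, and D has no missing values.
toy = pd.DataFrame({
    "A": [np.nan, 1.0, np.nan, 1.0, 1.0, np.nan],
    "B": [np.nan, 2.0, np.nan, 2.0, 2.0, np.nan],
    "C": [1.0, np.nan, 1.0, np.nan, np.nan, 1.0],
    "D": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

nullity = toy.isnull()

# Keep only columns whose indicator varies (some values missing, but not all).
varying = nullity.loc[:, nullity.any() & ~nullity.all()]

# Correlate the 0/1 missing indicators.
nullity_corr = varying.astype(int).corr()

print(nullity_corr)
```

A value near 1 means two columns tend to be missing together, and a value near -1 means one tends to be missing where the other is present.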
The missingno module has more features, such as a bar chart showing how complete each column is and a dendrogram generated from the correlation of missing value locations. For more information, the README is a good primer.
Closing
This article described two easy visualization methods. seaborn.heatmap is the first choice because it requires only seaborn, but if you need more, the missingno module will help you.