#01 Python - Missing Values (NaN | Null)
Ana Carolina Branco Neumann
Posted on August 16, 2023
Types of Data Absence
There are different patterns of data absence:
- MCAR (Missing Completely At Random): The absence of data is entirely random and unrelated to any other variable in the dataset. The probability of a value being missing is the same for all observations and doesn't depend on any values in the dataset, observed or unobserved. In other words, the missing data doesn't introduce any systematic bias into the analysis.
- MAR (Missing At Random): The absence of data may be related to other observed variables, but not to the missing value itself. The probability of a value being missing can depend on information available in other variables, but it doesn't depend on the actual value that's missing. As long as those related variables are present in the dataset, no systematic bias is introduced into the analysis.
- MNAR (Missing Not At Random): The absence of data is related to the missing value itself, and this relationship can't be explained by other variables in the dataset. The probability of a value being missing depends on the actual value that's missing, regardless of other observed variables. This introduces a systematic bias into the analysis, because the missingness is tied to the very quantity you want to measure.
It's important to understand the type of data absence when working with a dataset, as each type of absence requires different treatment strategies or data imputation methods to deal with missing values. Knowing the types of data absence can also help in correctly interpreting results and avoiding false or biased conclusions.
Note: "No systematic bias introduced" means that the absence of data doesn't affect the analysis in a biased or systematic manner. In other words, the missing data doesn't consistently influence the results.
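As an illustration, here's a minimal sketch that simulates the three patterns on a hypothetical survey dataframe (the column names and probabilities are made up for the example):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'age': rng.integers(18, 80, 1000),
                   'income': rng.normal(50_000, 15_000, 1000)})

# MCAR: every income value has the same 10% chance of being missing
mcar = df['income'].mask(rng.random(1000) < 0.10)

# MAR: older respondents are more likely to skip the income question
# (missingness depends on the observed 'age', not on income itself)
mar = df['income'].mask(rng.random(1000) < (df['age'] - 18) / 124)

# MNAR: high earners are more likely to withhold their income
# (missingness depends on the value that goes missing)
mnar = df['income'].mask((df['income'] > 65_000) & (rng.random(1000) < 0.5))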
Identifying Null Values
To identify missing values, we use pandas' isnull() method to check which values are null in the dataframe. For example:
df.isnull()
This command returns a dataframe with the same shape as the original, but with boolean values indicating whether each element is null. The isna() method can also be used, as described below.
Identifying NaN Values
NaN → Not a Number.
isna() is a pandas method that returns a boolean dataframe indicating which elements are missing values (NaN) or null. It has the same functionality as the isnull() method.
The isna() function can be applied to an entire dataframe or to a specific series within it. When used on the entire dataframe, it returns a dataframe with the same shape as the input, where each element is replaced by True if it's a missing value or False otherwise.
Here's an example of using the isna() method:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None], 'B': [3, None, 5]})
print(df.isna())
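For the dataframe above, this prints:
       A      B
0  False  False
1  False   True
2   True  False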
Counting Missing Values
To get an overview of missing values in each column, you can use the sum() method in combination with the isnull() method. For example:
df.isnull().sum()
This command returns the total number of missing values in each column. The same can be applied with isna(). For example:
df.isna().sum()
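It's often more informative to look at the proportion of missing values per column. Since isna() returns booleans, taking the mean gives the fraction of missing entries:
# Percentage of missing values per column:
df.isna().mean() * 100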
Visualization of Missing Data [Missingno]
The missingno library is a useful tool for visualizing missing data patterns in a dataframe. It generates charts that help identify patterns of missing values and understand the distribution of these values in a dataframe. Some of its key charts are:
- Matrix Plot: Shows the presence or absence of values in each cell of the dataframe. Each row represents a sample or record, and each column represents a variable, with empty/blank cells indicating missing values. This makes patterns and correlations between missing values visible.
- Bar Chart of Missing Values: Displays the count of missing values for each variable. It shows the proportion of missing values relative to the total number of available observations for each variable. It helps identify variables with a significant number of missing values.
- Heatmap: Shows the nullity correlation between variables, that is, how strongly the presence or absence of one variable's values relates to another's. It's useful when working with large datasets, allowing the visualization of dependencies between missing values across multiple variables.
The missingno library also provides other visualizations, such as a dendrogram that clusters variables by nullity correlation, and the matrix plot can be resampled over a datetime index to track missing values over time.
Here's an example of using the missingno library:
import missingno as msno
import matplotlib.pyplot as plt
# Matrix plot
msno.matrix(df)
plt.show()
# Bar chart of missing values
msno.bar(df)
plt.show()
# Heatmap
msno.heatmap(df)
plt.show()
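The dendrogram mentioned above is available as well; it groups together variables whose values tend to be missing in the same rows:
# Dendrogram of nullity correlation
msno.dendrogram(df)
plt.show()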
Setting a Threshold for Missing Values per Column
There's no standard or rule to determine the threshold of missing values per column, as it can vary depending on the problem, data nature, and analysis requirements.
However, there are some approaches to defining this threshold:
- Percentage Threshold: You can set a percentage threshold, for example, allowing a column to have up to 5% (or any other value) of missing values. If the proportion of missing values in a column exceeds this threshold, actions like value imputation or column deletion can be taken (see the sketch after this list).
- Domain Analysis: Certain columns might be more critical and require fewer missing values, while others may have more leeway for data absence. For instance, in a medical data dataframe, variables like age or gender might be considered essential, whereas other columns with more specific information might tolerate more missing values.
- Impact on Results: Consider the impact of missing values on final results. If a column contains critical information for the problem at hand or is necessary for constructing the result, it's advisable to have a lower threshold for missing values.
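For the percentage-threshold approach, here's a minimal sketch (the 5% cutoff is just an example):
# Fraction of missing values per column:
pct_missing = df.isna().mean()
# Drop columns whose missing share exceeds 5%:
df = df.drop(columns=pct_missing[pct_missing > 0.05].index)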
It's also important to document the missing data treatment process, especially the decision about the missing values threshold.
Dealing with Missing Values
There are several strategies to handle missing values. Here are some common options:
Removing Rows or Columns
If the missing values are confined to a small number of rows or columns, you can choose to remove them with the dropna() method. For example, to remove all rows containing at least one null value:
# Dropping rows with missing values:
df = df.dropna()
To remove columns with at least one null value, specify the axis=1 parameter:
# Dropping columns with missing values:
df = df.dropna(axis=1)
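dropna() also accepts a subset parameter to restrict the check to specific columns, and a thresh parameter that keeps only rows with at least a given number of non-null values:
# Drop rows only when column 'A' or 'B' is missing:
df = df.dropna(subset=['A', 'B'])
# Keep only rows with at least 2 non-null values:
df = df.dropna(thresh=2)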
Below is an example that drops rows with missing values in columns whose missing-value count is below a 5% threshold, and then checks whether any columns remain above the defined threshold:
# Setting the threshold (5% of rows):
threshold = len(df) * 0.05
# Columns whose missing-value count is at or below the threshold:
cols_to_drop_na = df.columns[df.isna().sum() <= threshold]
# Drop rows that have missing values in those columns:
df = df.dropna(subset=cols_to_drop_na)
# Check which columns still exceed the threshold:
print(df.columns[df.isna().sum() > threshold])
Replacing Missing Values with Statistical Values
If you prefer to keep all rows and columns, you can fill the missing values with specific values using the fillna() method. In the example below, all null values are filled with 0:
df = df.fillna(0)
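fillna() also accepts a dictionary mapping column names to fill values, so each column can get its own replacement:
# Fill each column with its own value:
df_filled = df.fillna({'A': 0, 'B': df['B'].mean()})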
You can also fill with other values, such as column statistics:
# Fill missing values with column mean
df_filled_mean = df.fillna(df.mean())
# Fill missing values with column median
df_filled_median = df.fillna(df.median())
# Fill missing values with column standard deviation
df_filled_std = df.fillna(df.std())
# Fill missing values with column mode (most frequent value)
df_filled_mode = df.apply(lambda col: col.fillna(col.mode()[0]) if col.isna().any() else col)
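Note that mode() returns a Series, because a column can have more than one most frequent value; indexing with [0] picks the first one.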
Or a more advanced statistic, like a weighted median:
import numpy as np

# One weight per row of the dataframe (must align with df's index):
weights = pd.Series([1, 2, 1], index=df.index)

# Function to calculate the weighted median
def weighted_median(values, weights):
    sorted_indices = np.argsort(values)
    sorted_values = np.array(values)[sorted_indices]
    sorted_weights = np.array(weights)[sorted_indices]
    cumsum_weights = np.cumsum(sorted_weights)
    total_weight = cumsum_weights[-1]
    median_idx = np.argmax(cumsum_weights >= total_weight / 2)
    return sorted_values[median_idx]

# Fill missing values with the column's weighted median,
# using only the weights of the rows that are not null:
df_filled_weighted_median = df.apply(
    lambda col: col.fillna(weighted_median(col.dropna(), weights[col.dropna().index]))
    if col.isna().any() else col
)
But be aware that filling missing values this way can lead to misleading analysis insights, depending on how many values are filled.
Statistical Measures by Subgroups
Filling missing values with statistical measures segmented by subgroups within the dataframe is a useful technique when you want to impute missing values based on specific characteristics of data subsets. This allows considering data heterogeneity and avoiding distortions when filling missing values with general statistical measures.
Here's an example of how to fill missing values with the mean of a group:
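Assuming a hypothetical dataframe with a 'Group' column and a numeric 'Value' column:
# Hypothetical example data:
df = pd.DataFrame({'Group': ['a', 'a', 'b', 'b'],
                   'Value': [1.0, None, 3.0, None]})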
# Calculate the mean of each group (for inspection):
group_means = df.groupby('Group')['Value'].mean()
print(group_means)
# Fill missing values with the mean of the row's group:
df_filled_segmented_mean = df.copy()
df_filled_segmented_mean['Value'] = df.groupby('Group')['Value'].transform(
    lambda x: x.fillna(x.mean())
)
print("DataFrame filled with segmented mean by group:\n", df_filled_segmented_mean)
This code can be adapted for other statistical measures, such as the median, standard deviation, or mode: simply swap the statistical function inside the transform() call.
Interpolating Missing Values
Another option is to use interpolation to fill missing values based on the existing values in each column. The interpolate() method does this automatically. For example:
df = df.interpolate()
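By default the interpolation is linear on numeric columns; a small sketch of what that does:
# Linear interpolation fills the gaps between known points:
s = pd.Series([1.0, None, None, 4.0])
print(s.interpolate())  # 1.0, 2.0, 3.0, 4.0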
Check Missing Values Again After Transformations
After performing missing value treatment steps, check again if there are any remaining null values in the dataframe to ensure that missing values have been properly handled.
# For null values:
df.isnull().sum()
# Or:
df.isna().sum()