How to compare 2 datasets with pandas-profiling 🐼
DCAI Community
Posted on December 20, 2022
Visualization is the cornerstone of Exploratory Data Analysis
When facing a new, unknown dataset, visual inspection allows us to get a feel of the available information, draw some patterns regarding the data, and diagnose several issues that we might need to address.
pandas-profiling has long been the indispensable Swiss Army knife in every data scientist's tool belt. However, one thing seemed to be missing: the ability to compare different reports side by side, which would help us continuously assess the transformations performed during EDA!
Side-by-side comparison: the wait is over!
pandas-profiling now supports a "side-by-side" comparison feature that lets us automate the comparison process with a single line of code.
In this blog post, I'll bring you up to speed with this new functionality and show you how we can use it to produce faster and smarter transformations on our data.
I'll be using the HCC Dataset, which I personally collected during my MSc. For this particular use case, I've artificially introduced some additional data quality issues to show you how visualisation can help us detect them and guide us towards their efficient mitigation.
The full code and examples can be found in this GitHub repository so you can follow along with the tutorial.
pandas-profiling: EDA at your fingertips
We'll start by profiling the HCC dataset and investigating the data quality issues suggested in the report:
pip install pandas-profiling==3.5.0
import pandas as pd
from pandas_profiling import ProfileReport
# Read the HCC Dataset
df = pd.read_csv("hcc.csv")
# Produce the data profiling report
original_report = ProfileReport(df, title='Original Data')
original_report.to_file("original_report.html")
According to the "Alerts" overview, there are four main types of potential issues that need to be addressed:
- Duplicates: 4 duplicate rows in the data;
- Constant: Constant value "999" in 'O2';
- High Correlation: Several features marked as highly correlated;
- Missing: Missing values in 'Ferritin'.
The validity of each potential problem (as well as the need to find a mitigation strategy for it) depends on the specific use case and domain knowledge. In our case, with the exception of the "high correlation" alerts, which would require further investigation, the remaining alerts seem to reflect true data quality issues and can be tackled using a few practical solutions. Let's see how!
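Before diving into fixes, we can sanity-check these alerts directly on the dataframe with plain pandas. This is a quick sketch using the column names flagged in the report; the high-correlation alerts are easier to inspect visually in the report itself:
# Sanity-check the report's alerts directly on the dataframe
print(df.duplicated().sum())        # number of fully duplicated rows
print(df['O2'].unique())            # should expose the constant '999' value
print(df['Ferritin'].isna().sum())  # number of missing 'Ferritin' values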
Removing Duplicate Rows
Depending on the nature of the domain, there might be records that have the same values without it being an error. However, considering that some of the features in this dataset are quite specific and refer to an individual's biological measurements (e.g., "Hemoglobin", "MCV", "Albumin"), it's unlikely that several patients report the same exact values for all features. Let's start by dropping these duplicates from the data:
# Drop duplicate rows
df_transformed = df.copy()
df_transformed = df_transformed.drop_duplicates()
Removing Irrelevant Features
The constant values in O2 also reflect a true inconsistency in the data. There are two likely reasons for such an error: either the O2 values were measured and stored automatically in the database and the pulse oximeter failed, or the person taking the measurements kept encountering repeated error readings and simply coded them as "999", which is an absurd value (O2 saturation ranges from 0% to 100%). In either case, these values are erroneous and should be removed from the analysis:
# Remove O2
df_transformed = df_transformed.drop(columns='O2')
Missing Data Imputation
As frequently happens with medical data, the HCC dataset also seems extremely susceptible to missing data. A simple way to address this issue (while avoiding removing incomplete records or entire features) is data imputation. We'll use mean imputation to fill in the absent observations, as it is the simplest and most common statistical imputation technique and often serves as a baseline method:
# Impute Missing Values
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(strategy="mean")
df_transformed['Ferritin'] = mean_imputer.fit_transform(df_transformed['Ferritin'].values.reshape(-1,1))
Side-by-side comparison: faster and smarter iterations on your data
Now for the fun part! After implementing the first batch of transformations to our dataset, we're ready to assess their impact on the overall quality of our data.
This is where the pandas-profiling comparison functionality comes in handy: the comparison between the original and the transformed data can now be performed automatically through the .compare() method of ProfileReport:
transformed_report = ProfileReport(df_transformed, title="Transformed Data")
comparison_report = original_report.compare(transformed_report)
comparison_report.to_file("original_vs_transformed.html")
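Depending on the version you have installed, pandas-profiling may also expose a top-level compare function that produces the same report from a list of profiles. This is an assumed alternative; if the import fails in your environment, the .compare() method above is all you need:
# Alternative (version permitting): the top-level compare function
from pandas_profiling import compare
comparison_report = compare([original_report, transformed_report])
comparison_report.to_file("original_vs_transformed.html")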
How did these transformations impact the quality of our data? And what would we find by further investigating each of the transformations performed? Let's dive deeper into the comparison results!
Dataset Overview
The comparison report shows both datasets ("Original Data" and "Transformed Data") and distinguishes their properties by colour: blue for the original data and red for the transformed data, in both titles and graph plots.
As shown in the report, the transformed dataset contains one less categorical feature ("O2" was removed), 165 observations (versus the original 171 containing duplicates) and no missing values (in contrast with the 79 missing observations in the original dataset).
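These overview figures are easy to confirm programmatically, as a quick check against the numbers quoted above:
# Confirm the overview statistics from the comparison report
print(df.shape, df_transformed.shape)     # 171 rows originally vs 165 after deduplication, one fewer column
print(df['Ferritin'].isna().sum())        # missing values in the original data
print(df_transformed.isna().sum().sum())  # 0 missing values after imputation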
Duplicate Records
In contrast to the original data, there are no duplicate patient records in the transformed data: our complete and accurate case base can move on to the modeling pipeline, avoiding the overfitting that duplicate records can introduce.
Irrelevant Features
Features that have not been subjected to any transformation remain the same (as shown below for "Encephalopathy"): original and transformed data summary statistics do not change. In turn, removed features are only presented for the original data (shown in blue), as is the case of "O2".
Missing Values
In contrast to the original data, there are no missing observations after the data imputation. Note how both the nullity count and the nullity matrix show the differences between the two versions of the data: in the transformed data, "Ferritin" now has 165 complete values, and no blanks can be found in the nullity matrix.
A deeper investigation on data properties
If we were to compare all features before and after the data transformations, we would find an insightful detail concerning the missing data imputation.
When analysing the "Ferritin" values in more detail, we'd see how imputing values with the mean has distorted the original data distribution, which is undesirable.
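We can quantify this distortion with a couple of summary statistics: mean imputation leaves the mean unchanged but shrinks the standard deviation and piles extra mass at a single point. A minimal check, using the dataframes defined earlier:
# Compare 'Ferritin' before and after mean imputation
print(df['Ferritin'].describe())              # original values (missing excluded)
print(df_transformed['Ferritin'].describe())  # same mean, visibly smaller std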
This artefact is also visible in the interactions and correlations visualisations, where artificial interaction patterns and inflated correlation values emerge in the relationship between "Ferritin" and the remaining features.
This goes to show that the comparison report is not only useful for highlighting the differences introduced by data transformations, but it also provides several visual cues that lead us towards important insights regarding those transformations: in this case, a more specialised data imputation strategy should be considered.
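As one example of such a strategy (a sketch, not part of the original tutorial), we could replace the mean imputer with scikit-learn's KNNImputer, which fills each missing value from the most similar complete records and therefore preserves more of the original distribution. The sketch assumes we restart from the deduplicated data, before the mean imputation, and impute using the numeric columns only:
from sklearn.impute import KNNImputer
# Restart from the data with duplicates and 'O2' removed, but before mean imputation
df_knn = df.drop_duplicates().drop(columns='O2')
numeric_cols = df_knn.select_dtypes(include='number').columns
# Fill each missing value from the 5 most similar patients
knn_imputer = KNNImputer(n_neighbors=5)
df_knn[numeric_cols] = knn_imputer.fit_transform(df_knn[numeric_cols])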
Conclusion
Throughout this small use case, I've covered the usefulness of comparing two sets of data within the same profiling report to highlight the data transformations performed during EDA and evaluate their impact on data quality. Nevertheless, the applications of this functionality are endless, as the need to (re)iterate on feature assessment and visual inspection is vital for data-centric solutions!
Made with ❤️ by the Data-Centric AI Community
Thank you for reading! If you enjoyed this tutorial and plan to use pandas-profiling in your data quests, please ⭐ our repository and join the discussion on our Discord server!