The Gemika's Magical Guide to Sorting Hogwarts Students using the Decision Tree Algorithm (Part #5)

5. Unveiling the Mysteries: Data Exploration (EDA) - Missing Values 🔍

Welcome back, young wizards and witches! As we gather around the glowing embers of the Gryffindor common room fire, it’s time to delve deeper into the magical world of data exploration. Our enchanted dataset holds many secrets, and today we shall reveal them through the art of Data Exploration and Analysis. This is where we ensure our data is as clean and pure as a unicorn's tears, visualize its hidden wonders, and uncover patterns that lie beneath the surface. 🌟✨

5.1 Identifying Missing Values

Every great spell requires the right ingredients, and so too does our dataset. Missing values are like gaps in a spellbook: left unfilled, they can lead to disastrous results. First, we must identify these missing values and decide how to handle them. Shall we fill them with the most common value? Or perhaps the mean? Let's find out. And if you missed the previous part on how to obtain a copy of the ancient scroll (the dataset), you will find the magical spell in the previous story.

# Importing necessary libraries
import pandas as pd
import numpy as np

# Loading the dataset
dataset_path = 'data/hogwarts-students-01.csv'  # Path to our dataset
hogwarts_df = pd.read_csv(dataset_path)

# Checking for missing values
print(hogwarts_df.isnull().sum())

Unnamed: 0             0
name                   0
gender                 0
age                    0
origin                 0
specialty              0
house                  0
blood_status           0
pet                   25
wand_type              0
patronus               2
quidditch_position    42
boggart                0
favorite_class         1
house_points           2
dtype: int64

You might be wondering what this spell is trying to tell us. Well, imagine we have a big book filled with information about all the Hogwarts students, just like a yearbook. This code is like a special spell to check if any pages in our book have missing information.

The first part (like saying the magic words) is where we call upon our special helpers, Pandas and NumPy. They're like clever witches and wizards who know how to read and understand the information in our book. ✨
The next part (like pointing the wand at the book) tells them to open a specific page in our book, the one labeled "data/hogwarts-students-01.csv". This page has all the details about the students, like their names, houses, and pets.
The last part (like casting a revealing charm) is the spell that checks for missing information. It says, "Show me where any pages in the book are empty or haven't been filled in yet!"

The output we get (like the spell revealing a secret message) is a list that tells us how many empty pages there are for each piece of information about the students. For example, it says there are 25 empty pages for "pet" information, which means 25 students forgot to write down their pets in the yearbook!
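Raw counts are easier to judge next to the size of the book. As a minimal sketch, the spell below reports the share of missing values per column. The tiny `sample_df` is made up for illustration and only stands in for `hogwarts_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-yearbook standing in for hogwarts_df
sample_df = pd.DataFrame({
    'name': ['Harry', 'Hermione', 'Ron', 'Luna'],
    'pet': ['Owl', 'Cat', np.nan, np.nan],
    'house_points': [150.0, 200.0, np.nan, 120.0],
})

# isnull() marks each empty cell as True; mean() turns that into a fraction
missing_share = sample_df.isnull().mean() * 100

# Keep only the columns that actually have gaps, largest share first
missing_report = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_report.round(1))
```

A column with only a few percent missing is usually safe to fill, while a column that is mostly empty may deserve a closer look before any guessing begins.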

So, this spell helps us see if any information is missing from our book about the Hogwarts students, just like checking if someone forgot to fill out their part of the yearbook! Once we fill those gaps in the next step, our dataset will be cleansed, and our magical models can perform accurately, without the fear of hidden voids causing mischief.

5.2 Filling The Empty Values

Remember our big book of Hogwarts students, like a magical yearbook? Well, it seems some of the pages got a bit messy! Just like forgetting to write down your favorite class or leaving your pet owl unnamed, there are some empty spaces in our book.

We'll learn some clever tricks to guess what might be missing based on what we already know. Imagine if someone forgot to write their favorite class, but we know they're always in the library – maybe they love Potions or Charms? By using a bit of magic (well, some special code!), we can make educated guesses to fill in the blanks.
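One way to make such an educated guess is to look at similar students first. The sketch below is an assumption about how that could be done, not the method used in the rest of this guide: it fills a missing favorite class with the most common class within the same house. The `students` DataFrame and the helper `fill_with_group_mode` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical students; two of them left their favorite class blank
students = pd.DataFrame({
    'house': ['Gryffindor', 'Gryffindor', 'Gryffindor',
              'Ravenclaw', 'Ravenclaw', 'Ravenclaw'],
    'favorite_class': ['Defense', 'Defense', np.nan,
                       'Charms', np.nan, 'Charms'],
})

def fill_with_group_mode(values: pd.Series) -> pd.Series:
    """Fill gaps with the most common value in the group, if there is one."""
    modes = values.mode()
    if modes.empty:
        return values  # the whole group is missing, nothing to guess from
    return values.fillna(modes[0])

# transform() applies the helper house by house and keeps the original row order
students['favorite_class'] = (
    students.groupby('house')['favorite_class'].transform(fill_with_group_mode)
)
print(students)
```

The sections that follow use the simpler, column-wide mean and mode instead, which is often good enough when only a handful of values are missing.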

5.2.1 Filling The Empty Numerical Values

# Filling missing numerical values with the mean values
hogwarts_df['age'] = hogwarts_df['age'].fillna(hogwarts_df['age'].mean())
hogwarts_df['house_points'] = hogwarts_df['house_points'].fillna(hogwarts_df['house_points'].mean())

The .mean() method in pandas calculates the arithmetic mean (average) of the values in a Series or a DataFrame column, skipping missing values by default.

hogwarts_df['age'] = hogwarts_df['age'].fillna(hogwarts_df['age'].mean())

Here, .mean() calculates the mean of the age column in the hogwarts_df DataFrame, and .fillna() uses that value to fill any missing entries in the column.

Here's what the code does:

1. hogwarts_df['age'].mean() calculates the mean of the age column.
2. hogwarts_df['age'].fillna() returns a copy of the age column in which any missing values are replaced with the mean calculated in step 1.
3. Assigning the result back to hogwarts_df['age'] stores the filled column in the DataFrame. (Calling .fillna(..., inplace=True) on a selected column raises a warning in recent pandas versions and may not update the DataFrame at all.)

In essence, this code fills any missing values in the age column with the mean of the existing values in that column. This is a common technique to handle missing values in numerical columns. According to our earlier check, age has no gaps, so its line leaves the column unchanged, while house_points has two gaps to fill.
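The mean is easily pulled around by extreme values, so the median is a common alternative. A minimal sketch with made-up house points (the `points` Series is hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical house points with one extreme value and one gap
points = pd.Series([10.0, 20.0, 30.0, 1000.0, np.nan], name='house_points')

# The mean is dragged upward by the 1000-point outlier
mean_filled = points.fillna(points.mean())

# The median sits in the middle of the sorted values and ignores the outlier's size
median_filled = points.fillna(points.median())

print("Filled with mean:  ", mean_filled.iloc[-1])    # 265.0
print("Filled with median:", median_filled.iloc[-1])  # 25.0
```

Which one suits a column depends on its distribution; for columns without strong outliers the two usually land close together.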

5.2.2 Filling The Empty Categorical Values

# Filling missing categorical values with the mode
hogwarts_df['house'] = hogwarts_df['house'].fillna(hogwarts_df['house'].mode()[0])
hogwarts_df['gender'] = hogwarts_df['gender'].fillna(hogwarts_df['gender'].mode()[0])
hogwarts_df['specialty'] = hogwarts_df['specialty'].fillna(hogwarts_df['specialty'].mode()[0])
hogwarts_df['blood_status'] = hogwarts_df['blood_status'].fillna(hogwarts_df['blood_status'].mode()[0])
hogwarts_df['pet'] = hogwarts_df['pet'].fillna(hogwarts_df['pet'].mode()[0])
hogwarts_df['wand_type'] = hogwarts_df['wand_type'].fillna(hogwarts_df['wand_type'].mode()[0])
hogwarts_df['patronus'] = hogwarts_df['patronus'].fillna(hogwarts_df['patronus'].mode()[0])
hogwarts_df['quidditch_position'] = hogwarts_df['quidditch_position'].fillna(hogwarts_df['quidditch_position'].mode()[0])
hogwarts_df['favorite_class'] = hogwarts_df['favorite_class'].fillna(hogwarts_df['favorite_class'].mode()[0])

This next part of our magical adventure is like fixing those forgetful pages. We'll use special spells to fill in the missing information in the features that hold categorical data, making sure our book is complete and ready for anything! 🪄

hogwarts_df['house'] = hogwarts_df['house'].fillna(hogwarts_df['house'].mode()[0])

The .mode() method finds the most common value in the house column of the hogwarts_df DataFrame. This value is then used to fill any missing values in the house column using the .fillna() method.

Here's what the code does:

1. hogwarts_df['house'].mode() finds the most common value in the house column.
2. [0] accesses the first element of the Series returned by .mode(), which is the most common value. If several values tie, .mode() returns all of them in sorted order and [0] picks the first.
3. hogwarts_df['house'].fillna() returns the house column with any missing values replaced by that most common value, and the assignment stores it back in the DataFrame.

In essence, this code fills any missing values in the house column with the most common value found in that column. Now that we've filled the empty values in our dataset, let's run the missing-value check again. If we see zeroes for all of the columns, we're good to go.
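When many columns receive the same treatment, the fill can also be written as a loop over the column names. A minimal sketch, using a small made-up `demo_df` in place of `hogwarts_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for hogwarts_df with a gap in each column
demo_df = pd.DataFrame({
    'house': ['Gryffindor', 'Slytherin', 'Gryffindor', np.nan],
    'pet': ['Owl', np.nan, 'Owl', 'Toad'],
    'patronus': ['Stag', 'Otter', np.nan, 'Otter'],
})

categorical_columns = ['house', 'pet', 'patronus']

for column in categorical_columns:
    # mode() returns a Series of the most frequent values; [0] takes the first
    most_common = demo_df[column].mode()[0]
    # fillna() returns a new Series, so assign it back to the column
    demo_df[column] = demo_df[column].fillna(most_common)

print(demo_df)
print(demo_df.isnull().sum())
```

Keeping the column names in one list makes it easy to add or drop a column later without copying a whole line of code.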

5.3 Verifying All Missing Values Are Handled

# Verifying that all missing values are handled
print(hogwarts_df.isnull().sum())

Unnamed: 0            0
name                  0
gender                0
age                   0
origin                0
specialty             0
house                 0
blood_status          0
pet                   0
wand_type             0
patronus              0
quidditch_position    0
boggart               0
favorite_class        0
house_points          0
dtype: int64

Great job, sorcerers! You've accomplished one remarkable spell: no empty cells were detected in the dataset this time. Now you might be wondering how to look through the whole dataset conveniently. Fear not, my dear sorcerers, as I'll guide you with the following magical spell.

import pandas as pd
from IPython.display import display, HTML

# Create a scrollable pane for the DataFrame
scrollable_df = HTML('<div style="height: 400px; overflow-y: scroll;">' + hogwarts_df.to_html() + '</div>')

# Display the scrollable DataFrame
display(scrollable_df)

5.3.1 Editing Inline Values

We've filled the empty cells, but a problem still persists: one cell holds a value of the wrong data type. It sits at row index 12 and column index 13 (both counted from zero), which is the favorite_class column. Let's fix it with the following spells.

# Importing necessary libraries
import pandas as pd

# Display the original value at the specified location for verification
# (column position 13, counted from zero, is 'favorite_class')
original_value = hogwarts_df.iloc[12, 13]
print(f"Original value at row 12, column 13: {original_value}")

# Determine the most common value in the 'favorite_class' column
common_value = hogwarts_df['favorite_class'].mode()[0]
print(f"Most common value in 'favorite_class': {common_value}")

# Replace the incorrect value with the most common value
hogwarts_df.at[12, 'favorite_class'] = common_value

# Display the updated value at the specified location for verification
updated_value = hogwarts_df.iloc[12, 13]
print(f"Updated value at row 12, column 13: {updated_value}")

# Save the changes in place
hogwarts_df.to_csv(dataset_path, index=False)

Running these spells returns the following results:

Original value at row 12, column 13: 80
Most common value in 'favorite_class': Charms
Updated value at row 12, column 13: Charms
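Spotting such a stray value by eye only works for small scrolls. As a sketch of one possible approach (an assumption, not the method used above), `pd.to_numeric` with `errors='coerce'` can flag entries in a text column that look like numbers. The `classes` Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical favorite_class column with one stray numeric entry
classes = pd.Series(['Charms', 'Potions', '80', 'Charms', 'Herbology'],
                    name='favorite_class')

# Real class names cannot be parsed and become NaN; number-like strays survive
looks_numeric = pd.to_numeric(classes, errors='coerce').notna()
print(classes[looks_numeric])

# Take the most common value among the legitimate entries only
most_common = classes[~looks_numeric].mode()[0]

# mask() replaces the flagged entries and leaves everything else alone
cleaned = classes.mask(looks_numeric, most_common)
print(cleaned)
```

The same mask also tells you which row labels were affected, which is handy when you want to review the strays before overwriting them.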

Once we've settled the previous issues, it's always a good idea to save a new copy of the dataset. Passing index=False keeps pandas from writing the row index as an extra column, so the cell values are stored exactly as they appear in the DataFrame.

hogwarts_df.to_csv('data/hogwarts-students-02.csv', index=False)
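To see what `index=False` changes, the sketch below writes a tiny made-up DataFrame to an in-memory buffer both ways and reads it back. `io.StringIO` is used here only to avoid touching real files; the `demo_df` contents are hypothetical.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({'name': ['Harry', 'Luna'],
                        'house': ['Gryffindor', 'Ravenclaw']})

# Default behavior: the row index is written as an unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)     # ['Unnamed: 0', 'name', 'house']
print(columns_without_index)  # ['name', 'house']
```

This is also where a column named "Unnamed: 0" typically comes from: a CSV that was saved with its index and then read back without `index_col`.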

5.4 How to Handle Missing Values for Other Data Types

Here is a table listing common ways to fill missing values for the different data types that pandas can handle, along with explanations and examples for each. Forward and backward fill are shown with the ffill() and bfill() methods; the older fillna(method=...) spelling has been deprecated since pandas 2.1.

Data Type	Methods to Fill Missing Values	Explanation	Example
Numeric	1. `fillna(value)`	Replaces NaN with a specified value.	`df['column'].fillna(0)`
	2. `bfill()`	Backward fill: replaces NaN with the next valid value that follows it.	`df['column'].bfill()`
	3. `ffill()`	Forward fill: replaces NaN with the last valid value that precedes it.	`df['column'].ffill()`
	4. `fillna(method='pad')`	Older alias for forward fill (works for any data type, not only datetime); deprecated, prefer `ffill()`.	`df['column'].fillna(method='pad')`
	5. `fillna(method='backfill')`	Older alias for backward fill (works for any data type, not only datetime); deprecated, prefer `bfill()`.	`df['column'].fillna(method='backfill')`
Categorical	1. `fillna(value)`	Replaces NaN with a specified value.	`df['column'].fillna("Unknown")`
	2. `fillna(mode)`	Replaces NaN with the most common value.	`df['column'].fillna(df['column'].mode()[0])`
	3. `bfill()`	Backward fill: replaces NaN with the next valid value that follows it.	`df['column'].bfill()`
	4. `ffill()`	Forward fill: replaces NaN with the last valid value that precedes it.	`df['column'].ffill()`
	5. `cat.add_categories(value).fillna(value)`	For columns of category dtype: adds a new category, then fills NaN with it.	`df['column'].cat.add_categories("New Category").fillna("New Category")`
String	1. `fillna(value)`	Replaces NaN with a specified value.	`df['column'].fillna("Unknown")`
	2. `ffill()`	Forward fill: replaces NaN with the last valid value that precedes it.	`df['column'].ffill()`
	3. `bfill()`	Backward fill: replaces NaN with the next valid value that follows it.	`df['column'].bfill()`
Datetime	1. `fillna(value)`	Replaces NaT with a specified timestamp.	`df['column'].fillna(pd.Timestamp("1900-01-01"))`
	2. `ffill()`	Forward fill: replaces NaT with the last valid value that precedes it.	`df['column'].ffill()`
	3. `bfill()`	Backward fill: replaces NaT with the next valid value that follows it.	`df['column'].bfill()`
	4. `fillna(method='pad')`	Older alias for forward fill; deprecated, prefer `ffill()`.	`df['column'].fillna(method='pad')`
	5. `fillna(method='backfill')`	Older alias for backward fill; deprecated, prefer `bfill()`.	`df['column'].fillna(method='backfill')`
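Forward and backward fill are easiest to tell apart on a tiny example. A minimal sketch using the `ffill()` and `bfill()` Series methods (equivalent to the `method='ffill'` and `method='bfill'` spellings) on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical marks with gaps in the middle and at the end
marks = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill carries the last seen value forward: 1, 1, 1, 4, 4
forward = marks.ffill()

# Backward fill pulls the next value backward: 1, 4, 4, 4, NaN
# (the final gap stays empty because nothing follows it)
backward = marks.bfill()

print(pd.DataFrame({'original': marks, 'ffill': forward, 'bfill': backward}))
```

Both methods depend on row order, so they make the most sense for ordered data such as time series, and a gap at the very start (for forward fill) or the very end (for backward fill) is left untouched.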

5.4.1 Example for Numeric Data

import numpy as np
import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.nan, np.nan],
        'B': [np.nan, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]}
df = pd.DataFrame(data)

# Fill missing values with 0
df['A'] = df['A'].fillna(0)
print(df['A'])

# Fill missing values with the next valid value that follows them (backward fill)
df['B'] = df['B'].bfill()
print(df['B'])

5.4.2 Example for Categorical Data

import numpy as np
import pandas as pd

# Create a sample DataFrame
data = {'Color': ['Red', 'Blue', 'Red', 'Red', 'Green', np.nan, 'Green', np.nan]}
df = pd.DataFrame(data)

# Fill missing values with the most common value
# (printed without assigning, so df keeps its NaN values for the next step)
print(df['Color'].fillna(df['Color'].mode()[0]))

# Convert to the category dtype, add an "Unknown" category, then fill NaN with it
df['Color'] = df['Color'].astype('category').cat.add_categories("Unknown").fillna("Unknown")
print(df['Color'])

5.4.3 Example for String Data

import numpy as np
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike', np.nan, 'Emily']}
df = pd.DataFrame(data)

# Fill missing values with a specified value
# (printed without assigning, so df keeps its NaN value for the next step)
print(df['Name'].fillna("Unknown"))

# Fill missing values with the last valid value that precedes them (forward fill)
df['Name'] = df['Name'].ffill()
print(df['Name'])

5.4.4 Example for Datetime Data

import pandas as pd

# Create a sample DataFrame (pd.NaT marks a missing timestamp)
data = {'Date': [pd.Timestamp('2020-01-01'), pd.Timestamp('2020-01-02'), pd.Timestamp('2020-01-03'), pd.Timestamp('2020-01-04'), pd.Timestamp('2020-01-05'), pd.NaT, pd.Timestamp('2020-01-07')]}
df = pd.DataFrame(data)

# Fill missing values with a specified value
# (printed without assigning, so df keeps its NaT value for the next step)
print(df['Date'].fillna(pd.Timestamp('1900-01-01')))

# Fill missing values with the last valid value that precedes them (forward fill)
df['Date'] = df['Date'].ffill()
print(df['Date'])

These examples demonstrate how to handle missing values for different data types in pandas. The methods used are flexible and can be adjusted based on the specific requirements of your dataset.
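As one way to combine these ideas, the sketch below picks a fill strategy from each column's data type. The helper `fill_by_dtype` is hypothetical and the choice of median for numbers and mode for everything else is only an assumption; adjust it to suit your own scroll.

```python
import numpy as np
import pandas as pd

def fill_by_dtype(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if not filled[column].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            modes = filled[column].mode()
            if not modes.empty:  # an all-empty column has no mode to use
                filled[column] = filled[column].fillna(modes[0])
    return filled

# Hypothetical sample with one numeric and one text column
sample = pd.DataFrame({
    'age': [11.0, 12.0, np.nan, 13.0],
    'pet': ['Owl', np.nan, 'Owl', 'Cat'],
})

result = fill_by_dtype(sample)
print(result)
```

Because the helper works on a copy, the original DataFrame stays untouched and the two can be compared side by side.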

5.5 Gemika's Pop-Up Quiz: Unveiling the Mysteries

And now, dear sorcerers, my son Gemika Haziq Nugroho appears with a twinkle in his eye and a quiz in hand. Are you ready to test your knowledge and prove your mastery of data exploration?

How do you handle missing values in a dataset?
What is the purpose of a pivot table in data analysis?
How do you fill missing values with a specified value?

Answer these questions with confidence, and you will demonstrate your prowess in the art of data exploration. With our dataset now fully explored and understood, we are ready to embark on the next phase of our magical journey. Onward to our next magical journey, to deeper discoveries and greater insights of data mastery! 🌟✨🧙‍♂️
