The Gemika's Magical Guide to Sorting Hogwarts Students using the Decision Tree Algorithm (Part #4)
gerry leo nugroho
Posted on July 17, 2024
4. Unveiling the Mysteries: Data Exploration (EDA)
Welcome back to the enchanting halls of Hogwarts, dear sorcerers! As we continue our magical journey into the world of data science, it's time to unveil the mysteries hidden within our dataset. In this chapter, we'll embark on a series of explorations that will reveal the secrets of our enchanted scroll (or dataset). Think of this as delving into the depths of the Room of Requirement, where each discovery leads us to greater understanding.
4.1 Inspecting First Few Rows
Our first step is to take a glimpse at the first few rows of our dataset, much like opening the Marauder's Map for the first time. This will give us an initial understanding of the structure and contents of our data. And dear sorcerers, if you wish to follow along on this magical journey, fork the dataset from my GitHub account and download it to your local machine from this magical address.
# Inspecting the first few rows of the dataset
import pandas as pd # pandas provides the DataFrame used throughout this chapter
dataset_path = 'data/hogwarts-students.csv' # Path to our dataset
hogwarts_df = pd.read_csv(dataset_path)
print(hogwarts_df.head())
name gender age origin specialty \
0 Harry Potter Male 11 England Defense Against the Dark Arts
1 Hermione Granger Female 11 England Transfiguration
2 Ron Weasley Male 11 England Chess
3 Draco Malfoy Male 11 England Potions
4 Luna Lovegood Female 11 Ireland Creatures
house blood_status pet wand_type patronus \
0 Gryffindor Half-blood Owl Holly Stag
1 Gryffindor Muggle-born Cat Vine Otter
2 Gryffindor Pure-blood Rat Ash Jack Russell Terrier
3 Slytherin Pure-blood Owl Hawthorn NaN
4 Ravenclaw Half-blood NaN Fir Hare
quidditch_position boggart favorite_class \
0 Seeker Dementor Defense Against the Dark Arts
1 NaN Failure Arithmancy
2 Keeper Spider Charms
3 Seeker Lord Voldemort Potions
4 NaN Her mother Creatures
house_points
0 150.0
1 200.0
2 50.0
3 100.0
4 120.0
As we peer into these rows, we see a variety of features such as student names, house affiliations, and various traits. Each row is a story, each column a chapter. We might notice, for example, that Harry, Hermione, and Ron are all in Gryffindor, characterized by their bravery and determination. This initial inspection helps us understand the scope and scale of our dataset.
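A small aside on the spell above: head() shows five rows by default, but it also accepts a row count, and tail() and sample() do the same for the end of the scroll and for random rows. The miniature DataFrame below is a hypothetical stand-in (the name mini_df is mine, built from the five rows printed above), so the sketch runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for hogwarts_df, built from the rows printed above
mini_df = pd.DataFrame({
    'name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley',
             'Draco Malfoy', 'Luna Lovegood'],
    'house': ['Gryffindor', 'Gryffindor', 'Gryffindor',
              'Slytherin', 'Ravenclaw'],
    'house_points': [150.0, 200.0, 50.0, 100.0, 120.0],
})

first_three = mini_df.head(3)                    # first 3 rows instead of the default 5
last_two = mini_df.tail(2)                       # last 2 rows
random_row = mini_df.sample(1, random_state=42)  # one reproducible random row

print(first_three)
print(last_two)
print(random_row)
```

On the real dataset, the same calls would be made on hogwarts_df instead of mini_df.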
4.2 Checking Dataset Features
Next, we delve deeper into the columns of our DataFrame, much like how Hermione would meticulously study her textbooks. Each column represents a different feature of our students, from their house to their magical abilities.
# Displaying the columns of the dataset
print(hogwarts_df.columns)
Once the spell has finished its wizardry, it reveals the following hidden artifacts.
Index(['name', 'gender', 'age', 'origin', 'specialty', 'house', 'blood_status',
'pet', 'wand_type', 'patronus', 'quidditch_position', 'boggart',
'favorite_class', 'house_points'],
dtype='object')
# Displaying how many rows and columns are in the dataset
print(hogwarts_df.shape)
And you guessed correctly, sorcerers: the dataset consists of 52 rows and 14 columns.
(52, 14)
Let us explore these features, each as significant as a spell component in a well-crafted incantation:
- Name: The given name of our witch or wizard, from the illustrious Harry Potter to the enigmatic Luna Lovegood.
- Gender: Whether they are a young wizard or witch, reflecting the diversity of Hogwarts.
- Age: Their age at the time of sorting, for even the youngest students have their place in the castle's storied history.
- Origin: The place they hail from, be it the rolling hills of England, the rugged highlands of Scotland, or the enchanting isles of Ireland.
- Specialty: Their area of magical expertise, such as Potions, Transfiguration, or Defense Against the Dark Arts, much like Professor Snape's mastery of the subtle art of potion-making.
- House: The revered house to which they belong (Gryffindor, Hufflepuff, Ravenclaw, or Slytherin), each with its own rich traditions and values.
- Blood Status: Whether they are Pure-blood, Half-blood, or Muggle-born, a detail that, while significant in the wizarding world, never diminishes their magical potential.
- Pet: Their chosen magical companion, be it an owl, a cat, or a toad, reminiscent of Harry's loyal Hedwig or Hermione's clever Crookshanks.
- Wand Type: The wood and core of their wand, the very tool of their magical prowess.
- Patronus: The form their Patronus takes, a magical manifestation of their innermost self, like Harry's proud stag or Snape's ethereal doe.
- Quidditch Position: Their role in the beloved wizarding sport, whether Seeker, Chaser, Beater, or Keeper, or perhaps no position at all.
- Boggart: The form their Boggart takes, a glimpse into their deepest fears.
- Favorite Class: The subject they excel in or enjoy the most, akin to Hermione's love for Arithmancy or Neville's talent in Herbology.
- House Points: Points they have contributed to their house, reflecting their achievements and misadventures alike.
With this compendium of magical features, we craft our dataset with the precision of a spell-wright composing a new enchantment. Each character's details are meticulously recorded, ensuring that our data is as rich and detailed as the tapestry of Hogwarts itself.
By examining these features, we gain a deeper understanding of the dataset's richness, much like a wizard learning about the different properties of magical creatures. As we assemble this treasure trove of information, we prepare ourselves for the next step in our magical journey: transforming these attributes into the foundations upon which our Decision Tree algorithm will cast its spell. Let us proceed, dear sorcerers, for the magic is only just beginning.
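One way to examine a feature more closely is to count how many distinct values it holds and how often each one appears. The sketch below uses a hypothetical five-row stand-in (mini_df, built from the rows printed earlier) rather than the full CSV, so treat the numbers as an illustration only.

```python
import pandas as pd

# Hypothetical stand-in for hogwarts_df, built from the first five rows
mini_df = pd.DataFrame({
    'name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley',
             'Draco Malfoy', 'Luna Lovegood'],
    'house': ['Gryffindor', 'Gryffindor', 'Gryffindor',
              'Slytherin', 'Ravenclaw'],
    'blood_status': ['Half-blood', 'Muggle-born', 'Pure-blood',
                     'Pure-blood', 'Half-blood'],
})

unique_counts = mini_df.nunique()               # distinct values per column
house_counts = mini_df['house'].value_counts()  # rows per house, most frequent first

print(unique_counts)
print(house_counts)
```

A column with very few distinct values, such as house, is a natural candidate for a categorical feature, while a column that is unique on every row, such as name, acts more like a label.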
4.3 Inspecting Data Types
With a clear understanding of our features, we now turn our attention to the data types. This step is akin to examining the ingredients of a potion, ensuring each component is appropriate for its intended use.
# Checking the data types of each column
print(hogwarts_df.dtypes)
And in return, dear sorcerers, the previous magic spell yields the following incantation.
name object
gender object
age int64
origin object
specialty object
house object
blood_status object
pet object
wand_type object
patronus object
quidditch_position object
boggart object
favorite_class object
house_points float64
dtype: object
Would you look at that! The data types tell us whether each column contains numerical values, text, or other forms of data. For instance, age should be a numerical type, while name and house are text (or string) types, which pandas reports as object. Here age and house_points are already numeric, but every other column is stored as a generic object, including columns such as gender and house that hold only a handful of repeated labels. Ensuring these types are correct is crucial for our subsequent analyses and visualizations.
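To separate the numerical columns from the rest without reading the list by eye, pandas offers select_dtypes. The following is a minimal sketch on a hypothetical three-row stand-in (mini_df); the same two calls should work on hogwarts_df.

```python
import pandas as pd

# Hypothetical stand-in for hogwarts_df with a mix of text and numeric columns
mini_df = pd.DataFrame({
    'name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    'age': [11, 11, 11],
    'house': ['Gryffindor', 'Gryffindor', 'Gryffindor'],
    'house_points': [150.0, 200.0, 50.0],
})

# Columns pandas treats as numbers (integers and floats)
numeric_columns = mini_df.select_dtypes(include='number').columns.tolist()
# Everything else, which here means the text columns
text_columns = mini_df.select_dtypes(exclude='number').columns.tolist()

print(numeric_columns)
print(text_columns)
```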
4.4 Incorrect Data Type
Occasionally, we may find discrepancies in the data types, much like finding a rogue ingredient in a potion. Correcting these mismatches is essential to ensure the accuracy of our spells (or analyses). So let's just spin our wands (should I say Jupyter Lab), and try to fix them this time.
# Converting data types if necessary
# First, let's check the columns again to identify the correct names
print(hogwarts_df.columns)
Index(['name', 'gender', 'age', 'origin', 'specialty', 'house', 'blood_status',
'pet', 'wand_type', 'patronus', 'quidditch_position', 'boggart',
'favorite_class', 'house_points'],
dtype='object')
One of the requirements for performing magical data sorcery is a clean dataset whose columns are easy to follow and easy to work with. Now, let's change the data types to match the nature of each column, so the dataset is easier to navigate in our upcoming spells.
# Ensure 'age' is numeric; any value that cannot be parsed becomes NaN
hogwarts_df['age'] = pd.to_numeric(hogwarts_df['age'], errors='coerce')
# Columns that hold a limited set of repeated labels are converted to 'category'
categorical_columns = ['gender', 'specialty', 'house', 'blood_status', 'pet',
                       'wand_type', 'quidditch_position', 'favorite_class']
for column in categorical_columns:
    hogwarts_df[column] = hogwarts_df[column].astype('category')
By casting these spells, we ensure that each column is of the appropriate type, ready for further exploration and manipulation. This step is much like Snape meticulously adjusting the ingredients of a complex potion to achieve the perfect brew.
Now let's verify that the previous spell has run its magical course over our dataset by invoking the following spell again.
# Verify the data types after conversion
print(hogwarts_df.dtypes)
name object
gender category
age int64
origin object
specialty category
house category
blood_status category
pet category
wand_type category
patronus object
quidditch_position category
boggart object
favorite_class category
house_points float64
dtype: object
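To see what the two conversion spells actually do, here is a small sketch on made-up values (the messy 'unknown' entry is hypothetical and does not come from the dataset). With errors='coerce', a value that cannot be parsed becomes NaN instead of raising an error, and a category column stores each distinct label once plus an integer code per row.

```python
import pandas as pd

# Hypothetical messy ages: the last entry cannot be parsed as a number
raw_ages = pd.Series(['11', '15', 'unknown'])
clean_ages = pd.to_numeric(raw_ages, errors='coerce')  # 'unknown' becomes NaN

# A category column keeps the distinct labels plus one integer code per row
houses = pd.Series(['Gryffindor', 'Slytherin', 'Gryffindor']).astype('category')

print(clean_ages)
print(houses.cat.categories.tolist())  # the distinct labels
print(houses.cat.codes.tolist())       # the code stored for each row
```

Note that the coerced column becomes float64, because NaN cannot live in a plain int64 column. In our dataset age stayed int64, which suggests that every age parsed cleanly.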
4.5 Spells and Charms to Convert Data Types
In case you, dear sorcerers, are wondering which data types pandas supports, here is a list of the most common ones and the spells used to convert a column to each of them.
Data Type | Description | Example Values | Conversion Method |
---|---|---|---|
int64 | Integer values | 1, 2, 3, -5, 0 | pd.to_numeric(df['column']) |
float64 | Floating point numbers | 1.0, 2.5, -3.4, 0.0 | pd.to_numeric(df['column']) |
bool | Boolean values | True, False | df['column'].astype('bool') |
object | String values | 'apple', 'banana', '123' | df['column'].astype('str') |
datetime64[ns] | Date and time values | '2024-07-17', '2023-01-01 12:00' | pd.to_datetime(df['column']) |
timedelta64[ns] | Differences between datetimes | '1 days 00:00:00', '2 days 03:04:05' | pd.to_timedelta(df['column']) |
category | Categorical data | 'A', 'B', 'C' | df['column'].astype('category') |
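The conversion spells in the table can be tried out on a tiny practice scroll. The DataFrame and its column names below are hypothetical, chosen only to give each conversion something to work on. This is a sketch, not part of the Hogwarts dataset.

```python
import pandas as pd

# Hypothetical practice data: every column starts out as text or plain integers
df = pd.DataFrame({
    'quantity': ['1', '2', '3'],
    'ratio': ['1.0', '2.5', '-3.4'],
    'flag': [1, 0, 1],
    'when': ['2024-07-17', '2023-01-01', '2022-12-31'],
    'elapsed': ['1 days 00:00:00', '2 days 03:04:05', '0 days 00:30:00'],
    'grade': ['A', 'B', 'A'],
})

df['quantity'] = pd.to_numeric(df['quantity'])  # whole numbers become integers
df['ratio'] = pd.to_numeric(df['ratio'])        # decimals become floats
df['flag'] = df['flag'].astype('bool')          # 1/0 become True/False
df['when'] = pd.to_datetime(df['when'])         # text becomes datetimes
df['elapsed'] = pd.to_timedelta(df['elapsed'])  # text becomes durations
df['grade'] = df['grade'].astype('category')    # repeated labels become categories

print(df.dtypes)
```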
4.6 Reinvestigate The Data Type in The Dataset
Having ensured the correctness of our data types, it's time to take a more comprehensive look at our dataset. This step is akin to casting a revealing charm over a hidden room, allowing us to see everything at once.
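# Displaying a summary of the columns, their non-null counts and their data types
# (info() prints its report itself and returns None, so print() adds a trailing "None")
print(hogwarts_df.info())
By previewing the whole dataset, we gain a holistic view of its structure: the number of rows, each column's data type, how many non-null values it holds, and the memory it uses. This overview helps us identify any remaining inconsistencies or areas that require further attention, such as columns with missing values, much like a careful sweep of the castle grounds to ensure everything is in order. The results are as follows.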
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 52 non-null object
1 gender 52 non-null category
2 age 52 non-null int64
3 origin 52 non-null object
4 specialty 52 non-null category
5 house 52 non-null category
6 blood_status 52 non-null category
7 pet 27 non-null category
8 wand_type 52 non-null category
9 patronus 50 non-null object
10 quidditch_position 10 non-null category
11 boggart 52 non-null object
12 favorite_class 51 non-null category
13 house_points 50 non-null float64
dtypes: category(8), float64(1), int64(1), object(4)
memory usage: 6.8+ KB
None
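The Non-Null Count column above hints at missing values: pet, for instance, is filled in for only 27 of the 52 students. A more direct way to count the gaps is isna(). The sketch below uses a hypothetical five-row stand-in (mini_df) built from the first rows of the dataset, so its counts differ from those of the full scroll.

```python
import pandas as pd

# Hypothetical stand-in built from the first five rows; None marks a missing value
mini_df = pd.DataFrame({
    'name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley',
             'Draco Malfoy', 'Luna Lovegood'],
    'pet': ['Owl', 'Cat', 'Rat', 'Owl', None],
    'patronus': ['Stag', 'Otter', 'Jack Russell Terrier', None, 'Hare'],
    'quidditch_position': ['Seeker', None, 'Keeper', 'Seeker', None],
})

missing_counts = mini_df.isna().sum()  # number of missing values per column
missing_share = mini_df.isna().mean()  # fraction of rows that are missing

print(missing_counts)
print(missing_share)
```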
4.7 Detailed Summary of Dataset
And here's the interesting part. Where the previous spell gave us a high-level overview, this one gives us summary statistics for every column in the dataset. It's a bit statistical for sure, but fear not, dear sorcerers: as you scroll forward, you'll notice a couple of other stunning facts about the Hogwarts students.
print(hogwarts_df.describe(include='all')) # Providing a detailed summary of the dataset
name gender age origin specialty house \
count 52 52 52.000000 52 52 52
unique 52 2 NaN 9 24 6
top Harry Potter Male NaN England Charms Gryffindor
freq 1 27 NaN 35 7 18
mean NaN NaN 14.942308 NaN NaN NaN
std NaN NaN 2.492447 NaN NaN NaN
min NaN NaN 11.000000 NaN NaN NaN
25% NaN NaN 13.250000 NaN NaN NaN
50% NaN NaN 16.000000 NaN NaN NaN
75% NaN NaN 17.000000 NaN NaN NaN
max NaN NaN 18.000000 NaN NaN NaN
blood_status pet wand_type patronus quidditch_position boggart \
count 52 27 52 50 10 52
unique 4 9 28 15 5 11
top Half-blood Owl Fir Non-corporeal Seeker Failure
freq 25 11 4 34 5 40
mean NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN
favorite_class house_points
count 51 50.000000
unique 21 NaN
top Charms NaN
freq 8 NaN
mean NaN 119.200000
std NaN 54.129097
min NaN 10.000000
25% NaN 72.500000
50% NaN 125.000000
75% NaN 160.000000
max NaN 200.000000
From the summary, we can infer several interesting points:
- House Distribution: Gryffindor has the highest count with 18 students, showing its prominence.
- Age: The average age of students is around 14.94 years, with the youngest being 11 and the oldest 18.
- Gender: The dataset includes 27 males and 25 females, showing a fairly balanced gender distribution.
- Blood Status: Half-bloods are the most common, with 25 occurrences, indicating a diverse student body.
- Wands and Pets: There are 28 unique wand types and 9 different pet types, reflecting the unique personalities and preferences of the students.
- Quidditch: Only a few students play Quidditch, with Seeker being the most common position.
- Favorite Class: Charms is the most favored class among students, with 8 mentions.
- House Points: The average house points are 119.2, with a standard deviation of 54.13, indicating a wide range of performance.
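Observations like these can also be conjured one at a time rather than read off the large summary table. The sketch below works on a hypothetical five-row stand-in (mini_df), so its numbers are for illustration and will not match the full-dataset figures listed above.

```python
import pandas as pd

# Hypothetical stand-in built from the first five rows of the dataset
mini_df = pd.DataFrame({
    'house': ['Gryffindor', 'Gryffindor', 'Gryffindor',
              'Slytherin', 'Ravenclaw'],
    'house_points': [150.0, 200.0, 50.0, 100.0, 120.0],
})

house_counts = mini_df['house'].value_counts()                        # students per house
average_points = mini_df['house_points'].mean()                       # overall average
points_by_house = mini_df.groupby('house')['house_points'].mean()     # average per house

print(house_counts)
print(average_points)
print(points_by_house)
```

Grouping by house before averaging is a design choice worth noting: an overall mean hides the differences between houses that our Decision Tree will later try to pick up.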
4.8 Preview the whole Dataset
For curious minds whose thoughts fly as fast as their broomsticks, here's the magic spell to display every value in the dataset.
# Displaying every row and column of the dataset
print(hogwarts_df.to_string())
name gender age origin specialty house blood_status pet wand_type patronus quidditch_position boggart favorite_class house_points
0 Harry Potter Male 11 England Defense Against the Dark Arts Gryffindor Half-blood Owl Holly Stag Seeker Dementor Defense Against the Dark Arts 150.0
1 Hermione Granger Female 11 England Transfiguration Gryffindor Muggle-born Cat Vine Otter NaN Failure Arithmancy 200.0
2 Ron Weasley Male 11 England Chess Gryffindor Pure-blood Rat Ash Jack Russell Terrier Keeper Spider Charms 50.0
3 Draco Malfoy Male 11 England Potions Slytherin Pure-blood Owl Hawthorn NaN Seeker Lord Voldemort Potions 100.0
4 Luna Lovegood Female 11 Ireland Creatures Ravenclaw Half-blood NaN Fir Hare NaN Her mother Creatures 120.0
5 Neville Longbottom Male 11 England Herbology Gryffindor Pure-blood Toad Cherry Non-corporeal NaN Severus Snape Herbology 70.0
6 Ginny Weasley Female 11 England Defense Against the Dark Arts Gryffindor Pure-blood Owl Yew Horse Chaser Tom Riddle Defense Against the Dark Arts 140.0
7 Cedric Diggory Male 15 England Quidditch Hufflepuff Pure-blood NaN Ash Non-corporeal Seeker Failure Defense Against the Dark Arts 160.0
8 Cho Chang Female 14 Scotland Charms Ravenclaw Half-blood Owl Hazel Swan Seeker Failure Charms 110.0
9 Severus Snape Male 16 England Potions Slytherin Half-blood NaN Elm Doe NaN Lily Potter Potions 90.0
10 Albus Dumbledore Male 17 England Transfiguration Gryffindor Half-blood Phoenix Elder Phoenix NaN Ariana's death Transfiguration 200.0
11 Minerva McGonagall Female 16 Scotland Transfiguration Gryffindor Half-blood Cat Fir Cat NaN Failure Transfiguration 190.0
12 Bellatrix Lestrange Female 15 England Dark Arts Slytherin Pure-blood NaN Walnut NaN Azkaban Dueling 80 NaN
13 Nymphadora Tonks Female 14 Wales Metamorphmagus Hufflepuff Half-blood Owl Blackthorn Wolf NaN Failure Defense Against the Dark Arts 130.0
14 Remus Lupin Male 16 England Defense Against the Dark Arts Gryffindor Half-blood Dog Cypress Non-corporeal NaN Full Moon Defense Against the Dark Arts 150.0
15 Sirius Black Male 16 England Transfiguration Gryffindor Pure-blood Owl Chestnut Dog Beater Full Moon Defense Against the Dark Arts 140.0
16 Horace Slughorn Male 16 England Potions Slytherin Half-blood NaN Cedar Non-corporeal NaN Failure Potions 100.0
17 Filius Flitwick Male 17 England Charms Ravenclaw Half-blood NaN Hornbeam Non-corporeal NaN Failure Charms 180.0
18 Pomona Sprout Female 16 England Herbology Hufflepuff Pure-blood Cat Pine Non-corporeal NaN Failure Herbology 170.0
19 Helena Ravenclaw Female 17 Scotland Charms Ravenclaw Pure-blood NaN Rowan Non-corporeal NaN Her mother Charms 160.0
20 Godric Gryffindor Male 17 England Dueling Gryffindor Pure-blood NaN Sword Lion NaN Failure Dueling 200.0
21 Helga Hufflepuff Female 17 Wales Herbology Hufflepuff Pure-blood NaN Cedar Non-corporeal NaN Failure Herbology 190.0
22 Rowena Ravenclaw Female 17 Scotland Charms Ravenclaw Pure-blood NaN Maple Eagle NaN Failure Charms 180.0
23 Salazar Slytherin Male 17 England Dark Arts Slytherin Pure-blood NaN Ebony Serpent NaN Failure Dark Arts 200.0
24 Molly Weasley Female 16 England Household Charms Gryffindor Pure-blood Owl Pine Non-corporeal NaN Failure Household Charms 80.0
25 Arthur Weasley Male 16 England Muggle Artifacts Gryffindor Pure-blood NaN Hornbeam Non-corporeal NaN Failure Muggle Studies 60.0
26 Lucius Malfoy Male 16 England Dark Arts Slytherin Pure-blood Owl Elm Non-corporeal NaN Failure Dark Arts 90.0
27 Narcissa Malfoy Female 15 England Potions Slytherin Pure-blood NaN Hawthorn Non-corporeal NaN Failure Potions 70.0
28 Pansy Parkinson Female 11 England Gossip Slytherin Pure-blood Cat Birch Non-corporeal NaN Failure Gossip 40.0
29 Vincent Crabbe Male 11 England Strength Slytherin Pure-blood NaN Oak Non-corporeal NaN Failure Strength 50.0
30 Gregory Goyle Male 11 England Strength Slytherin Pure-blood NaN Alder Non-corporeal NaN Failure Strength 50.0
31 Lily Evans Female 11 England Charms Gryffindor Muggle-born NaN Willow Doe NaN Failure Charms 150.0
32 James Potter Male 11 England Dueling Gryffindor Pure-blood Owl Walnut Stag Chaser Failure Dueling 160.0
33 Peter Pettigrew Male 11 England Transformation Gryffindor Half-blood Rat Ash Non-corporeal NaN Failure Transformation 30.0
34 Gilderoy Lockhart Male 15 England Memory Charms Ravenclaw Half-blood NaN Cherry Non-corporeal NaN Failure Memory Charms 70.0
35 Dolores Umbridge Female 15 England Dark Arts Slytherin Half-blood Cat Hemlock Non-corporeal NaN Failure Dark Arts 60.0
36 Newt Scamander Male 17 England Magical Creatures Hufflepuff Half-blood Demiguise Chestnut Non-corporeal NaN Failure Creatures 160.0
37 Tina Goldstein Female 17 USA Auror Hufflepuff Half-blood Owl Ash Non-corporeal NaN Failure Defense Against the Dark Arts 140.0
38 Queenie Goldstein Female 17 USA Legilimency Ravenclaw Half-blood Owl Cypress Non-corporeal NaN Failure Legilimency 130.0
39 Jacob Kowalski Male 17 USA Baking Hufflepuff No-mag NaN Birch Non-corporeal NaN Failure Baking 10.0
40 Theseus Scamander Male 17 England Auror Gryffindor Half-blood Dog Elder Non-corporeal NaN Failure Defense Against the Dark Arts 150.0
41 Leta Lestrange Female 16 England Potions Slytherin Pure-blood Cat Ebony Non-corporeal NaN Failure Potions 100.0
42 Nagini Female 18 Indonesia Transformation Slytherin Half-blood Snake Teak Non-corporeal NaN Failure Transformation 90.0
43 Grindelwald Male 18 Europe Dark Arts Slytherin Pure-blood NaN Elder Non-corporeal NaN Failure Dark Arts 200.0
44 Bathilda Bagshot Female 17 England History of Magic Ravenclaw Half-blood Cat Willow Non-corporeal NaN Failure NaN NaN
45 Aberforth Dumbledore Male 17 England Goat Charming Gryffindor Half-blood Goat Oak Non-corporeal NaN Failure Goat Charming 70.0
46 Ariana Dumbledore Female 14 England Obscurus Gryffindor Half-blood NaN Fir Non-corporeal NaN Failure Obscurus 20.0
47 Victor Krum Male 17 Bulgaria Quidditch Durmstrang Pure-blood NaN Hawthorn Non-corporeal Seeker Failure Quidditch 180.0
48 Fleur Delacour Female 17 France Charms Beauxbatons Half-blood NaN Rosewood Non-corporeal NaN Failure Charms 140.0
49 Gabrielle Delacour Female 14 France Charms Beauxbatons Half-blood NaN Alder Non-corporeal NaN Failure Charms 80.0
50 Olympe Maxime Female 17 France Strength Beauxbatons Half-blood NaN Fir Non-corporeal NaN Failure Strength 110.0
51 Igor Karkaroff Male 18 Europe Dark Arts Durmstrang Half-blood NaN Yew Non-corporeal NaN Failure Dark Arts 90.0
Now that we've converted the data types in the dataset, it's time to save it, so that it's ready for our next set of adventures.
hogwarts_df.to_csv('data/hogwarts-students-01.csv')
Now, dear sorcerers, once you've invoked the previous spell, you may check that your data directory contains a new CSV file named hogwarts-students-01.csv, which we will use throughout the rest of our magical journey.
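A word of caution for when the saved scroll is opened again: a CSV file stores only text, so the category types we assigned are not remembered, and to_csv also writes the row index as an extra first column by default. The sketch below illustrates both points with an in-memory buffer and a hypothetical two-row DataFrame instead of the real file. Passing index_col and dtype to read_csv is one possible way to restore things, not necessarily how later parts of this series load the data.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in with a categorical column
df = pd.DataFrame({
    'name': ['Harry Potter', 'Draco Malfoy'],
    'house': ['Gryffindor', 'Slytherin'],
})
df['house'] = df['house'].astype('category')

buffer = io.StringIO()
df.to_csv(buffer)  # default settings: the index is written as the first column

buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)  # 'house' comes back as plain text

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0,
                       dtype={'house': 'category'})  # category type restored

print(plain.dtypes)
print(restored.dtypes)
```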
4.9 Gemika's Pop-Up Quiz: Unveiling the Mysteries
And now, young wizards and witches, my son Gemika Haziq Nugroho appears with a sparkle in his eye and a quiz at the ready. Are you prepared to test your newfound knowledge and prove your prowess in data exploration?
- What function do we use to display the first few rows of a DataFrame?
- Why is it important to check the data types of each column in our dataset?
- How can we convert a column to a numeric type if it's not already?
Answer these questions with confidence, and you will demonstrate your mastery of the initial steps in data exploration. With our dataset now fully understood and prepared, we are ready to dive even deeper into its mysteries. Onward, to greater discoveries!
By now, you should feel like a true data wizard, ready to uncover the hidden patterns and secrets within any dataset. Let us continue our journey with confidence and curiosity, for there is much more to discover in the magical world of data science!