How to perform EDA on any Dataset EASY Way

Let's Dive in!

We are going to build the Ml Model for students’ performance Indicator to be beginner-friendly

The life cycle of Machine learning Project

Understanding the Problem Statement
Data Collection
Data Checks to perform
Exploratory data analysis
Data Pre-Processing
Model Training
Choose the best model

So let’s start understanding our problem statement -

This project aims to analyze how student performance, specifically test scores, is influenced by gender, ethnicity, parental level of education, lunch habits, and participation in test preparation courses.

Data Collection

Dataset Source — https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977

The data consists of 8 columns and 1000 rows.

EDA —

A preliminary step before undertaking extra formal statistical analyses or modeling

EDA, or Exploratory Data Analysis, refers to analyzing and analyzing information units to uncover styles, pick out relationships, and gain insights.

Let’s start with creating a .ipynb file

notebooks/
├── EDA.ipynb

PS: you can try your dataset this is just an easy example

/notebook/EDA.ipynb

github.com

to see our dataset’s first 5 rows-

df.head()

Dataset information 📅 ℹ

gender: sex of students -> (Male/female)

race/ethnicity: ethnicity of students -> (Group A, B, C, D, E)
parental level of education : parents’ final education ->(bachelor’s degree,some college,master’s degree,associate’s degree,high school)
lunch: having lunch before the test (standard or free/reduced)
test preparation course: complete or not complete before the test
math score
reading score
writing score

Data Checks to perform ✅

Check Missing values ⭕

df.isna().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

-> There are no missing values in the data set

Check Duplicates 👨🏻‍🤝‍👨🏻

df.duplicated().sum()

-> There are no duplicate values in the data set

Check data type 📝

# Check Null and Dtypes
df.info()

Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)

Check the number of unique values in each column 🌌

df.unique()

gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77

Check the statistics of the data set 📊

df.describe()

math score reading score writing score
count 1000.00000 1000.000000 1000.000000
mean 66.08900 69.169000 68.054000
std 15.16308 14.600192 15.195657
min 0.00000 17.000000 10.000000
25% 57.00000 59.000000 57.750000
50% 66.00000 70.000000 69.000000
75% 77.00000 79.000000 79.000000
max 100.00000 100.000000 100.000000

Insight

From the above description of numerical data, all means are very close to each other — between 66 and 68.05;

All standard deviations are also close — between 14.6 and 15.19;

While there is a minimum score of 0 for math, for writing, the minimum is much higher = 10, and for reading much higher = 17

Check various categories present in the different categorical column 🔢

Categories in 'gender' variable:      ['female' 'male']
Categories in 'race_ethnicity' variable:   ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in'parental level of education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']
Categories in 'lunch' variable:      ['standard' 'free/reduced']
Categories in 'test preparation course' variable:      ['none' 'completed']

Next — Some Transformation 🐛🦋

We have 3 numerical features : [‘math score’, ‘reading score’, ‘writing score’]

We have 5 categorical features : [‘gender’, ‘race/ethnicity’, ‘parental level of education’, ‘lunch’, ‘test preparation course’

Adding columns for “Total Score” and “Average” (feature engineering)

In our data set, we have 3 different scores math score, reading score, writing score

so we can combine them in 2 ways Total and average this way we get 2 models to run on

df['total_marks'] = df['math score']+ df['reading score'] + df['writing score']
df['average_marks'] = df['total_marks']/3
df.head()

df['total_marks'] = df['math score']+ df['reading score'] + df['writing score']
df['average_marks'] = df['total_marks']/3
df.head()

#On comparing the marks of different students in different subjects
Number of students who scored full marks in reading:  17
Number of students who scored full marks in writing:  14
Number of students who scored full marks in maths:  7

Number of students who scored less than 20 in reading:  1
Number of students who scored less than 20 in writing:  1
Number of students who scored less than 20 in maths:  3

Insight

From the above values, we get students have performed the worst in Maths, and the Best performance is in the reading section

Exploring Data ( Visualization ) 📊💡

Visualize the average score distribution to draw some conclusions.

Histogram
Kernel Distribution Function (KDE)

Histogram & KDE — over gender X marks 💯

fig, axs = plt.subplots(1, 2, figsize=(15, 7))
plt.subplot(121)
sns.histplot(data=df,x='average_marks',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='average_marks',kde=True,hue='gender')
plt.show()

Histogram & KDE — over gender X marks 💯

Insight

Female students tend to perform better than male students.

Histogram & KDE — over gender X Lunch 😋

plt.subplots(1,3,figsize=(25,6))
plt.subplot(141)
sns.histplot(data=df,x='average_marks',kde=True,hue='lunch')
plt.subplot(142)
sns.histplot(data=df[df.gender=='female'],x='average_marks',kde=True,hue='lunch')
plt.subplot(143)
sns.histplot(data=df[df.gender=='male'],x='average_marks',kde=True,hue='lunch')
plt.show()

Histogram & KDE — over gender X Lunch 😋

Insights

Standard lunch helps perform well in exams.
Standard lunch helps perform well in exams be it a male or a female.

Histogram & KDE — over gender X parents 📊♀️♂️👨‍👩‍👧‍👦

plt.subplots(1,3,figsize=(25,6))
plt.subplot(141)
ax =sns.histplot(data=df,x='average_marks',kde=True,hue='parental level of education')
plt.subplot(142)
ax =sns.histplot(data=df[df.gender=='male'],x='average_marks',kde=True,hue='parental level of education')
plt.subplot(143)
ax =sns.histplot(data=df[df.gender=='female'],x='average_marks',kde=True,hue='parental level of education')
plt.show()

Histogram & KDE — over gender X parents

Insights

In general parent’s education doesn’t help students perform well in exams.
2nd plot shows that parents whose education is of associate’s degree or master’s degree their male child tend to perform well in the exam
3rd plot we can see there is no effect of parent’s education on female students.

Histogram & KDE — over gender X race 👩🏽‍🦱👨🏾‍🦰🧑🏼‍🦲

plt.subplots(1,3,figsize=(25,6))
plt.subplot(141)
ax =sns.histplot(data=df,x='average_marks',kde=True,hue='race/ethnicity')
plt.subplot(142)
ax =sns.histplot(data=df[df.gender=='female'],x='average_marks',kde=True,hue='race/ethnicity')
plt.subplot(143)
ax =sns.histplot(data=df[df.gender=='male'],x='average_marks',kde=True,hue='race/ethnicity')
plt.show()

Histogram & KDE — over gender X race 👩🏽‍🦱👨🏾‍🦰🧑🏼‍🦲

Insights

Students of group A and group B tend to perform poorly in exams.
Students of group A and group B tend to perform poorly in exams irrespective of whether they are male or female
Maximum score of students in all three subjects

In Extras let’s plot a violin plot to see how students performed in each subject

plt.figure(figsize=(18,8))
plt.subplot(1, 4, 1)
plt.title('MATH SCORES')
sns.violinplot(y='math score',data=df,color='red',linewidth=3)
plt.subplot(1, 4, 2)
plt.title('READING SCORES')
sns.violinplot(y='reading score',data=df,color='green',linewidth=3)
plt.subplot(1, 4, 3)
plt.title('WRITING SCORES')
sns.violinplot(y='writing score',data=df,color='blue',linewidth=3)
plt.show()

violin plot to see how students performed in each subject

Insights

From the above three plots, it's visible that most of the students score between 60–80 in Maths whereas in reading and writing most of them score from 50–80

Exploratory Data Analysis

Univariate, bivariate, and multivariate analysis are types of exploratory data analysis. They are based on the number of variables being analyzed

Univariate, bivariate, and multivariate analysis

Multivariate analysis using pieplot

plt.rcParams['figure.figsize'] = (30, 12)
plt.subplot(1, 5, 1)
size = df['gender'].value_counts()
labels = 'Female', 'Male'
color = ['red','green']

plt.pie(size, colors = color, labels = labels,autopct = '.%2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 2)
size = df['race/ethnicity'].value_counts()
labels = 'Group C', 'Group D','Group B','Group E','Group A'
color = ['red', 'green', 'blue', 'cyan','orange']


plt.pie(size, colors = color,labels = labels,autopct = '.%2f%%')
plt.title('Race/Ethnicity', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 3)
size = df['lunch'].value_counts()
labels = 'Standard', 'Free'
color = ['red','green']

plt.pie(size, colors = color,labels = labels,autopct = '.%2f%%')
plt.title('Lunch', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 4)
size = df['test preparation course'].value_counts()
labels = 'None', 'Completed'
color = ['red','green']

plt.pie(size, colors = color,labels = labels,autopct = '.%2f%%')
plt.title('Test Course', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 5)
size = df['parental level of education'].value_counts()
labels = 'Some College', "Associate's Degree",'High School','Some High School',"Bachelor's Degree","Master's Degree"
color = ['red', 'green', 'blue', 'cyan','orange','grey']

plt.pie(size, colors = color,labels = labels,autopct = '.%2f%%')
plt.title('Parental Education', fontsize = 20)
plt.axis('off')

plt.tight_layout()
plt.grid()

plt.show()

Multivariate analysis using pieplot

Insights

The number of Male and Female students is almost equal
The number of students is greatest in Group C
The number of students who have standard lunch is greater
Number of students who have not enrolled in any test preparation course is greater
The number of students whose parental education is “Some College” is greater followed closely by “Associate’s Degree”

Feature Wise Visualization

GENDER COLUMN

How is the distribution of Gender?
Does gender have any impact on student performance?

UNIVARIATE ANALYSIS ( How is distribution of Gender ? )

f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['gender'],data=df,palette ='bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

plt.pie(x=df['gender'].value_counts(),labels=['Male','Female'],explode=[0,0.1],autopct='%1.1f%%',shadow=True,colors=['#ff4d4d','#ff8000'])
plt.show()

Insights

Gender has balanced data with female students are 518 (48%) and male students are 482 (52%)

BIVARIATE ANALYSIS ( Is gender has any impact on student’s performance ? )

# Identify numeric columns in the DataFrame
numeric_cols = df.select_dtypes(include=['number']).columns


# Group the DataFrame by 'gender' and calculate mean only for numeric columns
gender_group = df.groupby('gender')[numeric_cols].mean()


# Display the result
gender_group

 gender math score reading score writing score total_marks average_marks

female 63.633205 72.608108 72.467181 208.708494 69.569498
male 68.728216 65.473029 63.311203 197.512448 65.837483

Is gender have any impact on student performance?

Insights

On an average females have a better overall score than men.
whereas males have scored higher in Maths.

RACE/ETHNICITY COLUMN

How is Group-wise distribution?
Does race/ethnicity have any impact on student performance?

UNIVARIATE ANALYSIS ( How is Group-wise distribution ?)

f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['race/ethnicity'],data=df,palette = 'bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

plt.pie(x = df['race/ethnicity'].value_counts(),labels=df['race/ethnicity'].value_counts().index,explode=[0.1,0,0,0,0],autopct='%1.1f%%',shadow=True)
plt.show()

UNIVARIATE ANALYSIS ( How is Group-wise distribution ?)

Insights

Most of the students belonged to Group C /Group D.
The lowest number of students belong to a group.

BIVARIATE ANALYSIS (Does race/Ethnicity have any impact on a student’s performance?

Group_data2=df.groupby('race/ethnicity')
f,ax=plt.subplots(1,3,figsize=(20,8))
sns.barplot(x=Group_data2['math score'].mean().index,y=Group_data2['math score'].mean().values,palette = 'mako',ax=ax[0])
ax[0].set_title('Math score',color='#005ce6',size=20)


for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=15)


sns.barplot(x=Group_data2['reading score'].mean().index,y=Group_data2['reading score'].mean().values,palette = 'flare',ax=ax[1])
ax[1].set_title('Reading score',color='#005ce6',size=20)


for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=15)


sns.barplot(x=Group_data2['writing score'].mean().index,y=Group_data2['writing score'].mean().values,palette = 'coolwarm',ax=ax[2])
ax[2].set_title('Writing score',color='#005ce6',size=20)


for container in ax[2].containers:
    ax[2].bar_label(container,color='black',size=15)

Insights

Group E students have scored the highest marks.
Group A students have scored the lowest marks.
Students from a lower Socioeconomic status have a lower avg in all course subjects

Parental Level Of Education Column

What is the educational background of the student’s parent?
Is parental education has any impact on student’s performance ?

UNIVARIATE ANALYSIS ( What is the educational background of the student’s parent ? )

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('fivethirtyeight')
sns.countplot(df['parental level of education'], palette = 'Blues')
plt.title('Comparison of Parental Education', fontweight = 30, fontsize = 20)
plt.xlabel('Degree')
plt.ylabel('count')
plt.show()

What is the educational background of the student’s parent?

Insights

The largest number of parents are from some college.

BIVARIATE ANALYSIS (Does parental education have any impact on student performance ? )

# Select only numeric columns for aggregation
numeric_cols = df.select_dtypes(include=['number'])

# Group by 'parental level of education', calculate mean for numeric columns, and plot
numeric_cols.groupby(df['parental level of education']).mean().plot(kind='barh', figsize=(10, 10))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Insights

The scores of students whose parents possess master's and bachelor's level education are higher than others*.*

LUNCH COLUMN

Which type of lunch is most common among students?
What is the effect of lunch type on test results?

UNIVARIATE ANALYSIS ( Which type of lunch is most common among students ? )

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('seaborn-talk')
sns.countplot(df['lunch'], palette = 'PuBu')
plt.title('Comparison of different types of lunch', fontweight = 30, fontsize = 20)
plt.xlabel('types of lunch')
plt.ylabel('count')
plt.show()

UNIVARIATE ANALYSIS ( Which type of lunch is most common among students ? )

Insights

Students being served Standard lunch was more than free lunch

BIVARIATE ANALYSIS (Does lunch type intake have any impact on student’s performance ? )

f,ax=plt.subplots(1,2,figsize=(20,8))
sns.countplot(x=df['parental level of education'],data=df,palette = 'bright',hue='test preparation course',saturation=0.95,ax=ax[0])
ax[0].set_title('Students vs test preparation course ',color='black',size=25)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

sns.countplot(x=df['parental level of education'],data=df,palette = 'bright',hue='lunch',saturation=0.95,ax=ax[1])
for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=20)

BIVARIATE ANALYSIS (Does lunch type intake have any impact on student’s performance ? )

Insights

Students who get Standard Lunch tend to perform better than students who get free*/*reduced lunch

TEST PREPARATION COURSE COLUMN

Which type of lunch is most common among students?
Is a Test prepration course have any impact on a student’s performance?

BIVARIATE ANALYSIS ( Is Test prepration course have any impact on student’s performance ? )

plt.figure(figsize=(12,6))
plt.subplot(2,2,1)
sns.barplot (x=df['lunch'], y=df['math score'], hue=df['test preparation course'])
plt.subplot(2,2,2)
sns.barplot (x=df['lunch'], y=df['reading score'], hue=df['test preparation course'])
plt.subplot(2,2,3)
sns.barplot (x=df['lunch'], y=df['writing score'], hue=df['test preparation course'])
<Axes: xlabel='lunch', ylabel='writing score'>

Insights

Students who have completed the Test Prepration Course have scores higher in all three categories than those who haven’t taken the course

CHECKING OUTLIERS

plt.subplots(1,4,figsize=(16,5))
plt.subplot(141)
sns.boxplot(df['math score'],color='skyblue')
plt.subplot(142)
sns.boxplot(df['reading score'],color='hotpink')
plt.subplot(143)
sns.boxplot(df['writing score'],color='yellow')
plt.subplot(144)
sns.boxplot(df['average_marks'],color='lightgreen')
plt.show()

CHECKING OUTLIERS

MULTIVARIATE ANALYSIS USING PAIRPLOT

sns.pairplot(df,hue = 'gender')
plt.show()

Insights

From the above plot, it is clear that all the scores increase linearly.

Conclusions

Student’s Performance is related to lunch, race, parental level education
Females lead in pass percentage and also are top-scorers
Student Performance is not much related to the test preparation course
Finishing the preparation course is beneficial.

Blog

How to perform EDA on any Dataset EASY Way

Rahul Saini

The life cycle of Machine learning Project

So let’s start understanding our problem statement -

Data Collection

EDA —

Dataset information 📅 ℹ

gender: sex of students -> (Male/female)

Data Checks to perform ✅

Exploring Data ( Visualization ) 📊💡

Exploratory Data Analysis

Multivariate analysis using pieplot

Feature Wise Visualization

UNIVARIATE ANALYSIS ( How is distribution of Gender ? )

BIVARIATE ANALYSIS ( Is gender has any impact on student’s performance ? )

UNIVARIATE ANALYSIS ( How is Group-wise distribution ?)

BIVARIATE ANALYSIS (Does race/Ethnicity have any impact on a student’s performance?

Parental Level Of Education Column

UNIVARIATE ANALYSIS ( What is the educational background of the student’s parent ? )

BIVARIATE ANALYSIS (Does parental education have any impact on student performance ? )

LUNCH COLUMN

UNIVARIATE ANALYSIS ( Which type of lunch is most common among students ? )

BIVARIATE ANALYSIS (Does lunch type intake have any impact on student’s performance ? )

TEST PREPARATION COURSE COLUMN

BIVARIATE ANALYSIS ( Is Test prepration course have any impact on student’s performance ? )

CHECKING OUTLIERS

MULTIVARIATE ANALYSIS USING PAIRPLOT

Conclusions

Join Our Newsletter. No Spam, Only the good stuff.

Related