The AI Alpha Geek: It starts with EDA! - Part B

joyadauche

Joy Ada Uche

Posted on September 30, 2020

Before exploring each feature individually, let's look at some summary statistics for the dataset, produced by train_df.drop('PassengerId', axis=1).describe() below:

[Image: Summary Stats]

In the summary statistics above, looking at the Age feature for example:

  • the count is 714, which tells us there are 177 missing entries, since there are 891 entries in total - we will need to deal with these later on when handling missing values,
  • the mean age is 29.699, which is the average age of the passengers whose age was recorded, i.e. a typical passenger was around 30 years old,
  • the std (standard deviation) of 14.526 measures how spread out the ages are around the mean. If the ages were roughly normally distributed, about two-thirds of the passengers would fall in the range (29.699-14.526) to (29.699+14.526), i.e. roughly 15 to 44 years,
  • the min age is 0.42, which tells us the youngest passenger was a baby of about five months,
  • the 25th percentile of 20.125 years shows that 25% of the passengers are younger than 20.125 years,
  • the 50th percentile, which is the median, is 28 years, which tells us that half of the passengers onboard are 28 or younger - so the passengers skewed young,
  • the 75th percentile, which is 38, tells us that 75% of the passengers are 38 or younger, and
  • the max age is 80 years, which is the age of the oldest passenger onboard - luckily, it seems there are no aliens onboard.
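These readings can be checked directly against the data. Here is a minimal sketch of that check, using a small made-up `ages` Series as a stand-in for `train_df['Age']` (the values are assumptions for illustration, not Titanic data). It computes the missing-entry count and the share of values within one standard deviation of the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df['Age']; the values are made up for illustration.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0, np.nan, 14.0])

# describe() returns count, mean, std, min, 25%, 50%, 75% and max.
stats = ages.describe()

# describe() counts only non-null values, so the gap to len() is the number missing.
n_missing = len(ages) - int(stats["count"])

# Share of the non-missing ages that lie within one standard deviation of the mean.
lower = stats["mean"] - stats["std"]
upper = stats["mean"] + stats["std"]
share_within_one_std = ages.between(lower, upper).sum() / stats["count"]

print(stats)
print(f"missing entries: {n_missing}")
print(f"within one std of the mean: {share_within_one_std:.0%}")
```

Running the same lines on the real `train_df['Age']` column would show how close the "mean plus or minus one std" range comes to covering most passengers, rather than assuming it.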

Now, it's time for some univariate analysis - this is simply a descriptive analysis of one variable at a time, which helps us understand how that variable is distributed and even detect outliers. Let's start with the categorical variables -

From the code example above, let's take a look at the output for the target variable, Survived, below -
[Image: Output Example]

  • value_counts() is used to get the count of each unique value in this column - and it shows that noticeably more people did not survive. Note that the dataset is not perfectly balanced, but it is not severely imbalanced either: those who didn't survive do not vastly outnumber those who survived.
  • to get the percentage of each class (i.e. 1 for survived and 0 for deceased), set the normalize parameter of value_counts() to True.
  • to get a better view of the count for each class, we use Seaborn's count plot. label_chart() is just a helper function to label the chart.
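The two value_counts() calls described above can be sketched as follows. This is a minimal illustration that assumes a small made-up `survived` Series in place of `train_df['Survived']`; it is not the article's original script:

```python
import pandas as pd

# Hypothetical stand-in for train_df['Survived'] (0 = deceased, 1 = survived).
survived = pd.Series([0, 1, 1, 0, 0, 0, 1, 0, 0, 1], name="Survived")

# Absolute number of rows in each class.
counts = survived.value_counts()

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
percentages = survived.value_counts(normalize=True) * 100

# Put both views side by side, aligned on the class label.
summary = pd.DataFrame({"count": counts, "percent": percentages.round(2)})
print(summary)
```

With Seaborn installed, `sns.countplot(x='Survived', data=train_df)` draws the same counts as bars. The article's label_chart() helper is not reproduced here, since its implementation is not shown.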

Here are some insights gathered from the output of eda_part_b.py above -

  • For the Pclass feature, far more of the people on board were in class 3. From Part A of this series, we saw that these are people in the lower socio-economic class, which suggests that most passengers bought the cheapest tickets,
  • More males than females boarded, as the Sex feature shows that 64.76% of passengers are male,
  • Most passengers boarded at the Southampton port, and it seems most passengers travelled alone, since most have 0 siblings or spouses aboard (SibSp) and 0 parents or children aboard (Parch). Note that a child who travelled with just a nanny also has a Parch of 0.
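Insights like these come from repeating the same value_counts() step for every categorical feature. Below is a minimal sketch that assumes a tiny made-up `train_df` (the rows are invented for illustration) and treats SibSp and Parch as categorical, which is an assumption made here:

```python
import pandas as pd

# Hypothetical miniature of train_df; the rows are made up for illustration.
train_df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 3, 2, 3],
    "Sex":      ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S"],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 0],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1],
})

categorical_features = ["Pclass", "Sex", "Embarked", "SibSp", "Parch"]

# For each feature, record the most frequent value and the percentage of rows it covers.
top_values = {}
for feature in categorical_features:
    # value_counts() sorts in descending order, so the first entry is the most common.
    shares = train_df[feature].value_counts(normalize=True) * 100
    top_values[feature] = (shares.index[0], round(float(shares.iloc[0]), 2))
    print(f"{feature}: most common = {shares.index[0]} ({shares.iloc[0]:.2f}%)")
```

Looping over a list of column names keeps the analysis uniform across features, so adding another categorical column only means adding its name to `categorical_features`.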

So, all of this gives us more insights to explore further. Stay tuned for the next parts of this series, where we go ahead and explore individual numerical variables for patterns. Wish you an awesome October!
