Build a decision tree in R

Yu Fen Lin

Posted on December 20, 2019


Overview & Purpose

In this article, we will build a decision tree model on the Titanic data set that predicts whether a given passenger survived or not.

Steps:

  1. Initial data understanding and preparation
  2. Build, train, and test the model
  3. Evaluate the performance of the model

1. Understanding the data set

We will use the Titanic Passenger Survival Data Set. This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner "Titanic", summarized according to economic status (class), sex, age, and survival. Below is a brief description of the 12 variables in the data set:

  • PassengerId: <int> Serial number
  • Survived: <int> Contains binary values of 0 and 1
    • 0: Passenger did not survive
    • 1: Passenger survived
  • Pclass: <int> Ticket class - 1st, 2nd, or 3rd class ticket
  • Name: <chr> Name of the passenger
  • Sex: <chr> Male or female
  • Age: <dbl> Age in years
  • SibSp: <int> No. of siblings / spouses aboard - brothers, sisters, and/or husband/wife
  • Parch: <int> No. of parents / children aboard - mother/father and/or daughter/son
  • Ticket: <chr> Ticket number
  • Fare: <dbl> Passenger fare
  • Cabin: <chr> Cabin number
  • Embarked: <chr> Port of embarkation
    • C: Cherbourg
    • Q: Queenstown
    • S: Southampton

Load necessary data

Remove all objects from the Global Environment and load the Titanic data.

rm(list = ls())

# install the titanic package (only needed once)
install.packages("titanic")

# load necessary packages
library(tidyverse)
library(titanic)

# load necessary data
titanic <- titanic_train

Take a look.

titanic %>%
  View(title = "Titanic")
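
If you prefer a console overview to the spreadsheet-style viewer, glimpse() from dplyr (or base str()) prints each column with its type, which matches the <int>/<chr>/<dbl> labels above. A minimal sketch:

# compact overview of every column and its type
glimpse(titanic)   # dplyr, loaded with tidyverse
# str(titanic)     # base R alternative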

Produce summaries of the data.
summary() is an important function that helps summarize each attribute in the data set.

> summary(titanic)
  PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     

     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     
    Ticket               Fare           Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33   

There are two empty strings ("") in Embarked. Drop those rows.

> titanic$Embarked[grepl("^\\s*$", titanic$Embarked)] 
[1] "" ""
> titanic <- droplevels(titanic[!grepl("^\\s*$", titanic$Embarked),,drop=FALSE])
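
A quick check that the blank values are gone:

# all remaining values should be C, Q, or S
table(titanic$Embarked)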

There are also 177 NAs in Age. Use the rounded mean age to fill them in.

> summary(titanic$Age) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.42   20.00   28.00   29.64   38.00   80.00     177 

> titanic$Age[is.na(titanic$Age)] <- 
  round(mean(titanic$Age, na.rm = TRUE))
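
A quick sanity check that no missing ages remain:

# should return 0 after the imputation
sum(is.na(titanic$Age))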

Set categorical variables. Variables can be classified as categorical or quantitative.

  • Categorical variables take on values that are names or labels, e.g., Embarked in our data set.
  • Quantitative variables are numerical; they represent a measurable quantity, e.g., Age in our data set.

titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass   <- as.factor(titanic$Pclass)
titanic$Embarked <- as.factor(titanic$Embarked)
# encode Sex as a numeric factor: 1 = male, 0 = female
titanic$Sex_num <- as.factor(if_else(titanic$Sex == "male", 1, 0))
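
To confirm the conversions took effect, str() shows which columns are now factors:

# Survived, Pclass, Embarked, and Sex_num should now be factors
str(titanic[, c("Survived", "Pclass", "Embarked", "Sex_num")])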

Okay, the data is now ready to use.

2. Build, train, and test the model

Choose the variables we would like to work with: Survived, Pclass, Age, SibSp, Parch, Fare, Sex_num, and Embarked.

df <- 
  titanic %>% 
  select(Survived, Pclass, Age, SibSp, Parch, Fare, Sex_num, Embarked)

Check the target variable, Survived. Good: there is not a huge class imbalance.

> df %>%count(Survived)
# A tibble: 2 x 2
  Survived     n
  <fct>    <int>
1 0          549
2 1          340
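
If you want the balance as proportions rather than counts, one way (a quick sketch with dplyr) is:

# share of passengers in each class of the target variable
df %>%
  count(Survived) %>%
  mutate(prop = n / sum(n))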

Check the distribution and correlation between variables.

# pairs.panels() from the psych package plots scatter plots, histograms,
# and pairwise correlations for every pair of variables
library(psych)
pairs.panels(df,
             ellipses = FALSE,
             pch = 19,
             hist.col = "blue")

Distribution and correlation between variables

Split the data into training and test sets, with 75% used for training.

library(caret) 
set.seed(2019)
trainIndex <- createDataPartition(df$Survived, p=0.75, list = FALSE)
train <- df[trainIndex,]
test <- df[-trainIndex,]
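
createDataPartition() samples within each level of Survived, so the class balance should be preserved in both sets. A quick check:

# class proportions should be roughly the same in train and test
prop.table(table(train$Survived))
prop.table(table(test$Survived))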

Build the decision tree model.

library(rpart)
tree <- rpart(Survived ~ ., data = train, method = 'class')
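
Before plotting, printcp() (also from rpart) shows which variables were actually used in splits and the cross-validated error at each tree size, which is handy if you later want to prune:

# variables used in splitting and cross-validated error per tree size
printcp(tree)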

What does the decision tree look like?

library(rpart.plot)
prp(tree,
    faclen = 0,            # show full factor level names in the splits
    fallen.leaves = TRUE,  # put all leaves at the bottom of the plot
    shadow.col = "gray")

Decision Tree

Another, fancier way to take a look at the decision tree.

library(rpart.plot)
rpart.plot(tree)

rpart

3. Evaluate the performance of the model

Use the test data to evaluate the performance of the model.

X_test <-
  test %>%
  select(Pclass, Age, SibSp, Parch, Fare, Sex_num, Embarked)
pred <- predict(tree, newdata = X_test, type = "class")

Calculate the confusion matrix and plot it.

confus.matrix <- table(real=test$Survived, predict=pred)
fourfoldplot(confus.matrix, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "Confusion Matrix")

Confusion matrix

The accuracy of the model:

> sum(diag(confus.matrix))/sum(confus.matrix)
[1] 0.8333333
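
Accuracy alone can hide how the model does on each class. Since caret is already loaded, one option is its confusionMatrix(), which reports accuracy along with sensitivity, specificity, and kappa in a single call:

# detailed performance metrics from caret
confusionMatrix(data = pred, reference = test$Survived)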

Hope you found this article helpful.
