Loan prediction

mage_ai

Mage

Posted on April 5, 2022

TLDR

In this article, we’ll go over a standard supervised classification task: predicting whether a loan should be approved or not.

Outline

  1. Introduction
  2. Before we begin
  3. How to code
  4. Data cleaning
  5. Data visualization
  6. Feature engineering
  7. Model training
  8. Conclusion

Introduction

The “Dream Housing Finance” company deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. Customers first apply for a home loan, and the company then validates their eligibility. The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling out online application forms. These details are “Gender”, “Married”, “Education”, “Dependents”, “Income”, “Loan_Amount”, “Credit_History” and others. To automate the process, they have posed the problem of identifying the customer segments that are eligible for a loan, so that they can specifically target these customers.

man getting loans but his application is rejected (Source: Vectorstock.com)

Before we begin

Let’s get familiarized with the dataset.

Loan.csv — It consists of dataset attributes for a loan with the below-mentioned description.

Description of columns in dataset

The different variables present in the dataset are:

  1. Numerical features: Applicant_Income, Coapplicant_Income, Loan_Amount, Loan_Amount_Term and Dependents.
  2. Categorical features: Gender, Credit_History, Self_Employed, Married and Loan_Status.
  3. Alphanumeric Features: Loan_Id.
  4. Text Features: Education and Property_Area.

As mentioned above we need to predict our target variable which is “Loan_Status”. “Loan_Status” can have two values.

Y (Yes): If the loan is approved.

N (No): If the loan is not approved.

So using the training dataset we’ll train our model and predict our target column “Loan_Status”.

Like all other guides, we’ll be using Google Colab and start by importing those datasets into dataframes.

How to code

The company will approve loans for applicants who have a good “Credit_History” and who are likely to be able to repay. To model this, we’ll load the dataset “Loan.csv” into a dataframe, display the first five rows and check its shape to ensure we have enough data to work with.

first 5 rows of dataset
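The loading step can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: a tiny inline CSV stands in for “Loan.csv” so the snippet runs anywhere, and the column subset and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for Loan.csv (made-up rows, subset of columns).
# With the real file you would call pd.read_csv("Loan.csv") instead.
sample_csv = io.StringIO(
    "Loan_Id,Gender,Married,Applicant_Income,Loan_Amount,Credit_History,Loan_Status\n"
    "LP001,Male,No,5849,,1,Y\n"
    "LP002,Male,Yes,4583,128,1,N\n"
    "LP003,Female,Yes,3000,66,1,Y\n"
)
df = pd.read_csv(sample_csv)

print(df.head())  # first five rows (only three exist in this toy sample)
print(df.shape)   # (number of rows, number of columns)
```

On the real dataset, the same two calls are what produce the preview and the row and column counts.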

There are “614” rows and “13” columns, which is enough data to build our model. The input attributes are a mix of numerical and categorical features, which we’ll analyze to predict our target variable “Loan_Status”. Let’s look at the summary statistics of the numerical variables using the “describe()” function.

use describe() to get basic info of column data like minimum value, count, mean, standard deviation

From the “describe()” output we see that the counts for “LoanAmount”, “Loan_Amount_Term” and “Credit_History” are below the expected total of “614”, which means these variables have missing values. We’ll have to pre-process the data to handle them.
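To see why a low “count” signals missing data, here is a small sketch on a made-up dataframe (the values are illustrative, not taken from “Loan.csv”). The “count” row of “describe()” only counts non-null entries.

```python
import numpy as np
import pandas as pd

# Toy numeric columns with a few deliberately missing values
df = pd.DataFrame({
    "Applicant_Income": [5849, 4583, 3000, 2583],
    "Loan_Amount": [np.nan, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 1.0],
})

summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary)

# Columns whose count is below the number of rows contain missing values
print(summary.loc["count"] < len(df))
```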

Data Cleaning

Data cleaning is a process to identify and correct errors in the dataset that may negatively impact our predictive model. We’ll find the “null” values of every column as an initial step to data cleaning.

count of all null values

We observe that there are “13” missing values in “Gender”, “3” in “Married”, “15” in “Dependents”, “32” in “Self_Employed”, “22” in “Loan_Amount”, “14” in “Loan_Amount_Term” and “50” in “Credit_History”.
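Counting nulls per column is typically done with “isnull().sum()”. The sketch below uses a made-up dataframe, so its counts differ from the ones reported above.

```python
import numpy as np
import pandas as pd

# Toy data with missing entries in a categorical and a numerical column
df = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "Loan_Amount": [np.nan, 128.0, np.nan],
    "Credit_History": [1.0, 1.0, 0.0],
})

# isnull() marks each missing cell as True; sum() adds them up per column
null_counts = df.isnull().sum()
print(null_counts)
```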

The missing values of the numerical and categorical features are “missing at random (MAR)” i.e. the data is not missing in all the observations but only within sub-samples of the data.

So the missing values of the numerical features should be filled with the “mean”, and those of the categorical features with the “mode”, i.e. the most frequently occurring value. We use the Pandas “fillna()” function for imputing the missing values: the “mean” gives us an estimate of central tendency (although it can be pulled by extreme values), while the “mode” is not affected by extreme values; both provide a neutral fill-in. For more information on imputing data refer to our guide on estimating lost data.

code snippet showing math
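A minimal sketch of this imputation, assuming one numerical and one categorical column (the toy values and the two column lists are illustrative; the real dataset has more columns in each group):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "Loan_Amount": [100.0, np.nan, 200.0, 150.0],
})

numeric_cols = ["Loan_Amount"]   # assumed grouping for this sketch
categorical_cols = ["Gender"]

# Numerical features: fill with the column mean
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

# Categorical features: fill with the most frequent value.
# mode() returns a Series (there can be ties), so take its first entry.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```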

Let’s check the “null” values again to ensure that none remain, since missing values would lead us to incorrect results.

show that there's no null values in all columns now

From the above output, we see that there’re no null values and now we can perform the data visualization.

Data Visualization

To gain a few insights about the data we visualize the categorical data before training the model.

Categorical Data: Categorical data is a type of data used to group information with similar characteristics, represented by discrete labelled groups, e.g. gender, blood type, country affiliation. You can read the blogs on categorical data for a better understanding of datatypes.

Categorical Columns

Now let’s visualize the numerical features.

Numerical Data: Numerical data expresses information in the form of numbers, e.g. height, weight, age. If you are unfamiliar, please read the blogs on numerical data.

Numerical Columns

Feature Engineering

To create a new attribute named “Total_Income”, we’ll add the two columns “Coapplicant_Income” and “Applicant_Income”, as we assume that the co-applicant is a person from the same family, e.g. a spouse or father. We’ll then display the first five rows with “Total_Income”. To learn more about column creation with conditions refer to our lesson adding column with conditions.

screenshots of code

display code

“Total_Income” is the last column added to our dataframe as above.
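The column creation amounts to an element-wise sum of two columns. A sketch on made-up income values:

```python
import pandas as pd

df = pd.DataFrame({
    "Applicant_Income": [5849, 4583, 3000],
    "Coapplicant_Income": [0.0, 1508.0, 0.0],
})

# Element-wise sum; the new column is appended as the last column
df["Total_Income"] = df["Applicant_Income"] + df["Coapplicant_Income"]
print(df.head())
```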

display a density curve

We see that most values fall in the range “0-10,000”, with a few extreme values far above it, so the data is right-skewed (a long tail toward high incomes); it is possible that some people applied for high loans due to specific needs. To learn more about skewed data and uniform distribution of data kindly refer to our lesson on graphical analysis. So, we’ll apply a log transformation to “Total_Income” to bring its distribution closer to normal.

Below is the graph for “Total_Income_Log”.

total income log
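As a rough sketch, the log transformation can be applied with “np.log”. The income values below are made up, with one deliberately extreme entry, and the snippet assumes every “Total_Income” value is strictly positive (the log of zero is undefined).

```python
import numpy as np
import pandas as pd

# Toy incomes with one extreme value to mimic a long right tail
df = pd.DataFrame({"Total_Income": [3000.0, 6091.0, 5849.0, 81000.0]})

# Natural log compresses large values much more than small ones
df["Total_Income_Log"] = np.log(df["Total_Income"])

raw_skew = df["Total_Income"].skew()
log_skew = df["Total_Income_Log"].skew()
print(f"skew before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

The skewness after the transformation is smaller, which is the effect the density plot is meant to show.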

Data Cleaning

As a part of the data cleaning process, let’s drop the unnecessary columns which don’t affect the “Loan_Status” variable. This can help improve the accuracy of the model. We’ll then display the first five rows of the dataframe.

code dropping columns
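Dropping columns can be sketched as below. Which columns to drop is an assumption here: the identifier “Loan_Id” and the two raw income columns that “Total_Income_Log” replaces are used purely for illustration, and the notebook's actual list may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "Loan_Id": ["LP001", "LP002"],
    "Applicant_Income": [5849, 4583],
    "Coapplicant_Income": [0.0, 1508.0],
    "Total_Income_Log": [8.674, 8.715],
    "Loan_Status": ["Y", "N"],
})

# Hypothetical choice of columns to remove (see the note above)
cols_to_drop = ["Loan_Id", "Applicant_Income", "Coapplicant_Income"]
df = df.drop(columns=cols_to_drop)
print(df.head())
```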

Category Value Mapping

By using “label encoding” we’ll convert the categorical features to numerical features and display the first five rows of the dataframe.

code label encoding
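One common way to do label encoding is scikit-learn's “LabelEncoder”, sketched below on made-up rows; whether the notebook uses this class or a manual mapping is an assumption. The encoder assigns integers in sorted order of the labels, so “N” becomes 0 and “Y” becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Fit a fresh encoder per column so each column gets its own 0..k-1 codes
for col in ["Gender", "Property_Area", "Loan_Status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df.head())
```

Note that integer codes imply an ordering (Rural < Semiurban < Urban here), which a linear model will treat as meaningful; one-hot encoding is an alternative when that is undesirable.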

Balancing Imbalanced Data

Before we start training the model we have to balance the imbalanced data. Imbalanced data are datasets where the number of observations is not the same for all the classes in a classification dataset. You can refer to our guide to learn more about a balanced dataset.

In our dataset, the target variable “Loan_Status” is imbalanced, which may result in biased output. So, we’ll balance the data by “undersampling” it. “Undersampling” is a technique that randomly selects examples from the majority class and deletes them from the training dataset.

Loan status histogram

The data imbalance for “Loan_Status” is seen in the above graph, with “68%” representing “1 (Yes)” and “31%” representing “0 (No)”. We performed “undersampling” on the majority class, “1 (Yes)”, by randomly deleting samples from the training data. The aim is to reduce the number of samples in the majority class so that it matches the number of samples in the minority class.

We selected indices of the majority class at random using the “np.random.choice” function, specifying the number of samples in the “minority_class” as the number to draw, and stored the result in “random_majority_indices”. Then we concatenated the indices of the “minority_class” with “random_majority_indices” and stored the output in “under_sample_indices”.

To balance the data we selected the rows listed in “under_sample_indices” and saved them in a new dataframe “under_sample”. The data is now balanced, as shown in the graph below, and is ready for training the model.

picking random majority indices

showing a balanced selection of loans applications that were accepted and rejected
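The undersampling steps described above can be sketched as follows. The dataframe is made up (seven approved, three rejected), the helper names “minority_class_indices” and “majority_class_indices” are hypothetical, and the fixed seed is only there to make the sketch reproducible.

```python
import numpy as np
import pandas as pd

# Toy encoded data: 1 = approved (majority), 0 = rejected (minority)
df = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    "Loan_Status":    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
})

np.random.seed(0)  # assumption: seeded for reproducibility

minority_class_indices = df[df["Loan_Status"] == 0].index
majority_class_indices = df[df["Loan_Status"] == 1].index

# Draw as many majority rows as there are minority rows, without repeats
random_majority_indices = np.random.choice(
    majority_class_indices, size=len(minority_class_indices), replace=False
)

# Combine both index sets and select those rows
under_sample_indices = np.concatenate(
    [minority_class_indices, random_majority_indices]
)
under_sample = df.loc[under_sample_indices]

print(under_sample["Loan_Status"].value_counts())
```

Undersampling discards data, which matters on a dataset this small; oversampling the minority class or using class weights are alternatives worth comparing.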

Model Training

Now, it’s time to train the model! For this, we’ll split the data, keeping “33%” as test data and the remainder as training data. We’ll also perform “cross-validation” to get a more reliable estimate of the model’s performance and report the accuracy of each model in percent.

We’ll train the model using “Logistic Regression” and check the accuracy of the model. “Logistic Regression” is a popular classification algorithm that is used to predict a binary outcome i.e. “Yes/No”.

logreg
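A minimal sketch of the split, training and cross-validation with scikit-learn. Synthetic data from “make_classification” stands in for the encoded, balanced loan features, and “max_iter”, “cv=5” and “random_state” are assumptions rather than the notebook's settings, so the printed accuracy will not match the figure reported for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared feature matrix X and target y
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the held-out test set, in percent
test_accuracy = model.score(X_test, y_test) * 100

# 5-fold cross-validation: the model is refit on each fold
cv_scores = cross_val_score(model, X, y, cv=5)
cv_accuracy = cv_scores.mean() * 100

print(f"Test accuracy: {test_accuracy:.1f}%")
print(f"Cross-validation accuracy: {cv_accuracy:.1f}%")
```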

After implementing the machine learning algorithm, the accuracy obtained by “LogisticRegression” is 68%. Let’s plot the confusion matrix for the test data and get a summary of the predicted results. To learn more about the “confusion matrix” you can refer to our lesson on Mage matrix performance.

Show confusion matrix

From the above confusion matrix, we see that the model correctly predicted “24%” for “0 (No)” and “48%” for “1 (Yes)”.
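For reference, here is a sketch of computing a confusion matrix with scikit-learn on made-up labels. Expressing each cell as a share of all samples with “normalize="all"” is one way to obtain percentages; whether the figures above were computed this way is an assumption.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (0 = No, 1 = Yes)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Same matrix with each cell as a fraction of all samples
cm_share = confusion_matrix(y_true, y_pred, normalize="all")
print(cm_share * 100)

# Accuracy is the sum of the diagonal shares
accuracy = cm_share.trace()
print(f"Accuracy: {accuracy:.0%}")
```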

Conclusion

Mage provides a low-code solution that takes very little effort: after data cleaning, training takes only a few clicks. Let’s train the model on Mage and check its accuracy. We’ll begin by performing data cleaning and feature engineering using Mage.

Fill in missing values

We fill in the values of categorical features with “mode” and numerical features with “mean” of the column on Mage.

gif of filling in missing values

fill in missing values continued

Column creation

We perform feature engineering by creating a new column where we add the “Applicant_Income” and “Coapplicant_Income” together and store it in a new column “Total_Income”.

gif of adding columns in Mage

Removing Columns

Let’s remove the unwanted columns that don’t affect the target variable “Loan_Status” before training the model, to get better accuracy.

gif of removing columns

Model Training

For training the model on Mage we used a “Logistic Regression” algorithm. Let’s check the accuracy of the model.

gif of statistics tab in Mage

After training the model on Mage the accuracy is 85% with average performance. The features which influence the prediction of the results are “Credit_History” and “Property_Area” of “Semiurban” regions. There’re also some other features that have an influence on the weight for the prediction of results.

Retraining Model

We can also retrain the model after removing some features, creating a new version and comparing the two versions to understand the improvement between them.

gif of retraining

gif of comparing versions

On retraining the model, the accuracy comes to 86% with average performance. The confusion matrix on Mage shows “85” correct predictions for “1 (Yes)” and “20” correct predictions for “0 (No)”.

confusion matrix

Predicting Output

Let’s check how accurately our Mage model predicts the output of loan approval based on “Credit_History”.

Mage playground to predict values

We observe that our trained model on Mage correctly predicts the target variable for loan approval based on the “Credit_History” of the applicants.

Model Training by Supreme

We can train the model using “supreme” training sessions to improve the accuracy which gives us more reliable production-ready predictions.

supreme training

After training the model with a “supreme” session, the accuracy achieved was 81% with average performance, but the precision increased to 89.62% with excellent performance. So now we’ve automated the process of loan approval for the “Dream Housing Finance” company and provided a low-code solution with Mage to predict loan approval status using “Credit_History” and “Property_Area”. Go ahead and try your model on Mage.

Want to see the code? Check the “loan prediction analysis” code in the colab notebook. Want to learn more about machine learning (ML)? Visit Mage Academy.
