TLDR

In this article, we’ll go over a standard supervised classification task: predicting whether a loan should be approved or not.

Outline

Introduction
Before we begin
How to code
Data cleaning
Data visualization
Feature engineering
Model training
Conclusion

Introduction

The “Dream Housing Finance” company deals in all home loans and has a presence across urban, semi-urban and rural areas. Customers first apply for a home loan, and the company then validates their eligibility. The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online application form. These details are “Gender”, “Married”, “Education”, “Dependents”, “Income”, “Loan_Amount”, “Credit_History” and others. To automate the process, they have posed the problem of identifying the customer segments that are eligible for the loan amount, so that they can specifically target these customers.

Source: Vectorstock.com

Before we begin

Let’s get familiarized with the dataset.

Loan.csv: this file contains the loan attributes described below.

The different variables present in the dataset are:

Numerical features: Applicant_Income, Coapplicant_Income, Loan_Amount, Loan_Amount_Term and Dependents.
Categorical features: Gender, Credit_History, Self_Employed, Married and Loan_Status.
Alphanumeric Features: Loan_Id.
Text Features: Education and Property_Area.

As mentioned above we need to predict our target variable which is “Loan_Status”. “Loan_Status” can have two values.

Y (Yes): If the loan is approved.

N (No): If the loan is not approved.

So using the training dataset we’ll train our model and predict our target column “Loan_Status”.

Like all our other guides, we’ll be using Google Colab and start by importing the dataset into a dataframe.

How to code

The company will approve loans for applicants who have a good “Credit_History” and who are likely to be able to repay. To begin, we’ll load the dataset “Loan.csv” into a dataframe, display the first five rows and check its shape to ensure we have enough data to make our model production-ready.
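A minimal sketch of this loading step is shown here. The three-row CSV text is a made-up stand-in so the example runs without the real file; with the actual dataset you would pass the path to “Loan.csv” instead, and the columns shown are only a subset chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for "Loan.csv" (illustration only).
sample_csv = io.StringIO(
    "Loan_Id,Gender,Married,Applicant_Income,Loan_Status\n"
    "LP001,Male,No,5849,Y\n"
    "LP002,Male,Yes,4583,N\n"
    "LP003,Female,Yes,3000,Y\n"
)

# With the real file this would be: df = pd.read_csv("Loan.csv")
df = pd.read_csv(sample_csv)

print(df.head())  # first five rows (only three exist in this toy sample)
print(df.shape)   # tuple of (number of rows, number of columns)
```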

There are “614” rows and “13” columns, which is enough data to make a production-ready model. The input attributes are in numerical and categorical form, and we’ll analyze them to predict our target variable “Loan_Status”. Let’s look at the statistical summary of the numerical variables using the “describe()” function.

From the “describe()” output we see that the counts fall short for the variables “Loan_Amount”, “Loan_Amount_Term” and “Credit_History”: the total count should be “614” for each, so we’ll have to pre-process the data to handle the missing values.
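As a rough illustration of how “describe()” exposes missing data, the sketch below uses a small made-up dataframe (the values are assumptions, not taken from “Loan.csv”). The “count” row only counts non-null entries, so any column whose count is below the number of rows has gaps.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (illustration only).
df = pd.DataFrame({
    "Applicant_Income": [5849, 4583, 3000, 2583],
    "Loan_Amount": [np.nan, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, np.nan],
})

stats = df.describe()
print(stats)

# "count" excludes NaN, so rows minus count gives the missing entries per column.
missing_from_count = len(df) - stats.loc["count"]
print(missing_from_count)
```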

Data Cleaning

Data cleaning is a process to identify and correct errors in the dataset that may negatively impact our predictive model. We’ll find the “null” values of every column as an initial step to data cleaning.

We observe that there are “13” missing values in “Gender”, “3” in “Married”, “15” in “Dependents”, “32” in “Self_Employed”, “22” in “Loan_Amount”, “14” in “Loan_Amount_Term” and “50” in “Credit_History”.
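One common way to get such per-column counts is to chain “isnull()” and “sum()”. The sketch below runs on a made-up four-row dataframe, so its counts are not the ones reported above.

```python
import numpy as np
import pandas as pd

# Toy data (illustration only); None and NaN both count as missing.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Loan_Amount": [np.nan, 128.0, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
null_counts = df.isnull().sum()
print(null_counts)
```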

We assume that the missing values of the numerical and categorical features are “missing at random (MAR)”, i.e. whether a value is missing may depend on the other observed variables but not on the missing value itself.

So we’ll fill the missing values of the numerical features with the “mean” and those of the categorical features with the “mode”, i.e. the most frequently occurring value. We use the Pandas “fillna()” function for imputing the missing values. The “mean” gives us an estimate of the central tendency (note that, unlike the median, it is sensitive to extreme values), while the “mode” is not affected by extreme values; both provide a neutral fill value. For more information on imputing data refer to our guide on estimating lost data.
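A minimal sketch of this imputation, assuming a toy dataframe with one numerical and one categorical column (the column lists are hypothetical and would be longer on the real dataset):

```python
import numpy as np
import pandas as pd

# Toy data (illustration only).
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Loan_Amount": [100.0, np.nan, 200.0, 150.0],
})

numerical_cols = ["Loan_Amount"]   # assumed numerical features
categorical_cols = ["Gender"]      # assumed categorical features

# Numerical features: replace missing values with the column mean.
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].mean())

# Categorical features: replace missing values with the most frequent value.
# mode() returns a Series because ties are possible, so we take its first entry.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```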

Let’s check the “null” values again to ensure that there are no missing values left, as they would lead us to incorrect results.

From the above output, we see that there’re no null values and now we can perform the data visualization.

Data Visualization

To gain a few insights about the data we visualize the categorical data before training the model.

Categorical data: a type of data that groups information with similar characteristics and is represented by discrete labelled groups, e.g. gender, blood type, country affiliation. You can read the blogs on categorical data for a better understanding of data types.

Now let’s visualize the numerical features.

Numerical data: data that expresses information in the form of numbers, e.g. height, weight, age. If you are unfamiliar, please read the blogs on numerical data.

Feature Engineering

We’ll create a new attribute named “Total_Income” by adding the two columns “Coapplicant_Income” and “Applicant_Income”, as we assume that the co-applicant is a person from the same family, e.g. a spouse or father, and then display the first five rows of “Total_Income”. To learn more about column creation with conditions refer to our lesson adding column with conditions.
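A short sketch of this step on a made-up three-row dataframe (the income values are illustrative only):

```python
import pandas as pd

# Toy data (illustration only).
df = pd.DataFrame({
    "Applicant_Income": [5849, 4583, 3000],
    "Coapplicant_Income": [0.0, 1508.0, 0.0],
})

# Element-wise sum of the two income columns; the new column is appended last.
df["Total_Income"] = df["Applicant_Income"] + df["Coapplicant_Income"]

print(df["Total_Income"].head())
```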

“Total_Income” is the last column added to our dataframe as above.

We see that most values are concentrated in the range of “0-10,000” with a few extreme values beyond it, so the data is skewed. This is possibly because some people have applied for high loans due to specific needs. To learn more about skewed data and uniform distribution of data kindly refer to our lesson on graphical analysis. So, we’ll apply a log transformation to “Total_Income” to bring its distribution closer to normal.
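The effect of the log transformation can be sketched as follows. The income values are made up, with one deliberately extreme entry, and the use of the natural logarithm via “np.log” is an assumption; the column name “Total_Income_Log” follows the article.

```python
import numpy as np
import pandas as pd

# Toy incomes with one extreme value (illustration only).
df = pd.DataFrame({
    "Total_Income": [2500.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 9000.0, 81000.0]
})

# The natural log compresses large values far more than small ones,
# which pulls in the long tail of the distribution.
df["Total_Income_Log"] = np.log(df["Total_Income"])

raw_skew = df["Total_Income"].skew()
log_skew = df["Total_Income_Log"].skew()
print(raw_skew, log_skew)  # skewness before and after the transformation
```

Note that “np.log” is undefined for zero or negative values, so this only works when every total income is positive.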

Below is the graph for “Total_Income_Log”.

Data Cleaning

As a part of the data cleaning process, let’s drop the unnecessary columns which don’t affect the “Loan_Status” variable. This helps in improving the accuracy of the model. We’ll then display the first five rows of the dataframe.
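A sketch of dropping columns with Pandas is given here. The article does not list which columns are removed, so the choice in “cols_to_drop” (the identifier and the raw income columns already summarized by “Total_Income”) is purely an assumption for illustration.

```python
import pandas as pd

# Toy data (illustration only).
df = pd.DataFrame({
    "Loan_Id": ["LP001", "LP002"],
    "Applicant_Income": [5849, 4583],
    "Coapplicant_Income": [0.0, 1508.0],
    "Total_Income": [5849.0, 6091.0],
    "Loan_Status": ["Y", "N"],
})

# Hypothetical selection of columns to remove.
cols_to_drop = ["Loan_Id", "Applicant_Income", "Coapplicant_Income"]
df = df.drop(columns=cols_to_drop)

print(df.head())
```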

Category Value Mapping

By using “label encoding” we’ll convert the categorical features to numerical features and display the first five rows of the dataframe.
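One way to do this is scikit-learn’s “LabelEncoder”, sketched below on made-up rows; whether the original notebook uses this class or another method is an assumption. The encoder assigns integers in sorted order of the labels, so “N” becomes 0 and “Y” becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (illustration only).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Fit a fresh encoder per column so each column gets its own 0..k-1 mapping.
for col in ["Gender", "Property_Area", "Loan_Status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df.head())
```

Label encoding imposes an arbitrary order on the categories (here Rural < Semiurban < Urban), which is worth keeping in mind for features with more than two values.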

Normalizing Imbalanced Data

Before we start training the model we have to balance the imbalanced data. Imbalanced data refers to a classification dataset where the number of observations is not the same for all the classes. You can refer to our guide to learn more about a balanced dataset.

In our dataset, the target variable “Loan_Status” is highly imbalanced, which may result in biased output. So, we’ll balance the data by “undersampling” it. “Undersampling” is a technique that randomly selects examples from the majority class and deletes them from the training dataset.

The data imbalance for “Loan_Status” is seen in the above graph, with “68%” representing “1 (Yes)” and “31%” representing “0 (No)”. We performed “undersampling” on the majority class “1 (Yes)” by randomly deleting samples from it in the training data. The aim is to reduce the number of samples in the majority class so that it matches the total number of samples in the minority class.

We selected indices of the majority class at random by using the “np.random.choice” function, specifying the number of samples in the “minority_class” as the number required, and stored them in “random_majority_indices”. Then we concatenated the indices of the “minority_class” with “random_majority_indices” and stored the output in ”under_sample_indices”.

To balance the data we selected the rows listed in “under_sample_indices” and saved them in a new dataframe “under_sample”. The balanced data is plotted in the graph below and is now ready for training the model.
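The undersampling steps above can be sketched as follows. The names “random_majority_indices”, “under_sample_indices” and “under_sample” follow the article; the toy dataframe, the fixed seed and the other helper names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the sketch is reproducible

# Toy data: 7 approved (1) and 3 rejected (0) loans (illustration only).
df = pd.DataFrame({
    "Total_Income_Log": np.linspace(8.0, 9.0, 10),
    "Loan_Status": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

majority_class_indices = df[df["Loan_Status"] == 1].index
minority_class_indices = df[df["Loan_Status"] == 0].index
minority_class_len = len(minority_class_indices)

# Draw as many majority rows as there are minority rows, without repeats.
random_majority_indices = np.random.choice(
    majority_class_indices, minority_class_len, replace=False
)

# Combine both sets of row labels and select those rows.
under_sample_indices = np.concatenate([minority_class_indices, random_majority_indices])
under_sample = df.loc[under_sample_indices]

print(under_sample["Loan_Status"].value_counts())
```

Undersampling discards data, which matters on a dataset of only a few hundred rows; it is applied here because it is the technique the article describes.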

Model Training

Now, it’s time to train the model! For this, we’ll split the data, keeping “33%” as test data and the remainder as training data. We’ll perform “cross-validation” to get a more reliable estimate of the model’s performance and check the accuracy of each model in percent.

We’ll train the model using “Logistic Regression” and check the accuracy of the model. “Logistic Regression” is a popular classification algorithm that is used to predict a binary outcome i.e. “Yes/No”.
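A minimal sketch of the split, training and cross-validation with scikit-learn is given here. The features and labels are synthetic stand-ins generated for the example, and the choices of “random_state=42” and five folds are assumptions; only the 33% test share comes from the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: 200 samples, 3 features, binary label (illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 33% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test) * 100  # accuracy in percent

# 5-fold cross-validation on the full data; each fold trains a fresh copy of the model.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
cv_accuracy = cv_scores.mean() * 100

print(f"Test accuracy: {test_accuracy:.1f}%")
print(f"Cross-validation accuracy: {cv_accuracy:.1f}%")
```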

After implementing the machine learning algorithm, the accuracy obtained by “LogisticRegression” is 68%. Let’s plot the confusion matrix for the test data and get a summary of the predicted results. To learn more about “confusion matrix” you can refer to our lesson on Mage matrix performance.

From the above confusion matrix, we derive that model predicted “24%” for “0 (No)” correctly and “48%” for “1 (Yes)” correctly.
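To show how such percentages can be read off a confusion matrix, here is a small sketch with made-up true and predicted labels; dividing by the total number of test samples is an assumption about how the percentages above were computed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for eight test samples (illustration only).
y_test = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes, both ordered 0 then 1.
cm = confusion_matrix(y_test, y_pred)

# Express each cell as a share of all test samples.
cm_percent = cm / cm.sum() * 100

print(cm)
print(cm_percent)
```

Here the diagonal cells hold the correct predictions: “cm_percent[0, 0]” is the share of samples correctly predicted as “0 (No)” and “cm_percent[1, 1]” the share correctly predicted as “1 (Yes)”.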

Conclusion

Mage provides us with a low-code solution that, after data cleaning, takes very little effort and only a few clicks. Let’s train the model on Mage and check the accuracy of the model. We’ll begin by performing data cleaning and feature engineering using Mage.

Fill in missing values

We fill in the values of categorical features with “mode” and numerical features with “mean” of the column on Mage.

Column creation

We perform feature engineering by creating a new column where we add the “Applicant_Income” and “Coapplicant_Income” together and store it in a new column “Total_Income”.

Removing Columns

Let’s remove the unwanted columns not affecting the target variable “loan_status” before training the model to have better accuracy.

Model Training

For training the model on Mage we used a “Logistic Regression” algorithm. Let’s check the accuracy of the model.

After training the model on Mage the accuracy is 85% with average performance. The features which most influence the predictions are “Credit_History” and the “Semiurban” value of “Property_Area”. Some other features also carry weight in the predictions.

Retraining Model

We can also retrain the model by removing some features by creating and comparing the versions to understand the improvement between the two versions.

On retraining the model, the accuracy comes to 86% with average performance. The confusion matrix on Mage shows that the model correctly predicts “85” samples for “1 (Yes)” and “20” samples for “0 (No)”.

Predicting Output

Let’s check how accurately our Mage model predicts the output of loan approval based on “credit_history”.

We observe that our trained model on Mage correctly predicts the target variable for loan approval based on the “Credit_History” of the applicants.

Model Training by Supreme

We can train the model using “supreme” training sessions to improve the accuracy which gives us more reliable production-ready predictions.

After training the model with a “supreme” session the accuracy achieved was 81% with average performance, but the precision increased to 89.62% with excellent performance. So now we’ve automated the process of loan approval for the “Dream Housing Finance” company and provided a low-code solution with Mage to predict loan approval status using “Credit_History” and “Property_Area”. Go ahead and try your model on Mage.

Want to see the code? Check the “loan prediction analysis” code in the colab notebook. Want to learn more about machine learning (ML)? Visit Mage Academy.
