How to build a simple Machine Learning Regression Model.

In this article, I'll walk through a step-by-step method for building a regression model using sklearn's linear regression.
A regression model is a supervised machine learning model that predicts numerical values from numeric or boolean inputs, for example predicting house prices.
We'll be using a dataset obtained from Kaggle.

Training the model.

Import useful libraries.

First, import the libraries and functions we'll make use of.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

NumPy is a numerical library for Python that provides fast array operations and statistical calculations.
Pandas is a Python library used for loading and manipulating datasets.

Matplotlib and seaborn are visualization libraries, which we'll use in later articles.

Sklearn (scikit-learn) is a Python library that contains several machine learning models and tools for model evaluation. Today we'll be using one of its models, linear regression.

Reading the data.

First, read the data using the pandas read_csv method.
data = pd.read_csv('/Users/user/loan_sanction_train.csv')

You can find out more about the data by checking the first 5 rows using the data.head() method.
data.head()
Output:

Exploratory Data Analysis

To get more information about the data, we can use the pandas info method.

data.info()

Output:

From there, we can read the number of rows and columns present in the dataset, as well as the datatype of each column.
To get summary statistics for the dataset, we use the pandas describe method.

data.describe()

Output:

The output shows summary statistics for each numeric column in the dataset: the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile and maximum.
I'll provide more explanation of these, as well as useful visualizations, in a future article on EDA (exploratory data analysis).
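As a small illustration of what describe reports, here is a sketch on a made-up DataFrame. The column names and values below are assumptions chosen for the example and are not taken from the article's dataset.

```python
import pandas as pd

# Made-up values, purely for illustration (not the article's dataset).
toy = pd.DataFrame({
    "age": [19, 25, 31, 40, 52],
    "charges": [1200.0, 2300.0, 3100.0, 4800.0, 9900.0],
})

# One row per statistic, one column per numeric feature
summary = toy.describe()
print(summary)

# The row labelled "50%" is the median of each column
median_age = summary.loc["50%", "age"]
print(median_age)
```

Note that the median does not appear under its own name: it is the row labelled 50%.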

Data Cleaning and Preparation

To train your model, you need to convert the categorical columns into numerical ones. Before transforming a column, you can use pandas value_counts to see the unique values it contains and how often each one occurs.
Example:
data['sex'].value_counts()
Output:

You can create a function that takes the DataFrame and a column name and replaces each category in that column with an integer code. Applying it to each categorical column converts them all into numeric datatypes.

def category_val(df, col):
    # Convert the column to the pandas 'category' dtype
    df[col] = df[col].astype('category')
    # Replace each category with its integer code
    df[col] = df[col].cat.codes
    return df[col]

data['sex'] = category_val(data, 'sex')
data['smoker'] = category_val(data, 'smoker')
data['region'] = category_val(data, 'region')
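To see which number each category receives, the following sketch applies the same astype('category') and cat.codes steps to a made-up column. The toy data is an assumption for illustration only.

```python
import pandas as pd

# Made-up rows, purely for illustration (not the article's dataset).
toy = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

as_category = toy["smoker"].astype("category")

# Categories are stored in sorted order, so "no" gets 0 and "yes" gets 1
mapping = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes

print(mapping)
print(codes.tolist())
```

Keeping the mapping around makes it possible to translate the codes back to labels later. One thing to keep in mind is that integer codes imply an order between categories. For a column with more than two unordered values, such as region, one-hot encoding (for example with pd.get_dummies) is a common alternative.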

Model Building.

Next, separate your data into x and y, where y is the target variable.
x holds the rest of the features, since we're using all of them to predict y, which contains only the target column.
x = data.drop('charges', axis=1)
y = data['charges']

Next, we split x and y into a training set and a test set. The training set is used to train the model, while the test set is used to evaluate the model on data it has not seen during training. It is advisable to use between 20% and 30% of the data for the test set and the remaining 70% to 80% for the training set. In this article, we'll use 30% for our test set.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
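The effect of test_size and random_state is easier to see on a tiny example. The sketch below uses ten made-up rows (an assumption for illustration, not the article's data) and checks how many rows end up in each set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows, purely for illustration.
toy = pd.DataFrame({
    "age": range(20, 30),
    "charges": range(1000, 2000, 100),
})
x = toy.drop("charges", axis=1)
y = toy["charges"]

# test_size=0.3 sends 30% of the rows (3 of 10) to the test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)
print(len(x_train), len(x_test))

# Repeating the call with the same random_state selects the same rows
x_train2, x_test2, _, _ = train_test_split(x, y, test_size=0.3, random_state=42)
print(x_test.index.tolist() == x_test2.index.tolist())
```

Fixing random_state makes the split reproducible, so the evaluation score does not change from run to run just because different rows were held out.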

We then create the linear regression model and fit it to our training data.

model = LinearRegression()
model.fit(x_train, y_train)

We've successfully trained our machine learning model. Now, how do we evaluate it on the test set? There are several metrics we can use, and more will be covered in future articles. In this article, we'll use the mean squared error.

Model Evaluation.

To evaluate the model, we need to compare what the model predicts for the test set with the actual y_test values.
We create a variable named y_pred and use the model to predict the target for each row of x_test.

y_pred=model.predict(x_test)

Finally, we evaluate the predicted values using the mean squared error.
mean_squared_error(y_test,y_pred)

Which gives a result of
33805466.898688614
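The number looks large because the mean squared error is expressed in squared units of the target. As a rough sketch, the metric can be computed by hand as the mean of the squared differences, and taking its square root brings it back to the units of the target. The three values of y_true and y_hat below are made up for illustration.

```python
import math
import numpy as np

# Made-up values, purely for illustration.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Mean of the squared differences: (1 + 0 + 4) / 3
mse = np.mean((y_true - y_hat) ** 2)
print(mse)

# Square root of the MSE reported above, back in the units of the target
rmse_article = math.sqrt(33805466.898688614)
print(rmse_article)
```

For the result above, the square root is roughly 5,814, which is a more interpretable measure of the typical size of the prediction error.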

There you have it: you've just trained your first regression model. Feels great, doesn't it?
Here's what happens underneath.

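Linear regression fits a model that has the same form as the equation of a line:
y = mx + c
Here c is the intercept, x is a feature, m is the gradient (the coefficient of x) and y is the target. With several features, each feature gets its own coefficient and the terms are added together.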

We can see how these relate to the trained model using model.coef_ and model.intercept_.
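To make the link between the equation and the fitted attributes concrete, here is a minimal sketch on made-up data generated exactly from y = 2*x1 + 3*x2 + 5. The data and the chosen coefficients are assumptions for illustration and have nothing to do with the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated exactly from y = 2*x1 + 3*x2 + 5
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

toy_model = LinearRegression().fit(X, y)

# coef_ holds one gradient per feature, intercept_ holds c
print(toy_model.coef_, toy_model.intercept_)

# predict() gives the same result as applying the equation by hand
manual = X @ toy_model.coef_ + toy_model.intercept_
print(np.allclose(manual, toy_model.predict(X)))
```

Because the toy data contains no noise, the fitted coefficients and intercept recover the values used to generate it. On real data they are the values that minimize the squared error on the training set.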

The intercept is given as -12364.39,
and the coefficients are given in the form of an array.
It's then easier to represent the generated equation like this: