Building a Machine Learning Model Using Multiple Linear Regression
AngelaMunyao
Posted on January 27, 2022
I spent a couple of hours early this morning putting this article together for all Machine Learning enthusiasts, especially those at the beginner to intermediate level.
One of the essential skills for ML engineers is understanding regression and how it applies to data sets of any size, from small ones to giant ones.
This article walks through a practical example of how to build a Machine Learning model using Multiple Linear Regression.
For this exercise, make sure you have Anaconda installed, and from there, open Jupyter Notebook.
Download the Combined Cycle Power Plant data set from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant
Import the libraries below that we will be using:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab as pl
Enable inline plotting with Matplotlib
%matplotlib inline
Import the data: Unzip the downloaded folder and make sure the data file is in the same directory as your notebook file:
data_df=pd.read_excel("Folds5x2_pp.xlsx")
You can now view your data with the command below (by default, it displays the first 5 rows):
data_df.head()
To recap the variables shown above:
AT: ambient temperature, in the range 1.81°C to 37.11°C
V: exhaust vacuum, in the range 25.36 to 81.56 cm Hg
AP: ambient pressure, in the range 992.89 to 1033.30 millibar
RH: relative humidity, in the range 25.56% to 100.16%
PE: net hourly electrical energy output, in the range 420.26 to 495.76 MW
Your dependent variable is PE.
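To check the ranges listed above against your own data, something like the sketch below could help. It does not read the real file: 'demo_df' is a hypothetical stand-in filled with random values drawn inside those ranges, so swap in 'data_df' when running it in your notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_df: random values drawn inside the ranges
# listed above. These are NOT the real measurements.
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "AT": rng.uniform(1.81, 37.11, n),
    "V": rng.uniform(25.36, 81.56, n),
    "AP": rng.uniform(992.89, 1033.30, n),
    "RH": rng.uniform(25.56, 100.16, n),
    "PE": rng.uniform(420.26, 495.76, n),
})

# One row per column, showing its observed minimum and maximum.
ranges = demo_df.agg(["min", "max"]).T
print(ranges)

# Quick sanity checks: count of missing values and the type of each column.
print(demo_df.isna().sum())
print(demo_df.dtypes)
```

If the minimum and maximum of a column fall far outside the documented range, that is usually a sign the wrong sheet or file was loaded.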
Let's define x and y, x being the independent variables and y being the dependent variable.
To capture the independent variables, we use 'x=data_df.drop(['PE'], axis=1).values'.
The drop function excludes the dependent variable PE, 'axis=1' tells it to drop a column rather than a row, and '.values' returns the remaining data as a NumPy array.
To capture the dependent variable PE, we use 'y=data_df['PE'].values'.
x=data_df.drop(['PE'], axis=1).values
y=data_df['PE'].values
Confirm X and Y values.
print(x)
print(y)
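Printing the full arrays can be a lot to read, so checking their shapes is a quick alternative. The sketch below uses a tiny hypothetical data frame ('demo_df', with illustrative values that are not guaranteed to match the real file) that has the same column names as 'data_df'.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data frame with the same column names as data_df.
# The values are illustrative only.
demo_df = pd.DataFrame({
    "AT": [14.96, 25.18, 5.11],
    "V": [41.76, 62.96, 39.40],
    "AP": [1024.07, 1020.04, 1012.16],
    "RH": [73.17, 59.08, 92.14],
    "PE": [463.26, 444.37, 488.56],
})

# Same approach as in the article: everything except PE goes into x.
x = demo_df.drop(["PE"], axis=1).values
y = demo_df["PE"].values

print(x.shape)  # (3, 4): 3 rows, 4 independent variables
print(y.shape)  # (3,): one PE value per row

# An equivalent spelling using keyword 'columns' and to_numpy().
x_alt = demo_df.drop(columns=["PE"]).to_numpy()
print(np.array_equal(x, x_alt))  # True
```

On the real data, x should have 4 columns and y should have the same number of rows as x.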
Split the data set into a training set and a test set:
We use the 'train_test_split' function from the scikit-learn library.
Import the train_test_split function.
from sklearn.model_selection import train_test_split
Divide your data into x_train, x_test, y_train, y_test.
x_train,x_test,y_train,y_test=train_test_split(x,y)
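Voila! Your data is split into a training set and a test set. By default, train_test_split shuffles the rows and puts 25% of them in the test set, so results vary slightly between runs unless random_state is set.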
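If you want the same split every time you run the notebook, one option is to pass 'test_size' and 'random_state' explicitly. The sketch below uses hypothetical stand-in arrays ('x_demo', 'y_demo') instead of the real data, and the value 0 for random_state is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays: 100 rows, 4 feature columns.
x_demo = np.arange(400, dtype=float).reshape(100, 4)
y_demo = np.arange(100, dtype=float)

# test_size=0.25 matches the default; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)
print(x_train.shape, x_test.shape)  # (75, 4) (25, 4)

# Splitting again with the same random_state gives the same rows.
_, x_test_again, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)
print(np.array_equal(x_test, x_test_again))  # True
```

A fixed random_state makes the R² score reproducible, which helps when comparing changes to the model.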
Next is to train the model using the training set. We will make use of linear regression.
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(x_train,y_train)
After training the model, predict the test set results.
y_pred=model.predict(x_test)
Let's print the prediction results
print(y_pred)
The predictions above are PE values for every row of x_test, each computed from that row's independent variables (AT, V, AP, RH).
We can also predict PE for a single row of x values.
The example below uses the values from the first row of x.
model.predict([[14.96,41.76,1024.07,73.17]])
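Under the hood, a fitted LinearRegression model is just one coefficient per independent variable plus an intercept. The sketch below shows how a single prediction could be reproduced by hand. It trains on hypothetical stand-in data ('x_demo', 'y_demo') built from made-up coefficients, so the printed numbers will not match the real power plant model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: random features inside the documented ranges and
# a target built from made-up coefficients plus noise (NOT the real data set).
rng = np.random.default_rng(0)
n = 500
x_demo = np.column_stack([
    rng.uniform(1.81, 37.11, n),      # AT
    rng.uniform(25.36, 81.56, n),     # V
    rng.uniform(992.89, 1033.30, n),  # AP
    rng.uniform(25.56, 100.16, n),    # RH
])
made_up_coef = np.array([-2.0, -0.2, 0.05, -0.15])
y_demo = 450.0 + x_demo @ made_up_coef + rng.normal(0.0, 1.0, n)

demo_model = LinearRegression().fit(x_demo, y_demo)

# One coefficient per feature, plus a single intercept.
for name, coef in zip(["AT", "V", "AP", "RH"], demo_model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {demo_model.intercept_:.3f}")

# Predicting one row by hand: intercept + sum(coefficient * feature value).
row = np.array([14.96, 41.76, 1024.07, 73.17])
manual = demo_model.intercept_ + np.dot(demo_model.coef_, row)
from_model = demo_model.predict([row])[0]
print(manual, from_model)  # the two values agree
```

In your notebook, 'model.coef_' and 'model.intercept_' give the same information for the model trained on the real data.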
Let's check how accurate our model is.
We need to import the function 'r2_score'.
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
The R² score of our model is about 0.92, meaning it explains roughly 92% of the variance in PE on the test set. :)
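For anyone curious about what r2_score computes, it is 1 minus the ratio of the squared prediction errors to the squared deviations of the actual values from their mean. Here is a small sketch that checks this by hand. The arrays 'y_true' and 'y_hat' are hypothetical values chosen for illustration, not output from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted PE values, for illustration only.
y_true = np.array([463.26, 444.37, 488.56, 446.48, 473.90])
y_hat = np.array([467.0, 444.0, 483.5, 450.0, 474.5])

# Residual sum of squares: how far predictions are from the actual values.
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: how far actual values are from their own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))  # the two values agree
```

A score of 1 would mean perfect predictions, while 0 would mean the model does no better than always predicting the mean.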
Next, let's visualize the predicted results in a scatter plot.
We already imported Matplotlib, which we are going to use.
#plt.figure sets the figure size so the graph doesn't appear too small
plt.figure(figsize=(15, 10))
plt.scatter(y_test,y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual PE Values vs. Predicted')
plt.show()
From our dataset, let's also create a comparison of the test values and the predicted values.
Use Pandas already imported as 'pd', to put the values into a data frame.
pred_comparison=pd.DataFrame({'Actual Value': y_test,'Predicted Value':y_pred, 'Difference':y_test-y_pred})
pred_comparison
The output above shows the first 5 and the last 5 rows of the comparison data frame.
To view the first 40 rows for more clarity, we use:
pred_comparison[0:40]
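Reading the differences row by row only goes so far, so it can help to summarize them in one or two numbers. The sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) from a comparison frame. The frame 'demo_comparison' and its values are hypothetical, standing in for the real 'pred_comparison'.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted PE values, for illustration only.
y_test_demo = np.array([463.26, 444.37, 488.56, 446.48, 473.90])
y_pred_demo = np.array([467.0, 444.0, 483.5, 450.0, 474.5])

demo_comparison = pd.DataFrame({
    "Actual Value": y_test_demo,
    "Predicted Value": y_pred_demo,
    "Difference": y_test_demo - y_pred_demo,
})

# MAE: average size of the error, in MW, ignoring its sign.
mae = demo_comparison["Difference"].abs().mean()
# RMSE: like MAE, but larger errors are penalized more heavily.
rmse = np.sqrt((demo_comparison["Difference"] ** 2).mean())
print(mae, rmse)

# The same numbers from scikit-learn's metric functions.
print(mean_absolute_error(y_test_demo, y_pred_demo))
print(np.sqrt(mean_squared_error(y_test_demo, y_pred_demo)))
```

Both numbers are in the same unit as PE (MW), which makes them easier to interpret than R² alone.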
Awesomeee, that's our model right there! Research more ways to improve the model. Bye!