How to build machine learning regression models with Python
Hunter Johnson
Posted on May 9, 2023
This article was written by Najeeb Ul Hassan, a member of Educative's technical content team.
Marvel Comics introduced a fictional character Destiny in the 1980s, with the ability to foresee future occurrences. The exciting news is that predicting future events is no longer just a fantasy! With the progress made in machine learning, a machine can help forecast future events by utilizing the past.
Exciting, right? Let's start this journey with a simple prediction model.
A regression is a mathematical function that defines the relationship between a dependent variable and one or more independent variables. Rather than delving into theory, the focus will be on creating different regression models.
Understanding the input data
Before starting to build a regression model, one should examine the data. For instance, if an individual owns a fish farm and needs to predict a fish's weight based on its dimensions, they can explore the dataset by displaying the top few rows of the DataFrame read from Fish.txt.
DataFrame (Fish.txt):
Species Weight V-Length D-Length X-Length Height Width
Bream 290 24 26.3 31.2 12.48 4.3056
Bream 340 23.9 26.5 31.1 12.3778 4.6961
Bream 363 26.3 29 33.5 12.73 4.4555
Bream 430 26.5 29 34 12.444 5.134
Bream 450 26.8 29.7 34.7 13.6024 4.9274
Bream 500 26.8 29.7 34.5 14.1795 5.2785
Bream 390 27.6 30 35 12.67 4.69
Bream 450 27.6 30 35.1 14.0049 4.8438
Bream 500 28.5 30.7 36.2 14.2266 4.9594
Bream 475 28.4 31 36.2 14.2628 5.1042
Bream 500 28.7 31 36.2 14.3714 4.8146
Bream 500 29.1 31.5 36.4 13.7592 4.368
Bream 340 29.5 32 37.3 13.9129 5.0728
Bream 600 29.4 32 37.2 14.9544 5.1708
Bream 600 29.4 32 37.2 15.438 5.58
Bream 700 30.4 33 38.3 14.8604 5.2854
Bream 700 30.4 33 38.5 14.938 5.1975
Bream 610 30.9 33.5 38.6 15.633 5.1338
Bream 650 31 33.5 38.7 14.4738 5.7276
Bream 575 31.3 34 39.5 15.1285 5.5695
Bream 685 31.4 34 39.2 15.9936 5.3704
Bream 620 31.5 34.5 39.7 15.5227 5.2801
Bream 680 31.8 35 40.6 15.4686 6.1306
Bream 700 31.9 35 40.5 16.2405 5.589
Bream 725 31.8 35 40.9 16.36 6.0532
Bream 720 32 35 40.6 16.3618 6.09
Bream 714 32.7 36 41.5 16.517 5.8515
Bream 850 32.8 36 41.6 16.8896 6.1984
Bream 1000 33.5 37 42.6 18.957 6.603
Bream 920 35 38.5 44.1 18.0369 6.3063
Bream 955 35 38.5 44 18.084 6.292
Bream 925 36.2 39.5 45.3 18.7542 6.7497
Bream 975 37.4 41 45.9 18.6354 6.7473
Bream 950 38 41 46.5 17.6235 6.3705
Roach 40 12.9 14.1 16.2 4.1472 2.268
Roach 69 16.5 18.2 20.3 5.2983 2.8217
Roach 78 17.5 18.8 21.2 5.5756 2.9044
Roach 87 18.2 19.8 22.2 5.6166 3.1746
Roach 120 18.6 20 22.2 6.216 3.5742
Roach 0 19 20.5 22.8 6.4752 3.3516
Roach 110 19.1 20.8 23.1 6.1677 3.3957
Roach 120 19.4 21 23.7 6.1146 3.2943
Roach 150 20.4 22 24.7 5.8045 3.7544
Roach 145 20.5 22 24.3 6.6339 3.5478
Roach 160 20.5 22.5 25.3 7.0334 3.8203
Roach 140 21 22.5 25 6.55 3.325
Roach 160 21.1 22.5 25 6.4 3.8
Roach 169 22 24 27.2 7.5344 3.8352
Roach 161 22 23.4 26.7 6.9153 3.6312
Roach 200 22.1 23.5 26.8 7.3968 4.1272
Roach 180 23.6 25.2 27.9 7.0866 3.906
Roach 290 24 26 29.2 8.8768 4.4968
Roach 272 25 27 30.6 8.568 4.7736
Roach 390 29.5 31.7 35 9.485 5.355
Whitefish 270 23.6 26 28.7 8.3804 4.2476
Whitefish 270 24.1 26.5 29.3 8.1454 4.2485
Whitefish 306 25.6 28 30.8 8.778 4.6816
Whitefish 540 28.5 31 34 10.744 6.562
Whitefish 800 33.7 36.4 39.6 11.7612 6.5736
Whitefish 1000 37.3 40 43.5 12.354 6.525
Parkki 55 13.5 14.7 16.5 6.8475 2.3265
Parkki 60 14.3 15.5 17.4 6.5772 2.3142
Parkki 90 16.3 17.7 19.8 7.4052 2.673
Parkki 120 17.5 19 21.3 8.3922 2.9181
Parkki 150 18.4 20 22.4 8.8928 3.2928
Parkki 140 19 20.7 23.2 8.5376 3.2944
Parkki 170 19 20.7 23.2 9.396 3.4104
Parkki 145 19.8 21.5 24.1 9.7364 3.1571
Parkki 200 21.2 23 25.8 10.3458 3.6636
Parkki 273 23 25 28 11.088 4.144
Parkki 300 24 26 29 11.368 4.234
Perch 5.9 7.5 8.4 8.8 2.112 1.408
Perch 32 12.5 13.7 14.7 3.528 1.9992
Perch 40 13.8 15 16 3.824 2.432
Perch 51.5 15 16.2 17.2 4.5924 2.6316
Perch 70 15.7 17.4 18.5 4.588 2.9415
Perch 100 16.2 18 19.2 5.2224 3.3216
Perch 78 16.8 18.7 19.4 5.1992 3.1234
Perch 80 17.2 19 20.2 5.6358 3.0502
Perch 85 17.8 19.6 20.8 5.1376 3.0368
Perch 85 18.2 20 21 5.082 2.772
Perch 110 19 21 22.5 5.6925 3.555
Perch 115 19 21 22.5 5.9175 3.3075
Perch 125 19 21 22.5 5.6925 3.6675
Perch 130 19.3 21.3 22.8 6.384 3.534
Perch 120 20 22 23.5 6.11 3.4075
Perch 120 20 22 23.5 5.64 3.525
Perch 130 20 22 23.5 6.11 3.525
Perch 135 20 22 23.5 5.875 3.525
Perch 110 20 22 23.5 5.5225 3.995
Perch 130 20.5 22.5 24 5.856 3.624
Perch 150 20.5 22.5 24 6.792 3.624
Perch 145 20.7 22.7 24.2 5.9532 3.63
Perch 150 21 23 24.5 5.2185 3.626
Perch 170 21.5 23.5 25 6.275 3.725
Perch 225 22 24 25.5 7.293 3.723
Perch 145 22 24 25.5 6.375 3.825
Perch 188 22.6 24.6 26.2 6.7334 4.1658
Perch 180 23 25 26.5 6.4395 3.6835
Perch 197 23.5 25.6 27 6.561 4.239
Perch 218 25 26.5 28 7.168 4.144
Perch 300 25.2 27.3 28.7 8.323 5.1373
Perch 260 25.4 27.5 28.9 7.1672 4.335
Perch 265 25.4 27.5 28.9 7.0516 4.335
Perch 250 25.4 27.5 28.9 7.2828 4.5662
Perch 250 25.9 28 29.4 7.8204 4.2042
Perch 300 26.9 28.7 30.1 7.5852 4.6354
Perch 320 27.8 30 31.6 7.6156 4.7716
Perch 514 30.5 32.8 34 10.03 6.018
Perch 556 32 34.5 36.5 10.2565 6.3875
Perch 840 32.5 35 37.3 11.4884 7.7957
Perch 685 34 36.5 39 10.881 6.864
Perch 700 34 36 38.3 10.6091 6.7408
Perch 700 34.5 37 39.4 10.835 6.2646
Perch 690 34.6 37 39.3 10.5717 6.3666
Perch 900 36.5 39 41.4 11.1366 7.4934
Perch 650 36.5 39 41.4 11.1366 6.003
Perch 820 36.6 39 41.3 12.4313 7.3514
Perch 850 36.9 40 42.3 11.9286 7.1064
Perch 900 37 40 42.5 11.73 7.225
Perch 1015 37 40 42.4 12.3808 7.4624
Perch 820 37.1 40 42.5 11.135 6.63
Perch 1100 39 42 44.6 12.8002 6.8684
Perch 1000 39.8 43 45.2 11.9328 7.2772
Perch 1100 40.1 43 45.5 12.5125 7.4165
Perch 1000 40.2 43.5 46 12.604 8.142
Perch 1000 41.1 44 46.6 12.4888 7.5958
Pike 200 30 32.3 34.8 5.568 3.3756
Pike 300 31.7 34 37.8 5.7078 4.158
Pike 300 32.7 35 38.8 5.9364 4.3844
Pike 300 34.8 37.3 39.8 6.2884 4.0198
Pike 430 35.5 38 40.5 7.29 4.5765
Pike 345 36 38.5 41 6.396 3.977
Pike 456 40 42.5 45.5 7.28 4.3225
Pike 510 40 42.5 45.5 6.825 4.459
Pike 540 40.1 43 45.8 7.786 5.1296
Pike 500 42 45 48 6.96 4.896
Pike 567 43.2 46 48.7 7.792 4.87
Pike 770 44.8 48 51.2 7.68 5.376
Pike 950 48.3 51.7 55.1 8.9262 6.1712
Pike 1250 52 56 59.7 10.6863 6.9849
Pike 1600 56 60 64 9.6 6.144
Pike 1550 56 60 64 9.6 6.144
Pike 1650 59 63.4 68 10.812 7.48
Smelt 6.7 9.3 9.8 10.8 1.7388 1.0476
Smelt 7.5 10 10.5 11.6 1.972 1.16
Smelt 7 10.1 10.6 11.6 1.7284 1.1484
Smelt 9.7 10.4 11 12 2.196 1.38
Smelt 9.8 10.7 11.2 12.4 2.0832 1.2772
Smelt 8.7 10.8 11.3 12.6 1.9782 1.2852
Smelt 10 11.3 11.8 13.1 2.2139 1.2838
Smelt 9.9 11.3 11.8 13.1 2.2139 1.1659
Smelt 9.8 11.4 12 13.2 2.2044 1.1484
Smelt 12.2 11.5 12.2 13.4 2.0904 1.3936
Smelt 13.4 11.7 12.4 13.5 2.43 1.269
Smelt 12.2 12.1 13 13.8 2.277 1.2558
Smelt 19.7 13.2 14.3 15.2 2.8728 2.0672
Smelt 19.9 13.8 15 16.2 2.9322 1.8792
Executable code:
# Step 1: Importing libraries
import pandas as pd
# Step 2: Defining the columns of and reading our DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Printing the head of our DataFrame
print(Fish.head())
Output:
Species Weight V-Length D-Length X-Length Height Width
0 Bream 290.0 24.0 26.3 31.2 12.4800 4.3056
1 Bream 340.0 23.9 26.5 31.1 12.3778 4.6961
2 Bream 363.0 26.3 29.0 33.5 12.7300 4.4555
3 Bream 430.0 26.5 29.0 34.0 12.4440 5.1340
4 Bream 450.0 26.8 29.7 34.7 13.6024 4.9274
- Line 2: The pandas library is imported to read the data into a DataFrame.
- Line 5: Reads the data from the Fish.txt file, keeping the columns defined in line 4.
- Line 7: Prints the top five rows of the DataFrame. The three lengths define the vertical, diagonal, and cross lengths in cm.
Here, the fish's length, height, and width are independent variables, with weight serving as the dependent variable. In machine learning, independent variables are often referred to as features and dependent variables as labels, and these terms will be used interchangeably throughout this blog.
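Beyond printing the head, a quick sanity check can catch suspicious records before modeling. The sketch below is an illustration rather than part of the original walkthrough: it reads a handful of rows copied from Fish.txt through an in-memory string (the names sample and fish_sample are assumptions made so the snippet runs on its own) and flags non-positive weights, such as the Roach row recorded with a weight of 0.

```python
import io
import pandas as pd

# A few rows copied from Fish.txt, tab-separated like the original file.
# (Inline sample used only to keep this sketch self-contained.)
sample = (
    "Species\tWeight\tV-Length\tD-Length\tX-Length\tHeight\tWidth\n"
    "Bream\t290\t24\t26.3\t31.2\t12.48\t4.3056\n"
    "Roach\t120\t18.6\t20\t22.2\t6.216\t3.5742\n"
    "Roach\t0\t19\t20.5\t22.8\t6.4752\t3.3516\n"
    "Roach\t110\t19.1\t20.8\t23.1\t6.1677\t3.3957\n"
)
fish_sample = pd.read_csv(io.StringIO(sample), sep="\t")

# Column types and missing-value counts
print(fish_sample.dtypes)
print(fish_sample.isna().sum())

# A weight of 0 is not physically plausible, so such rows deserve a closer look
suspicious = fish_sample[fish_sample["Weight"] <= 0]
print(suspicious)
```

Whether such a row should be dropped or corrected is a judgment call that depends on how the data was collected.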
Linear regression
Linear regression models are widely used in statistics and machine learning. These models use a straight line to describe the relationship between an independent variable and a dependent variable. For example, when analyzing the weight of fish, a linear regression model describes the relationship between the weight y of the fish and one of the independent variables X as follows:
$$y = mX + c$$
where m is the slope of the line that defines its steepness, and c is the y-intercept, the point where the line crosses the y-axis.
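To make m and c concrete, here is a minimal sketch that estimates them with the ordinary least squares formulas for a single feature, using the first five Bream rows from the table above (X-Length as X, Weight as y). The formulas are the standard textbook ones rather than something this article spells out, and the variable names are illustrative.

```python
import numpy as np

# X-Length and Weight of the first five Bream rows in Fish.txt
X = np.array([31.2, 31.1, 33.5, 34.0, 34.7])
y = np.array([290.0, 340.0, 363.0, 430.0, 450.0])

# Ordinary least squares for one feature:
#   m = sum((X - mean(X)) * (y - mean(y))) / sum((X - mean(X))^2)
#   c = mean(y) - m * mean(X)
m = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
c = y.mean() - m * X.mean()

print("slope m =", m)
print("intercept c =", c)

# The fitted line gives a predicted weight for any length
print("predicted weight at X = 33:", m * 33 + c)
```

Five rows of a single species are far too few for a useful model; the point is only to show where the two numbers come from.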
Selecting feature
The dataset contains five independent variables. A simple linear regression model uses only one feature, so a natural starting point is the feature most strongly related to the fish's Weight. One approach to find it is to calculate the correlation between Weight and each of the features.
Hidden code: (From the previous code block)
# Step 1: Importing libraries
import pandas as pd
# Step 2: Defining the columns of and reading our data frame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
Executable code:
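# Finding the correlation matrix of the numeric columns
# (Species holds text, so it is dropped before calling corr())
print(Fish.drop(columns=['Species']).corr())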
Output:
Weight V-Length D-Length X-Length Height Width
Weight 1.000000 0.915691 0.918625 0.923343 0.727260 0.886546
V-Length 0.915691 1.000000 0.999519 0.992155 0.627425 0.867002
D-Length 0.918625 0.999519 1.000000 0.994199 0.642392 0.873499
X-Length 0.923343 0.992155 0.994199 1.000000 0.704628 0.878548
Height 0.727260 0.627425 0.642392 0.704628 1.000000 0.794810
Width 0.886546 0.867002 0.873499 0.878548 0.794810 1.000000
After examining the first column, the following is observed:
- Weight has the strongest correlation with the feature X-Length.
- Weight has the weakest correlation with Height.
Given this information, if the individual is limited to using only one independent variable to predict the dependent variable, they should choose X-Length and not Height.
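The same choice can be made programmatically instead of by reading the matrix. The sketch below ranks the features by their correlation with Weight; it runs on eight rows copied from Fish.txt (an inline sample, so the ranking it prints need not match the full-file result), and the names fish_sample, ranking, and best are illustrative.

```python
import io
import pandas as pd

# Eight rows copied from Fish.txt (inline only so the sketch runs on its own;
# with the real file, use the Fish DataFrame loaded earlier instead)
sample = (
    "Species\tWeight\tV-Length\tD-Length\tX-Length\tHeight\tWidth\n"
    "Bream\t290\t24\t26.3\t31.2\t12.48\t4.3056\n"
    "Bream\t500\t28.5\t30.7\t36.2\t14.2266\t4.9594\n"
    "Roach\t40\t12.9\t14.1\t16.2\t4.1472\t2.268\n"
    "Roach\t160\t20.5\t22.5\t25.3\t7.0334\t3.8203\n"
    "Perch\t5.9\t7.5\t8.4\t8.8\t2.112\t1.408\n"
    "Perch\t514\t30.5\t32.8\t34\t10.03\t6.018\n"
    "Pike\t200\t30\t32.3\t34.8\t5.568\t3.3756\n"
    "Smelt\t6.7\t9.3\t9.8\t10.8\t1.7388\t1.0476\n"
)
fish_sample = pd.read_csv(io.StringIO(sample), sep="\t")

# Correlation of every numeric column with Weight, strongest first.
# Sorting by the raw value assumes the correlations are positive,
# as they are in the matrix shown above.
ranking = (
    fish_sample.drop(columns=["Species"])
    .corr()["Weight"]
    .drop("Weight")
    .sort_values(ascending=False)
)
print(ranking)

best = ranking.index[0]
print("Most correlated feature in this sample:", best)
```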
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
Splitting data
With the features and labels in place, the DataFrame can now be divided into training and test sets.
The training dataset trains the model, while the test dataset evaluates its performance.
The train_test_split function is imported from the sklearn library to split the data.
from sklearn.model_selection import train_test_split
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=10,
    shuffle=True
)
The arguments of the train_test_split function can be examined as follows:
- Line 4: Pass the feature and the label.
- Line 5: Use test_size=0.3 to select 70% of the data for training and the remaining 30% for testing purposes.
- Lines 6–7: Fix random_state so that the split is reproducible, and use shuffle=True to shuffle the rows before splitting, so that the test set is not drawn from a single stretch of the file, which is ordered by species.
As a result, the training data is obtained in the variables X_train and y_train, and the test data in X_test and y_test.
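The effect of these arguments is easy to check on a toy example. The following sketch uses ten made-up samples (X_demo and y_demo are not part of the fish data) to show the 70/30 sizes, the reproducibility that random_state provides, and that features and labels stay paired after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples: one feature column and a label equal to 10 * feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) * 10

split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10, shuffle=True)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10, shuffle=True)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print("train size:", len(X_train_a), "test size:", len(X_test_a))

# Same random_state, same rows in the test set
same = np.array_equal(X_test_a, split_b[1])
print("identical test sets:", same)

# Features and labels stay aligned after shuffling
aligned = np.array_equal(X_train_a.ravel() * 10, y_train_a)
print("labels still match features:", aligned)
```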
Applying model
At this point, the linear regression model can be created.
Hidden code:
# Step 1: Importing libraries
import pandas as pd
# 1.2
from sklearn.model_selection import train_test_split
# Step 2: Defining the columns of and reading our data frame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
# Step 4: Dividing the data into test and train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
Executable code:
from sklearn.linear_model import LinearRegression
# Step 5: Selecting the linear regression method from scikit-learn library
model = LinearRegression().fit(X_train, y_train)
- Line 1: The LinearRegression class is imported from the sklearn library.
- Line 3: Creates and trains the model using the training data X_train and y_train.
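Once fitted, the model exposes the slope and intercept of the line through its coef_ and intercept_ attributes. The sketch below shows this on toy data that lies exactly on y = 2x + 1 (X_demo, y_demo, and demo_model are illustrative names, not the fish model), so the recovered values are easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 2 * X_demo.ravel() + 1

demo_model = LinearRegression().fit(X_demo, y_demo)

m = demo_model.coef_[0]      # slope; coef_ holds one entry per feature
c = demo_model.intercept_    # y-intercept
print("m =", m, "c =", c)

# Predicting for an unseen value evaluates y = m*X + c
prediction = demo_model.predict(np.array([[5.0]]))
print(prediction)
```

The same two attributes are available on the model trained on X_train and y_train above.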
Model validation
Remember, 30% of the data was set aside for testing. The Mean Absolute Error (MAE) can be calculated using this data as an indicator of the average absolute difference between the predicted and actual values, with a lower MAE value indicating more accurate predictions. Other measures for model validation exist, but they won't be explored in this context.
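As a small worked example of the metric, the sketch below computes the MAE by hand for three made-up actual and predicted weights and compares it with the value returned by scikit-learn. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE = mean of |actual - predicted| = (10 + 10 + 30) / 3
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = metrics.mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```

Because the errors are averaged in the units of the label, an MAE can be read directly as "off by this much weight on average".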
Here's a complete running example that includes all of the steps described above to perform a linear regression.
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Step 2: Defining the columns of and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating the trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))
Output:
('MAE on train data= ', 105.08242420291623)
('MAE on test data = ', 108.7817508976745)
In this instance, the model.predict() function is applied to the training data on line 18, and on line 21, it is used on the test data. But what does it show?
Essentially, this approach compares the model's performance on the data it was trained on with its performance on unseen test data.
The two MAE values (about 105 on the training data and about 109 on the test data) are close, which suggests that the model performs on unseen data about as well as on the data it was trained on.
Note: It is essential to recall that X-Length was chosen as the feature because of its high correlation with the label. To verify the choice of feature, one can replace it with Height on line 10 and rerun the linear regression, then compare the two MAE values.
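One way to run that comparison without editing the script by hand is to wrap steps 3 to 6 in a small helper. The sketch below is an assumption-laden illustration: single_feature_mae is a hypothetical helper, and the demo DataFrame is synthetic stand-in data (not Fish.txt) in which Weight follows X-Length closely while Height is unrelated noise.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def single_feature_mae(df, feature, label="Weight"):
    """Train on one feature and return the MAE on the held-out test data."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[[feature]], df[label], test_size=0.3, random_state=10, shuffle=True
    )
    model = LinearRegression().fit(X_train, y_train)
    return metrics.mean_absolute_error(y_test, model.predict(X_test))

# Synthetic stand-in for Fish.txt (NOT the real data)
rng = np.random.default_rng(0)
length = rng.uniform(10, 60, size=100)
demo = pd.DataFrame({
    "X-Length": length,
    "Height": rng.uniform(2, 18, size=100),
    "Weight": 20 * length + rng.normal(0, 10, size=100),
})

for feature in ["X-Length", "Height"]:
    print(feature, single_feature_mae(demo, feature))
```

With the real file, the same helper could be called as single_feature_mae(Fish, 'Height') after loading the Fish DataFrame.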
Multiple linear regression
So far, only one feature, X-Length, has been used to train the model. However, other features are available that can be utilized to improve the predictions. These include the vertical length, diagonal length, height, and width of the fish, and they can be used to re-evaluate the linear regression model.
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
Mathematically, the multiple linear regression model can be written as follows:
$$y = m_1 X_1 + m_2 X_2 + \dots + m_n X_n + c$$
where m_i represents the weight of feature X_i in predicting y, n denotes the number of features, and c is the y-intercept.
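The sketch below illustrates how this equation maps onto a fitted scikit-learn model: coef_ holds one m_i per feature, intercept_ holds c, and predict() evaluates the weighted sum for each row. The toy data is generated from y = 3*X_1 + 5*X_2 + 7, so the names and numbers are illustrative and unrelated to the fish dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with n = 2 features, generated from y = 3*X_1 + 5*X_2 + 7
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = 3 * X_demo[:, 0] + 5 * X_demo[:, 1] + 7

demo_model = LinearRegression().fit(X_demo, y_demo)
print("m_i:", demo_model.coef_)       # one weight per feature
print("c:  ", demo_model.intercept_)

# predict() evaluates m_1*X_1 + ... + m_n*X_n + c for every row
manual = X_demo @ demo_model.coef_ + demo_model.intercept_
matches = np.allclose(manual, demo_model.predict(X_demo))
print("manual sum equals predict():", matches)
```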
Following the same steps as earlier, the performance of the model can be calculated by utilizing all the features.
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Step 2: Defining the columns and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
# Step 4: Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating the trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))
Output:
('MAE on train data= ', 88.6176233769433)
('MAE on test data = ', 104.71922684746642)
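The test MAE improves only slightly, from about 108.8 with a single feature to about 104.7, while the training MAE drops from about 105.1 to about 88.6. The gain is modest because the three length features are almost perfectly correlated with one another (above 0.99 in the correlation matrix), so the extra lengths add little new information.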
Polynomial regression
This section introduces polynomial regression, which is used when the assumption of a linear relationship between the features and the label is not accurate. By allowing for a more flexible fit to the data, polynomial regression can capture more complex relationships.
For example, if the relationship between the dependent variable and the independent variables is not a straight line, a polynomial regression model can describe it more closely, which can lead to more accurate predictions.
Mathematically, the relationship between dependent and independent variables is described using the following equation:
$$y = m_1 Z_1 + m_2 Z_2 + \dots + m_n Z_n + c$$
The above equation looks very similar to the one used earlier to describe multiple linear regression. However, it includes the transformed features Z_i, which are polynomial versions of the X_i used in multiple linear regression.
This can be further explained using an example of two features X_1 and X_2, from which new degree-2 features can be created, such as:
$$Z_1 = X_1, \quad Z_2 = X_2, \quad Z_3 = X_1^2, \quad Z_4 = X_1 X_2, \quad Z_5 = X_2^2$$
The new polynomial features can be created based on trial and error or techniques like cross-validation. The degree of the polynomial can also be chosen based on the complexity of the relationship between the variables.
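To see what such a transformation produces in practice, the sketch below applies scikit-learn's PolynomialFeatures with degree=2 and include_bias=False to a single made-up sample with X_1 = 2 and X_2 = 3, and then checks how many columns the same settings generate for five input features. The arrays are illustrative and not taken from the fish data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: X_1 = 2, X_2 = 3
X_demo = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
Z_demo = poly.fit_transform(X_demo)

# Columns are X_1, X_2, X_1^2, X_1*X_2, X_2^2
print(Z_demo)

# With five input features, degree 2 yields 20 columns:
# 5 original + 5 squares + 10 pairwise products
Z_five = PolynomialFeatures(degree=2, include_bias=False).fit_transform(np.ones((1, 5)))
print(Z_five.shape)
```

The growth in column count is worth keeping in mind: higher degrees add features quickly, which makes overfitting more likely on a small dataset.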
The following example presents a polynomial regression and validates the model's performance.
# Step 1: Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
# Step 2: Defining the columns and reading the DataFrame
columns = ['Species', 'Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']
Fish = pd.read_csv('Fish.txt', sep='\t', usecols=columns)
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
# Step 4: Generating polynomial features
Z = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
# Dividing the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(Z, y, test_size=0.3, random_state=10)
# Step 5: Selecting the linear regression method from the scikit-learn library
model = LinearRegression().fit(X_train, y_train)
# Step 6: Validation
# Evaluating the trained model on training data
y_prediction = model.predict(X_train)
print("MAE on train data= " , metrics.mean_absolute_error(y_train, y_prediction))
# Evaluating our trained model on test data
y_prediction = model.predict(X_test)
print("MAE on test data = " , metrics.mean_absolute_error(y_test, y_prediction))
Output:
('MAE on train data= ', 30.44121990999409)
('MAE on test data = ', 32.558434580499224)
The features were transformed using the PolynomialFeatures class on line 14. PolynomialFeatures is imported from the sklearn library on line 6.
Note that the MAE in this case (about 30 on the training data and about 33 on the test data) is far lower than that of the linear regression models (above 100 on the test data), implying that the linear assumption was not entirely accurate.
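The same effect can be reproduced on data where the non-linearity is known in advance. The sketch below is an alternative arrangement rather than the article's own code: it chains PolynomialFeatures and LinearRegression with make_pipeline so the transformation and the model are fitted together, and it uses synthetic data (not Fish.txt) whose label grows with the square of the feature.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (NOT Fish.txt): the label is 4 times the squared feature
X_demo = np.linspace(1, 10, 50).reshape(-1, 1)
y_demo = 4 * X_demo.ravel() ** 2

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)

# Straight-line model
linear = LinearRegression().fit(X_train, y_train)

# Degree-2 polynomial model: transform, then fit a linear model on Z
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
).fit(X_train, y_train)

linear_mae = metrics.mean_absolute_error(y_test, linear.predict(X_test))
poly_mae = metrics.mean_absolute_error(y_test, poly.predict(X_test))
print("linear MAE:    ", linear_mae)
print("polynomial MAE:", poly_mae)
```

A pipeline also keeps prediction simple: new samples can be passed to poly.predict() in their original form, and the polynomial features are generated automatically.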
This blog has provided a quick introduction to machine learning regression models with Python. Don't stop here! Explore and practice different techniques and libraries to build more accurate and robust models.
Good luck, and happy learning!