Brushing Up on Linear Regression in Python: Theory to Practice
Koki Esaki
Posted on February 1, 2024
Having completed an extensive machine learning course, I've noticed that my memory of the material is starting to fade. To address this, I've decided to write a series of review articles.
Introduction
Suppose the x-axis represents age and the y-axis income: the plotted data looks as though it could be roughly described by a linear function.
The blue line, however, is only a visual guide drawn by eye, not the result of any calculation, so we need a procedure for determining the actual equation of that line.
Hypothesis Function
The hypothesis function is the expression we fit to the data; we adjust its free parameters so that it describes the data with as little error as possible.
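In the single-variable case above, a standard way to write the hypothesis function is

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where the free parameters are the intercept (bias) $\theta_0$ and the slope $\theta_1$.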
For scenarios involving multiple input variables, the formula is structured as follows. It can no longer be drawn as a single straight line, yet the foundational principle stays the same.
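With $n$ input variables $x_1, \dots, x_n$, the hypothesis in its usual form becomes

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$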
Cost Function
The cost function is the tool used to fit the hypothesis function. Simply put, it measures the average discrepancy between the predicted values and the actual outputs. By finding the parameters that minimize this error, we obtain the parameters of the hypothesis function. This measure is commonly known as the "mean squared error (MSE)".
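In the usual notation, with $m$ training samples $(x^{(i)}, y^{(i)})$, the cost function is

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$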
The division by 2 in the function is there only to simplify the derivative when we differentiate the function later.
Optimization Using Gradient Descent
We need a strategy for optimizing (in this case, minimizing) the cost function so that it yields the best possible parameters.
The mean squared error is minimized where its derivative with respect to the parameters equals zero. Gradient descent searches for this minimum with the following update rule, which is applied repeatedly until the parameter values converge:
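In its standard form, the update applied to every parameter $\theta_j$ (for $j = 0, 1, \dots, n$, all at the same time) is

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where, by convention, $x_0^{(i)} = 1$ so that the same expression also covers the bias term $\theta_0$.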
Partial differentiation is applied: each parameter θ_j is differentiated while the other parameters are treated as constants. The resulting gradient is then multiplied by α, called the learning rate, and subtracted from the current θ_j to obtain the updated θ_j. As the gradient approaches 0, whatever the value of α, the change in θ_j becomes smaller and smaller. When the change becomes small enough, we say the algorithm has converged.
Note that if α is too large, the change in θ_j at each step becomes too big and the algorithm may fail to converge. Conversely, a smaller α gives slower but more reliable convergence. Also, all of the parameters θ_j must be updated simultaneously; this is a fundamental requirement of the method.
Implementation of Linear Regression
In this section, we will develop a linear regression model utilizing the gradient descent technique. We will use the California Housing dataset from the scikit-learn library for this example.
To begin, we will import the necessary libraries and load the dataset.
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4 seaborn==0.13.2
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
dataset = datasets.fetch_california_housing()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
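As a quick check of what was loaded (a small inspection snippet; the printed shapes match the dataset's documented dimensions), we can look at the shape of the data and the first few rows:

print(X.shape)   # (20640, 8): 20,640 samples and 8 features
print(y.shape)   # (20640,)
print(X.head())  # Preview the first five rows of the features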
The dataset comprises 8 features and 20,640 samples. The target variable is the median house value of each block group, expressed in units of 100,000 USD.
The code provided next will generate a plot of the correlation matrix for this dataset.
import seaborn as sns
import matplotlib.pyplot as plt
corr_matrix = pd.concat([X, y], axis=1).corr()
plt.figure(figsize=(9, 6))
plt.title("Correlation Matrix")
sns.heatmap(corr_matrix, annot=True, square=True, cmap="Blues", fmt=".2f", linewidths=.5)
plt.savefig("california_housing_corr_matrix.png")
The correlation matrix reveals that median income (MedInc) is the feature most strongly correlated with the target variable, while the remaining features show comparatively weaker correlations. For simplicity, however, all features will be used in this example.
We are now set to build the linear regression model. To gain a deeper understanding of its mechanics, we'll create it from the ground up, without relying on pre-existing machine learning libraries.
class LinearRegression:

    def __init__(self, alpha: float = 1e-7, eps: float = 1e-4) -> None:
        self.alpha = alpha  # Learning rate for gradient descent
        self.eps = eps  # Threshold of convergence
    def fit(self, X: pd.DataFrame, y: pd.Series) -> "LinearRegression":
        """Train the model. Optimization method is gradient descent.

        :param X: The feature values of the training data.
        :param y: The target values of the training data.
        :return: The trained model.
        """
        self._m = X.shape[0]  # The number of samples
        num_features = X.shape[1]  # The number of features
        self._theta = np.zeros(num_features)  # Parameters (weights) of the model (without bias)
        self._theta0 = np.zeros(1)  # Bias of the model
        self._error_values = []  # The output values of the cost function in each iteration
        self._grad_values = []  # Gradient values in each iteration
        self._iter_counter = 0  # The counter of iterations
        error = self.J(X, y)  # The initial output value of the cost function with the initial (zero) parameters
        diff = 1.0  # The difference between the previous and the current output values of the cost function

        # Repeat until convergence
        while diff > self.eps:
            # Update the parameters by gradient descent
            y_pred = self.predict(X)  # Predict the target values with the current parameters
            grad = (1 / self._m) * np.dot(y_pred - y, X)  # Calculate the gradient using the formula
            self._theta -= self.alpha * grad  # Update the parameters
            self._theta0 -= (1 / self._m) * np.sum(y_pred - y)  # Update the bias (no learning rate is applied here)

            # Print the current status
            _error = self.J(X, y)  # Compute the error with the updated parameters
            diff = abs(error - _error)  # Compute the difference between the previous and the current error
            error = _error  # Update the error
            self._error_values.append(error)
            self._grad_values.append(grad.sum())
            self._iter_counter += 1
            print(f"[{self._iter_counter}] error: {error}, diff: {diff}, grad: {grad.sum()}")

        print(f"Convergence in {self._iter_counter} iterations.")
        return self
    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the target values using the hypothesis function.

        :param X: The feature values of the data.
        :return: The predicted target values.
        """
        # Pass the bias and the parameters to the hypothesis function
        theta = np.concatenate([self._theta0, self._theta])
        return self.h(X, theta)
    def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
        """Hypothesis function.

        :param X: The feature values of the data.
        :param theta: The parameters (weights) of the model.
        :return: The predicted target values.
        """
        # theta[0] is the bias and theta[1:] are the weights
        return np.dot(X, theta[1:].T) + theta[0]
    def J(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Cost function. Mean squared error (MSE).

        :param X: The feature values of the data.
        :param y: The target values of the data.
        :return: The error value.
        """
        y_pred = self.predict(X)  # Predict the target values with the current parameters
        return (1 / (2 * self._m)) * np.sum((y_pred - y) ** 2)  # Compute the error using the formula
The code is explained in detail through its comments; please refer to them for the specifics. Next, we train the model and then assess its performance on both the training and test data sets.
model = LinearRegression()
model.fit(X_train, y_train)
The results of the training process can be visualized with the following plotting code.
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(15, 5))
ax = fig.add_subplot(1, 2, 1)
ax.set_title("MSE")
ax.set_ylabel("Error")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._error_values, color="b")
ax = fig.add_subplot(1, 2, 2)
ax.set_title("Gradient")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._grad_values, color="r")
plt.show()
Now, let's evaluate the model on both the training and test data sets.
from sklearn.metrics import mean_squared_error
y_train_pred = model.predict(X_train)
print(f"MSE for train data: {mean_squared_error(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test)
print(f"MSE for test data: {mean_squared_error(y_test, y_test_pred)}")
The mean squared error (MSE) is 1.33 on the training data and 1.32 on the test data. Since the test error is no higher than the training error, the model does not appear to be overfitting, which is a positive sign for its generalization capability.
MSE for train data: 1.3350646294600155
MSE for test data: 1.322791709774531
By using the scikit-learn library, the same kind of model can be implemented with far less code. Note that scikit-learn's LinearRegression computes the ordinary least-squares solution directly rather than running gradient descent, which is why it reaches a noticeably lower error than our simple implementation.
from sklearn.linear_model import LinearRegression as SklearnLinearRegression
sklearn_model = SklearnLinearRegression()
sklearn_model.fit(X_train, y_train)
sklearn_y_train_pred = sklearn_model.predict(X_train)
print(f"MSE for train data: {mean_squared_error(y_train, sklearn_y_train_pred)}")
sklearn_y_test_pred = sklearn_model.predict(X_test)
print(f"MSE for test data: {mean_squared_error(y_test, sklearn_y_test_pred)}")
MSE for train data: 0.5192270684511334
MSE for test data: 0.5404128061709085
Regularization
Regularization shrinks the weights to prevent overfitting, by penalizing any single feature for taking on a very large weight. It seeks the set of weights that minimizes the cost function subject to a constraint on the size of the weights.
Ridge Regression
Ridge regression is one of the linear regression methods. The prediction equation is the same as in ordinary linear regression, but L2 regularization is added to avoid overfitting. By keeping each weight as close to zero as possible, it achieves better generalization performance.
Cost function:
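In the same notation as before, with $\lambda$ denoting the regularization strength, a standard way to write the ridge cost function is

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

Note that the bias $\theta_0$ is conventionally left out of the penalty term.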
Gradient descent with regularization:
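The corresponding update for each weight $\theta_j$ (with $j = 1, \dots, n$) is

$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

while the bias $\theta_0$ is updated exactly as in unregularized gradient descent.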
Thus, ridge regression uses the L2 norm for the regularization, which is calculated with the Euclidean distance:
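$$\lVert \theta \rVert_2 = \sqrt{\theta_1^2 + \theta_2^2 + \dots + \theta_n^2}$$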
Lasso Regression
Lasso regression applies L1 regularization, which drives some weights to exactly zero, so the corresponding features are excluded from the model entirely. This both simplifies the model and makes it clear which features are significant.
Cost function:
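In the same notation, the lasso cost function is commonly written as

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \lvert \theta_j \rvert$$

(the exact scaling applied to the $\lambda$ term varies between references).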
Thus, lasso regression uses the L1 norm for the regularization, which is calculated with the Manhattan distance:
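$$\lVert \theta \rVert_1 = \lvert \theta_1 \rvert + \lvert \theta_2 \rvert + \dots + \lvert \theta_n \rvert$$

Both regularized variants are available in scikit-learn as Ridge and Lasso. As a minimal sketch on the same train/test split as above (the regularization strengths alpha=1.0 and alpha=0.01 are arbitrary example values, not tuned), they could be tried like this:

from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Ridge regression: L2 penalty, alpha controls the regularization strength
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
print(f"Ridge MSE for test data: {mean_squared_error(y_test, ridge_model.predict(X_test))}")

# Lasso regression: L1 penalty, larger alpha drives more weights to exactly zero
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train, y_train)
print(f"Lasso MSE for test data: {mean_squared_error(y_test, lasso_model.predict(X_test))}")
print(f"Lasso coefficients: {lasso_model.coef_}")  # Zero entries mark excluded features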