Brushing Up on Linear Regression in Python: Theory to Practice

esakik

Koki Esaki

Posted on February 1, 2024


Having completed an extensive machine learning course, I've noticed that my memory of the material is starting to fade. To keep it fresh, I've decided to write a series of review articles.

Introduction

Suppose the x-axis represents age and the y-axis represents income. The plotted data looks as though it could be roughly described by a linear function.

[Figure: Linear Regression with One Variable]

The blue line is merely a visual guide and is not mathematically fitted; we still need a procedure to determine the actual equation of this line.

Hypothesis Function

We adjust the free parameters θ_0 and θ_1 of the function to formulate an expression that fits the data with minimal error.

h_\theta(x) = \theta_0 + \theta_1 x_1

For scenarios involving multiple variables, the formula is structured as follows. The model is now a hyperplane rather than a straight line, but the underlying principle stays the same.

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + ... + \theta_n x_n
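
As a concrete illustration, the hypothesis function is just the bias plus a dot product between the parameters and the features. Here is a minimal NumPy sketch with made-up values for theta0, theta, and x:

import numpy as np

theta0 = 0.5                        # Bias term (theta_0), made-up value
theta = np.array([1.2, -0.3, 0.8])  # Weights theta_1 ... theta_n, made-up values
x = np.array([3.0, 10.0, 2.5])      # One sample with n = 3 features

h = theta0 + np.dot(theta, x)  # h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n
print(h)  # 3.1 for these values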

Cost Function

The cost function is the tool used to fit the hypothesis function. Simply put, it measures the average discrepancy between the predicted results and the actual outputs. By finding the parameters θ that minimize this error, we obtain the true parameters of the hypothesis function. This particular cost is commonly known as the mean squared error (MSE).

J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2

The division by 2 in J(θ) is there simply to make the derivative cleaner when we differentiate the function later.
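
To make the formula concrete, here is a minimal sketch of the cost function in NumPy, using small made-up arrays for the predictions and the actual values:

import numpy as np

y_pred = np.array([2.0, 3.5, 1.0])  # Hypothetical predictions h_theta(x^(i))
y = np.array([2.5, 3.0, 1.5])       # Hypothetical actual values y^(i)
m = len(y)                          # Number of samples

J = (1 / (2 * m)) * np.sum((y_pred - y) ** 2)  # MSE with the extra 1/2 factor
print(J)  # 0.125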

Optimization Using Gradient Descent

We now need a strategy to optimize (in this case, minimize) the cost function.

The mean squared error is minimized where the derivative of the cost function equals zero. Gradient descent searches for this point iteratively, using the update formula below.

[Figure: Gradient Descent 2D]

The method repeatedly applies this update until the parameter values converge:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}

Partial differentiation means differentiating with respect to one parameter while treating the other parameters as constants. The resulting gradient is multiplied by α, called the learning rate, and subtracted from the current θ_j to obtain the updated θ_j. As the gradient approaches 0, the change in θ_j becomes smaller and approaches 0 regardless of the value of α. When the change becomes small enough, we say the algorithm has converged.

[Figure: Gradient Descent 3D]

Note that if α is too large, the change in θ_j can become too big and the algorithm may fail to converge. Conversely, a smaller α gives slower but more reliable convergence. Also, the parameters θ_0, θ_1, ..., θ_j must all be updated simultaneously; this is a fundamental requirement of the method.
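
The simultaneous update falls out naturally if the whole gradient is computed with vectorized operations before any parameter is changed. Below is a minimal sketch of a single gradient descent step on dummy data; the names X, y, theta, theta0, and alpha are placeholders, not part of the implementation that follows.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))  # 100 samples, 3 features (dummy data)
y = rng.random(100)       # Dummy target values
theta = np.zeros(3)       # Weights, initialized to zero
theta0 = 0.0              # Bias
alpha = 0.01              # Learning rate
m = len(y)

# One gradient descent step: compute all gradients first, then update simultaneously
y_pred = X @ theta + theta0
grad = (1 / m) * (X.T @ (y_pred - y))  # Gradient with respect to the weights
grad0 = (1 / m) * np.sum(y_pred - y)   # Gradient with respect to the bias
theta = theta - alpha * grad
theta0 = theta0 - alpha * grad0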

Implementation of Linear Regression

In this section, we will develop a linear regression model utilizing the gradient descent technique. We will use the California Housing dataset from the scikit-learn library for this example.

To begin, we will import the necessary libraries and load the dataset.

pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4 seaborn==0.13.2
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection

dataset = datasets.fetch_california_housing()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)

The dataset comprises 8 features and 20,640 samples. The target variable is the median house value of each block group, expressed in units of 100,000 USD.
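
You can confirm the size of the dataset and glance at the target with a few lines, reusing X, y, and dataset from the snippet above:

print(X.shape)                # (20640, 8) -- 20,640 samples, 8 features
print(dataset.feature_names)  # MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
print(y.describe())           # Summary statistics of the target (median house value, in 100,000 USD)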

The code provided next will generate a plot of the correlation matrix for this dataset.

import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = pd.concat([X, y], axis=1).corr()

plt.figure(figsize=(9, 6))
plt.title("Correlation Matrix")
sns.heatmap(corr_matrix, annot=True, square=True, cmap="Blues", fmt=".2f", linewidths=.5)
plt.savefig("california_housing_corr_matrix.png")

[Figure: california_housing_corr_matrix.png]

The correlation matrix reveals that median income (MedInc) is the feature most strongly correlated with the target variable, while the other features correlate comparatively weakly. For simplicity, however, this example uses all of the features.
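
If you prefer numbers to a heatmap, the same information can be read directly from the corr_matrix computed above, sorted by absolute correlation with the target:

# Each feature's correlation with the target, strongest first
print(corr_matrix["target"].drop("target").sort_values(key=abs, ascending=False))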

We are now set to build the linear regression model. To gain a deeper understanding of its mechanics, we'll create it from the ground up, without relying on pre-existing machine learning libraries.

class LinearRegression:

    def __init__(self, alpha: float = 1e-7, eps: float = 1e-4) -> None:
        self.alpha = alpha  # Learning rate for gradient descent
        self.eps = eps  # Threshold of convergence

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "LinearRegression":
        """Train the model. Optimization method is gradient descent.

        :param X: The feature values of the training data.
        :param y: The target values of the training data.
        :return: The trained model.
        """
        self._m = X.shape[0]  # The number of samples
        num_features = X.shape[1]  # The number of features

        self._theta = np.zeros(num_features)  # Parameters (weight) of the model (without bias)
        self._theta0 = np.zeros(1)  # Bias of the model

        self._error_values = []  # The output values of the cost function in each iteration
        self._grad_values = []  # Gradient values in each iteration
        self._iter_counter = 0  # The counter of iterations

        error = self.J(X, y)  # The initial output value of the cost function with the zero-initialized parameters
        diff = 1.0  # The difference between the previous and the current output values of the cost function

        # Repeat until convergence
        while diff > self.eps:
            # Update the parameters by gradient descent
            y_pred = self.predict(X)  # Predict the target values with the current parameters
            grad = (1 / self._m) * np.dot(y_pred - y, X)  # Calculate the gradient using the formula
            self._theta -= self.alpha * grad  # Update the parameters
            self._theta0 -= (1 / self._m) * np.sum(y_pred - y)  # Update the bias (note: unlike the weights, the learning rate is not applied to this step)

            # Print the current status
            _error = self.J(X, y)  # Compute the error with the updated parameters
            diff = abs(error - _error)  # Compute the difference between the previous and the current error
            error = _error  # Update the error
            self._error_values.append(error)
            self._grad_values.append(grad.sum())
            self._iter_counter += 1
            print(f"[{self._iter_counter}] error: {error}, diff: {diff}, grad: {grad.sum()}")
        print(f"Convergence in {self._iter_counter} iterations.")
        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """Predict the target values using the hypothesis function.

        :param X: The feature values of the data.
        :return: The predicted target values.
        """
        # Pass the bias and the parameters to the hypothesis function
        theta = np.concatenate([self._theta0, self._theta])
        return self.h(X, theta)

    def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
        """Hypothesis function.

        :param X: The feature values of the data.
        :param theta: The parameters (weight) of the model.
        :return: The predicted target values.
        """
        # theta[0] is bias and theta[1:] is parameters
        return np.dot(X, theta[1:].T) + theta[0]

    def J(self, X: pd.DataFrame, y: pd.Series) -> float:
        """Cost function. Mean squared error (MSE).

        :param X: The feature values of the data.
        :param y: The target values of the data.
        :return: The error value.
        """
        y_pred = self.predict(X)  # Predict the target values with the current parameters
        return (1 / (2 * self._m)) * np.sum((y_pred - y) ** 2)  # Compute the error using the formula

The code is explained in detail through its comments, so please refer to them for further understanding. Next, we train the model and assess its performance on both the training and test data sets.

model = LinearRegression()
model.fit(X_train, y_train)

The results of the training process can be visualized using the following plotting method.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 5))

ax = fig.add_subplot(1, 2, 1)
ax.set_title("MSE")
ax.set_ylabel("Error")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._error_values, color="b")

ax = fig.add_subplot(1, 2, 2)
ax.set_title("Gradient")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._grad_values, color="r")

plt.show()

[Figure: Training Process]

Now, let's evaluate the model on both the training and test data sets.

from sklearn.metrics import mean_squared_error

y_train_pred = model.predict(X_train)
print(f"MSE for train data: {mean_squared_error(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test)
print(f"MSE for test data: {mean_squared_error(y_test, y_test_pred)}")

The Mean Squared Error (MSE) for the training data stands at 1.33, while for the test data, it is 1.32. The marginally lower MSE for the test data suggests that the model is not overfitting, which is a positive indication of its generalization capability.

MSE for train data: 1.3350646294600155
MSE for test data: 1.322791709774531

Using the scikit-learn library, the same kind of model can be implemented far more concisely. Note that scikit-learn's LinearRegression solves the ordinary least-squares problem directly rather than using gradient descent, which is why it reaches a noticeably lower MSE than our from-scratch implementation, as shown below.

from sklearn.linear_model import LinearRegression as SklearnLinearRegression

sklearn_model = SklearnLinearRegression()
sklearn_model.fit(X_train, y_train)

sklearn_y_train_pred = sklearn_model.predict(X_train)
print(f"MSE for train data: {mean_squared_error(y_train, sklearn_y_train_pred)}")
sklearn_y_test_pred = sklearn_model.predict(X_test)
print(f"MSE for test data: {mean_squared_error(y_test, sklearn_y_test_pred)}")
MSE for train data: 0.5192270684511334
MSE for test data: 0.5404128061709085

Regularization

Regularization shrinks the weights to prevent overfitting, making it difficult for any single feature to dominate with a large weight. It seeks the set of weights that minimizes the cost function within a given constraint on their size.

Ridge Regression

Ridge regression is one of the linear regression methods. The prediction equation is the same as in ordinary linear regression, but L2 regularization is added to avoid overfitting. By keeping each weight as close to zero as possible, it achieves better generalization performance.

Cost function:

J(\theta) = \frac{1}{2m}\left(\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right)

Gradient descent with Regularization:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \alpha \left(\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j\right)

Thus, ridge regression uses the L2 norm for the regularization, which is calculated with the Euclidean distance:

d = \sqrt{(b_1 - a_1)^2 + (b_2 - a_2)^2}
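
For reference, ridge regression is also available in scikit-learn as Ridge, whose alpha parameter plays the role of the regularization strength λ. A minimal sketch reusing the earlier train/test split (alpha=1.0 is an arbitrary choice here):

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

ridge_model = Ridge(alpha=1.0)  # alpha acts as the regularization strength (lambda)
ridge_model.fit(X_train, y_train)

print(f"MSE for train data: {mean_squared_error(y_train, ridge_model.predict(X_train))}")
print(f"MSE for test data: {mean_squared_error(y_test, ridge_model.predict(X_test))}")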

Lasso Regression

Lasso regression applies L1 regularization, leading to some weights becoming zero. This results in certain features being entirely excluded from the model. With some weights set to zero, the model simplifies and clarifies which features are significant.

Cost function:

J(\theta) = \frac{1}{2m}\left(\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}|\theta_j|\right)

Thus, lasso regression uses the L1 norm for the regularization, which is calculated with the Manhattan distance:

d = |b_1 - a_1| + |b_2 - a_2|
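
Likewise, lasso regression is available in scikit-learn as Lasso. A minimal sketch, again reusing the earlier split (alpha=0.1 is an arbitrary value); after fitting, inspecting coef_ shows which weights have been driven exactly to zero:

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

lasso_model = Lasso(alpha=0.1)  # alpha acts as the regularization strength (lambda)
lasso_model.fit(X_train, y_train)

print(lasso_model.coef_)  # Coefficients forced to exactly zero indicate dropped features
print(f"MSE for test data: {mean_squared_error(y_test, lasso_model.predict(X_test))}")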

