Exploring Gradient Descent Variants, and Fundamentals of Implementation

Introduction

After grasping the concepts of linear regression and its optimization technique, gradient descent, in the previous article, here's an opportunity to dive deeper into gradient descent for a more comprehensive understanding.

Types of Gradient Descent

The gradient descent methods can be broadly categorized into three primary types.

Batch Gradient Descent

In this optimization technique, the entire dataset is used to compute the gradient of the cost function. Essentially, it involves evaluating the loss and updating model parameters once per epoch (a complete pass through the dataset). Batch Gradient Descent is computationally efficient for small to medium-sized datasets but can be slow for large datasets.

Stochastic Gradient Descent, SGD

Unlike Batch Gradient Descent, SGD processes one training example at a time to calculate the gradient. This results in frequent updates to the model parameters and introduces more randomness into the optimization process. While it can be faster and can escape local minima, it may exhibit more oscillations in convergence due to the noise from individual data points.

Mini-batch Gradient Descent

Mini-batch Gradient Descent strikes a balance between Batch and Stochastic Gradient Descent. It divides the dataset into smaller subsets called mini-batches. The gradient is calculated and model parameters are updated after processing each mini-batch. This approach combines some benefits of both previous methods: it's computationally efficient and introduces some noise for faster convergence.

Implementation of Gradient Descent

For the scope of this article, our primary focus will be on the batch gradient descent method. Let's consider a simple example to illustrate the gradient descent method. We will use the following cost function:

f(x, y) = 3x^2 - 2xy + 3y^2 + 5x - 5y

The partial derivatives of $f(x, y)$ with respect to $x$ and $y$ are:

\nabla f(x, y) = \begin{bmatrix} 6x - 2y + 5 \ -2x + 6y - 5 \end{bmatrix}

Here, we will minimize the cost function $f(x, y)$ using the gradient descent method. We will start with an initial point $(x_0, y_0) = (1.0, 1.0)$ .

To begin, let's import the required libraries and define the problem we aim to solve.

pip install numpy==1.23.5 matplotlib==3.7.4

import numpy as np


def f(solution: np.ndarray) -> float:
    """The function to minimize.

    :param solution: The solution to the function.
    :return: The value of the function.
    """
    x, y = solution  # Unpack the solution
    return 3 * x ** 2 - 2 * x * y + 3 * y ** 2 + 5 * x - 5 * y


def df(solution: np.ndarray) -> np.ndarray:
    """The derivative of the function.

    :param solution: The solution to the function.
    :return: The gradient of the function.
    """
    x, y = solution  # Unpack the solution
    return np.array([6 * x - 2 * y + 5, -2 * x + 6 * y - 5])

Subsequently, we will proceed to implement the gradient descent method without relying on existing libraries.

from typing import Callable


class GradientDescent:
    """Gradient Descent Method."""

    def __init__(self, f: Callable, df: Callable, alpha: float = 0.01, eps: float = 1e-6) -> None:
        """Initialize the gradient descent method.

        :param f: The function to minimize.
        :param df: The derivative of the function.
        :param alpha: The learning rate.
        :param eps: The convergence criterion.
        """
        self.f = f
        self.df = df
        self.alpha = alpha
        self.eps = eps

        self.solutions = []  # Store the solutions (parameters) at each iteration
        self.answers = []  # Store the value of the function at each iteration
        self.gradients = []  # Store the gradient of the function at each iteration

    def solve(self) -> None:
        """Solve the optimization problem."""
        self.solutions = []  # Empty the solutions
        self.answers = []  # Empty the value of the function
        self.gradients = []  # Empty the gradient of the function

        solution = np.array([1.0, 1.0])  # Initial solution
        answer = self.f(solution)  # Value of the function at the initial solution
        grad = self.df(solution)  # Gradient of the function at the initial solution

        self.solutions.append(solution)
        self.answers = [answer]
        self.gradients.append(grad)

        # Iterate until the gradient is close to zero
        while (grad ** 2).sum() > self.eps ** 2:
            solution = solution - self.alpha * grad  # Update the solution
            answer = self.f(solution)  # Value of the function at the updated solution
            grad = self.df(solution)  # Gradient of the function at the updated solution

            self.solutions.append(solution)
            self.answers.append(answer)
            self.gradients.append(grad)

        self.solutions = np.array(self.solutions)
        self.answers = np.array(self.answers)
        self.gradients = np.array(self.gradients)

With the implementation described above, we can address the optimization problem as follows while also visualizing the optimization process.

problem = GradientDescent(f, df)
problem.solve()

import matplotlib.pyplot as plt


plt.scatter(problem.solutions[0, 0], problem.solutions[0, 1], color="k", marker="o", label="Initial Solution")
plt.plot(problem.solutions[:, 0], problem.solutions[:, 1], color="k", linewidth=1.5)
xs = np.linspace(-2.5, 1.5, 100)
ys = np.linspace(-1.5, 2.5, 100)
xmesh, ymesh = np.meshgrid(xs, ys)
z = np.concatenate([xmesh.reshape(1, -1), ymesh.reshape(1, -1)], axis=0)
levels = [-3, -2.8, -2.6, -2.4, -2.2, -2, -1, 0, 1, 2, 3, 4]
plt.contour(xs, ys, f(z).reshape(xmesh.shape), levels=levels, colors="k", linestyles="dotted")
plt.show()

The problem.solutions attribute will contain the solutions at each iteration, while the problem.answers attribute will contain the value of the function at each iteration. We can visualize the convergence of the gradient descent method using the following code.

fig = plt.figure(figsize=(15, 5))

ax = fig.add_subplot(2, 2, 1)
ax.set_title("Gradient (x)")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(len(problem.gradients)), problem.gradients[:, 0], color="b")

ax = fig.add_subplot(2, 2, 2)
ax.set_title("Gradient (y)")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(len(problem.gradients)), problem.gradients[:, 1], color="r")

ax = fig.add_subplot(2, 2, 3)
ax.set_title("Answer")
ax.set_ylabel("Value")
ax.set_xlabel("Iteration")
ax.plot(np.arange(len(problem.answers)), problem.answers, color="r")

plt.show()