# Optimising linear regression: A guide to applying gradient descent


Daniel Cooper

Posted on February 26, 2023

Linear regression is a statistical technique that helps us understand and predict relationships between variables. It is widely used in data science and machine learning.

At its core, linear regression involves finding the best-fitting line through a set of data points, allowing us to make predictions about future values based on past observations.

*Figure: Linear regression*

Finding the line can be a complex and computationally-intensive task, especially when dealing with large datasets.

Enter gradient descent... an optimisation algorithm that we can use to find the best-fitting line by iteratively adjusting its parameters.

It's worth noting that linear regression is a form of supervised learning where our algorithm is trained on "labelled" data. This simply reflects the fact that, within the available data, the inputs have corresponding outputs or labels.

In this blog, we will look at the maths, specifically the partial derivatives, involved in applying gradient descent to linear regression.

## Model

Given a training set, our aim is to model the relationship so that our hypothesis $\hat{y}$ is a good predictor for the corresponding value of $y$.

For simple or univariate linear regression, our model can be expressed as:

$$\hat{y} = \theta_0 + \theta_1 x_1$$

And for multivariate linear regression:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_j x_j$$
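
To make the hypothesis concrete, here's a minimal Python sketch; the `predict` helper and the parameter values are illustrative choices of mine, not from the post:

```python
import numpy as np

def predict(theta, x):
    """Hypothesis: y_hat = theta_0 + theta_1 * x_1 + ... + theta_j * x_j."""
    return theta[0] + np.dot(theta[1:], x)

theta = np.array([0.5, 2.0, 3.0])  # theta_0, theta_1, theta_2 (illustrative values)
x = np.array([5.0, 6.0])           # features x_1, x_2
y_hat = predict(theta, x)          # 0.5 + 2.0*5 + 3.0*6 = 28.5
```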

## Loss function

We can measure the accuracy of our hypothesis $\hat{y}$ by calculating the mean squared error, including a factor of $\frac{1}{2}$ to simplify the derivatives that follow later.

$$\text{Loss},\; l = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

*Figure: Mean squared error*
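
As a sketch, this loss might be computed like so, assuming the $m$ training examples are stacked in a matrix `X` with one row per example (the names here are my own):

```python
import numpy as np

def loss(theta, X, y):
    """Half mean squared error: l = (1/2m) * sum((y_hat - y)^2)."""
    y_hat = theta[0] + X @ theta[1:]   # predictions for all m examples
    return np.sum((y_hat - y) ** 2) / (2 * len(y))
```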

## Gradient descent

We need to minimise the loss to ensure that $\hat{y}$ is a good predictor of $y$.

Gradient descent is well described elsewhere, but the principle is that we adjust the $\theta$ parameters until the loss sits at the very bottom of the curve.

*Figure: Gradient descent*

The adjustment is achieved by repeatedly taking the derivative of the loss function (the slope of the tangent line) to infer a movement to the right (gradient is negative) or to the left (gradient is positive).

The size of each step is proportional to a fixed hyperparameter $\alpha$ called the learning rate. Typical values for $\alpha$ are 0.1, 0.01 and 0.001.

This approach can be articulated as:

$$\theta_j = \theta_j - \alpha\frac{\mathrm{d}l}{\mathrm{d}\theta_j}$$
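
As a toy illustration of the sign behaviour (the simple quadratic loss here is my own example, not the regression loss):

```python
# One gradient-descent step on a toy loss l(theta) = (theta - 3)^2,
# whose derivative is 2 * (theta - 3).
alpha = 0.1
theta = 0.0                   # start to the left of the minimum at theta = 3
grad = 2 * (theta - 3)        # -6: gradient is negative...
theta = theta - alpha * grad  # ...so theta moves right, to 0.6
```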

There are three types of gradient descent, contrasted in the sketch after this list:

  1. Batch gradient descent: update the $\theta$ parameters using the entire training set; slow per update but, with a suitable learning rate, guaranteed to converge to the global minimum of a convex loss like ours
  2. Stochastic gradient descent: update the $\theta$ parameters using each training example in turn; faster, but the randomness of individual examples means it may not settle exactly at the global minimum
  3. Mini batch gradient descent: update the $\theta$ parameters using a small batch of training examples; a compromise between efficiency and convergence
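
Here's one way the three variants might look side by side. This is a sketch under my own assumptions: the design matrix `X` has a leading column of ones (so it absorbs $\theta_0$), and the `gradients` helper uses the derivative we derive in the next section:

```python
import numpy as np

def gradients(theta, X, y):
    """Gradient of the half-MSE loss for a linear model."""
    return X.T @ (X @ theta - y) / len(y)

def batch_step(theta, X, y, alpha):
    # Batch: one update computed from the entire training set.
    return theta - alpha * gradients(theta, X, y)

def stochastic_pass(theta, X, y, alpha):
    # Stochastic: one update per individual training example.
    for i in np.random.permutation(len(y)):
        theta = theta - alpha * gradients(theta, X[i:i+1], y[i:i+1])
    return theta

def mini_batch_pass(theta, X, y, alpha, size=32):
    # Mini batch: one update per small batch of examples.
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), size):
        batch = idx[start:start + size]
        theta = theta - alpha * gradients(theta, X[batch], y[batch])
    return theta
```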

## Gradient descent derivatives

Imagine we have the following training data:

| $x_1$ | $x_2$ | $y$ |
| --- | --- | --- |
| 5 | 6 | 31.7 |
| 10 | 12 | 63.2 |
| 15 | 18 | 94.7 |
| 20 | 24 | 126.2 |
| ... | ... | ... |
| 35 | 42 | 220.7 |

Our model will be:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
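
In code, the visible rows of the training set might be captured like this (the elided rows are left out):

```python
import numpy as np

# Training examples from the table above; the elided rows are omitted.
X = np.array([[ 5.0,  6.0],
              [10.0, 12.0],
              [15.0, 18.0],
              [20.0, 24.0],
              [35.0, 42.0]])   # columns: x_1, x_2
y = np.array([31.7, 63.2, 94.7, 126.2, 220.7])
```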

Let's say we want to apply stochastic gradient descent.

Here are the steps we need to follow:

  1. Set $\alpha$ to 0.001 (say) and $\theta_0$, $\theta_1$ and $\theta_2$ to random values between -1 and 1
  2. Take the first training example and use gradient descent to update $\theta_0$, $\theta_1$ and $\theta_2$
  3. Repeat step 2 for each of the remaining training examples
  4. Repeat the process from step 2 another 1000 times

Let's look at the expressions and derivatives needed, starting with the loss.

When we consider just one training example at a time, as with stochastic gradient descent, $m = 1$.

As such, the loss between our hypothesis $\hat{y}$ and $y$ can be described as:

$$\text{Loss},\; l = \frac{1}{2}(\hat{y} - y)^2$$

Our gradient descent equation is:

$$\theta_j = \theta_j - \alpha\frac{\mathrm{d}l}{\mathrm{d}\theta_j}$$

To apply it, we need to determine:

$$\frac{\mathrm{d}l}{\mathrm{d}\theta_j}$$

Using the chain rule, we can say that:

$$\frac{\mathrm{d}l}{\mathrm{d}\theta_j} = \frac{\mathrm{d}l}{\mathrm{d}(\hat{y} - y)} \cdot \frac{\mathrm{d}(\hat{y} - y)}{\mathrm{d}\theta_j}$$

One of the linearity rules of derivatives tells us that when differentiating the addition or subtraction of two functions, we can differentiate the functions individually and then add or subtract them.

This means we can say:

$$\frac{\mathrm{d}l}{\mathrm{d}\theta_j} = \frac{\mathrm{d}l}{\mathrm{d}(\hat{y} - y)} \cdot \left[ \frac{\mathrm{d}\hat{y}}{\mathrm{d}\theta_j} - \frac{\mathrm{d}y}{\mathrm{d}\theta_j} \right]$$

Since $y$ does not change with respect to $\theta$:

$$\frac{\mathrm{d}y}{\mathrm{d}\theta_j} = 0$$

Therefore:

$$\frac{\mathrm{d}l}{\mathrm{d}\theta_j} = \frac{\mathrm{d}l}{\mathrm{d}(\hat{y} - y)} \cdot \frac{\mathrm{d}\hat{y}}{\mathrm{d}\theta_j}$$

Let's now consider the two parts to this derivative, starting with the first:

$$\frac{\mathrm{d}l}{\mathrm{d}(\hat{y} - y)}$$

We can reduce this down:

$$\frac{\mathrm{d}l}{\mathrm{d}(\hat{y} - y)} = 2 \cdot \frac{1}{2}(\hat{y} - y) = (\hat{y} - y)$$

Now let's consider the second part of the derivative:

$$\frac{\mathrm{d}\hat{y}}{\mathrm{d}\theta_j}$$

We can simplify this for $\theta_0$, $\theta_1$ and $\theta_2$ as follows:

$$\frac{\mathrm{d}\hat{y}}{\mathrm{d}\theta_0} = \frac{\mathrm{d}(\theta_0 + \theta_1 x_1 + \theta_2 x_2)}{\mathrm{d}\theta_0} = 1$$

$$\frac{\mathrm{d}\hat{y}}{\mathrm{d}\theta_1} = \frac{\mathrm{d}(\theta_0 + \theta_1 x_1 + \theta_2 x_2)}{\mathrm{d}\theta_1} = x_1$$

$$\frac{\mathrm{d}\hat{y}}{\mathrm{d}\theta_2} = \frac{\mathrm{d}(\theta_0 + \theta_1 x_1 + \theta_2 x_2)}{\mathrm{d}\theta_2} = x_2$$
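
We can sanity-check these three derivatives numerically with finite differences, reusing the illustrative `predict` sketch from earlier:

```python
import numpy as np

def predict(theta, x):
    return theta[0] + np.dot(theta[1:], x)

theta = np.array([0.5, 2.0, 3.0])  # illustrative values
x = np.array([5.0, 6.0])
eps = 1e-6

for j in range(3):
    bumped = theta.copy()
    bumped[j] += eps
    numeric = (predict(bumped, x) - predict(theta, x)) / eps
    print(j, round(numeric, 3))   # prints roughly 1, then x_1 = 5, then x_2 = 6
```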

Let's now combine what we've worked out.

We found that:

$$\theta_j = \theta_j - \alpha\frac{\mathrm{d}l}{\mathrm{d}\theta_j}$$

Where:

$$\frac{\mathrm{d}l}{\mathrm{d}\theta_j} = \frac{\mathrm{d}l}{\mathrm{d}(\hat{y} - y)} \cdot \frac{\mathrm{d}\hat{y}}{\mathrm{d}\theta_j}$$

And:

$$\frac{\mathrm{d}l}{\mathrm{d}(\hat{y} - y)} = (\hat{y} - y)$$

Therefore we can say:

$$\theta_0 = \theta_0 - \alpha\frac{\mathrm{d}l}{\mathrm{d}\theta_0} = \theta_0 - \alpha(\hat{y} - y)$$

$$\theta_1 = \theta_1 - \alpha\frac{\mathrm{d}l}{\mathrm{d}\theta_1} = \theta_1 - \alpha(\hat{y} - y)x_1$$

$$\theta_2 = \theta_2 - \alpha\frac{\mathrm{d}l}{\mathrm{d}\theta_2} = \theta_2 - \alpha(\hat{y} - y)x_2$$

We now have all of the expressions needed to apply gradient descent. Over 1000 iterations, we might expect to see the total loss per iteration decrease like this.

*Figure: Total loss per iteration*
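
Putting the steps and the update rules together, a minimal stochastic gradient descent sketch over the sample data might look like this (the variable names and random seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data from the table above (elided rows omitted).
X = np.array([[5.0, 6.0], [10.0, 12.0], [15.0, 18.0],
              [20.0, 24.0], [35.0, 42.0]])
y = np.array([31.7, 63.2, 94.7, 126.2, 220.7])

alpha = 0.001
theta = rng.uniform(-1, 1, size=3)          # step 1: random theta_0, theta_1, theta_2

for epoch in range(1000):                   # step 4: repeat 1000 times
    for x_i, y_i in zip(X, y):              # steps 2 and 3: one example at a time
        y_hat = theta[0] + theta[1] * x_i[0] + theta[2] * x_i[1]
        error = y_hat - y_i
        theta[0] -= alpha * error           # theta_0 = theta_0 - alpha(y_hat - y)
        theta[1] -= alpha * error * x_i[0]  # theta_1 = theta_1 - alpha(y_hat - y)x_1
        theta[2] -= alpha * error * x_i[1]  # theta_2 = theta_2 - alpha(y_hat - y)x_2

print(theta)   # learned parameters
```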

In this post, we looked at the maths, specifically the partial derivatives, involved in applying gradient descent to linear regression.

For more information, check out Andrew Ng's Stanford CS229 lecture on Linear Regression and Gradient Descent.
