Gradient Boost for Regression


Abzal Seitkaziyev

Posted on December 28, 2020



Gradient Tree Boosting

Gradient Tree Boosting is an ensemble algorithm that can be applied to both classification and regression problems. Here I will describe how gradient boosting works for regression. Gradient boosting uses decision trees as weak learners: each tree is fit to the pseudo-residuals of the current model, and every tree contributes to the final prediction with the same 'weight', scaled by the learning rate.
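
As a quick illustration of this setup, here is a minimal sketch using scikit-learn's GradientBoostingRegressor; the synthetic dataset and the parameter values are just placeholders for illustration.

```python
# A minimal sketch of gradient boosting for regression with scikit-learn.
# The synthetic data and parameter values are placeholders for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 shallow trees, each fit to the pseudo-residuals of the current model,
# all contributing with the same learning rate.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("Test R^2:", model.score(X_test, y_test))
```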

There are a few key components in the Gradient Boosting algorithm:

a) Loss function - the natural choice for regression is 'Least Squares'.
(Note: this is the same squared error as in linear regression, but it is commonly written with a 1/2 coefficient, i.e. 1/2*(True - Predicted)^2, so that the pseudo-residuals, the negative gradients of the loss, are exactly the ordinary residuals (True - Predicted); a small numeric sketch follows below.)

b) Hyperparameters:
learning rate (used to scale each weak learner's prediction) and parameters related to the weak learners themselves (e.g. number of weak learners, maximum depth of each tree, etc.)
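
Here is the small numeric sketch mentioned in the note above (my own illustration, not part of the original description): with the 1/2 coefficient, the pseudo-residual is just the plain residual.

```python
# Squared-error loss for gradient boosting regression.
# With the 1/2 factor, the negative gradient (pseudo-residual) equals
# the ordinary residual, True - Predicted.
import numpy as np

def least_squares_loss(y_true, y_pred):
    return 0.5 * (y_true - y_pred) ** 2

def pseudo_residual(y_true, y_pred):
    # -dL/dF = y_true - y_pred
    return y_true - y_pred

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.full_like(y_true, y_true.mean())   # initial prediction: the target mean
print(pseudo_residual(y_true, y_pred))         # [-2.  0.  2.] -> targets for the first tree
```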

Algorithm steps

Let's dive into the details of the algorithm itself. Below is the mathematical description of gradient boosted trees.

[Figure: the formal gradient boosting algorithm (source linked in the references below).]
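
In place of the figure, the generic formulation (roughly following the Wikipedia article cited in the references) can be sketched as follows, with M weak learners h_m and learning rate ν:

```latex
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)

\text{for } m = 1, \dots, M:
\quad r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \quad i = 1, \dots, n
\quad \text{fit a weak learner } h_m(x) \text{ to the pseudo-residuals } \{(x_i, r_{im})\}
\quad \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)
\quad F_m(x) = F_{m-1}(x) + \nu\, \gamma_m\, h_m(x)

F(x) = F_M(x)
```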

Below is a simplified explanation of the above steps when using the 'Least Squares' loss function L(y, F(x)); a code sketch of these steps follows the list:

1) Get the initial prediction; for least squares it is the mean of the target column.
2) M is the number of weak learners. We do the steps below for each decision tree (e.g. m = 1 to m = 100 when n_estimators=100):

a. Compute the residuals (True - Predicted) with respect to the current model's predictions; the residuals left after the previous trees become the target for the next decision tree.
note: for the first tree, Residual = True - Target_Mean (the prediction from step 1)
b. Fit a decision tree to the residuals
c. Compute the output value for each leaf in the tree (in this case the mean of the residuals in that leaf)
d. Update the predicted values: new_prediction = previous_prediction + learning_rate * output_value.
e. Repeat steps 2a to 2d until all weak learners are constructed.

3) Compute the final prediction F(x)
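
Here is the code sketch promised above: a minimal from-scratch version of steps 1-3, assuming scikit-learn's DecisionTreeRegressor as the weak learner. The helper names and parameter values are my own and only for illustration; note that for squared error a regression tree's prediction is already the mean residual of its leaf, so it serves directly as the leaf output value.

```python
# Minimal from-scratch sketch of the steps above (squared-error loss).
# Names and parameter values are illustrative, not a reference implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # 1) Initial prediction: the mean of the target column.
    f0 = y.mean()
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):           # 2) one weak learner per iteration
        residuals = y - prediction          # 2a) residuals w.r.t. current model
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)              # 2b) fit a decision tree to the residuals
        # 2c) leaf output values: the mean residual in each leaf, which is
        #     exactly what the fitted regression tree predicts.
        output = tree.predict(X)
        # 2d) update the predictions, scaled by the learning rate.
        prediction = prediction + learning_rate * output
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # 3) Final prediction F(x): initial mean plus the scaled tree outputs.
    prediction = np.full(X.shape[0], f0)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction

# Usage on toy data:
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)
f0, trees = gradient_boost_fit(X, y)
print(gradient_boost_predict(X[:5], f0, trees))
```

With the same settings, the predictions from this sketch should closely track scikit-learn's GradientBoostingRegressor, although the library adds many refinements (other loss functions, subsampling, early stopping, etc.).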

Summary

From the short description above we can see that there are some similarities with AdaBoost (like the iterative nature: the next tree depends on the previous predictions) and differences (each tree contributes with the same learning rate, vs. different weights per learner in AdaBoost).

References:

  1. Video material
  2. scikit-learn
  3. Wikipedia
