Gradient Descent VS Gradient Boosting
Consider the analogy of a blindfolded hiker trying to walk down a mountain into a valley (where the bottom of the valley is zero error).
Both Neural Networks and Gradient Boosting are trying to walk down the exact same mountain using the exact same compass (calculus/gradients). The difference is entirely in how they take a step.
Here is the breakdown of Parameter Space versus Function Space.
1. Descending in "Parameter Space" (Neural Networks)
In a Neural Network, the architecture of the model is fixed. You build a network with a set number of layers and a set number of neurons. That structure does not change.
Inside those neurons are millions of numbers called weights and biases. These are the parameters. Think of parameters as millions of tiny dials and knobs on a massive soundboard.
How it descends:
- The model makes a prediction.
- It calculates the error, then the gradient of that error with respect to each weight, to figure out which direction is downhill.
- To take a step toward the bottom of the valley, the network reaches inside itself and twists the dials. It slightly changes the numbers of its weights.
- The structure remains identical, but the internal numbers (parameters) have shifted.
Because the model is traveling down the mountain strictly by adjusting these internal numbers, we say it is descending in Parameter Space.
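The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real neural network: it assumes a tiny linear model with mean squared error, and the function name `fit_linear_gd`, the learning rate, and the toy data are all made up for the example. The point is only that the set of "dials" (`w` and `b`) is fixed up front and each step changes their values, never their number.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.1, steps=500):
    """Fit y ~ X @ w + b by plain gradient descent on mean squared error."""
    n, d = X.shape
    w = np.zeros(d)   # the "dials": how many there are is fixed before training
    b = 0.0
    for _ in range(steps):
        pred = X @ w + b                # 1. make a prediction
        err = pred - y                  #    how far off each prediction is
        grad_w = 2.0 * X.T @ err / n    # 2. gradient of the MSE w.r.t. each weight
        grad_b = 2.0 * err.mean()       #    gradient of the MSE w.r.t. the bias
        w -= lr * grad_w                # 3. twist the dials against the gradient
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: a noiseless linear target with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0

w, b = fit_linear_gd(X, y)
print(w, b)
```

The model that comes out has the same shape as the model that went in: two weights and one bias. Only the numbers stored in them have moved.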
2. Descending in "Function Space" (Gradient Boosting)
In Gradient Boosting, things work completely differently.
When you build a decision tree (like our rule "If Experience ..."), the finished tree has no dials to twist: once its split rules are set, they stay fixed.
How it descends:
If you can't twist any dials, how do you fix your errors to walk down the mountain?
- The model makes a base prediction.
- It calculates the error, then the gradient of that error with respect to its current predictions, to figure out which direction is downhill.
- To take a step, it adds a completely new tree to the model.
Instead of tweaking existing numbers, XGBoost leaves the old trees exactly as they are and simply tacks a brand new math equation (Tree 2) onto the end of the line. Then it adds Tree 3, Tree 4, and so on.
Because the model is traveling down the mountain by adding entirely new mathematical formulas (trees) rather than adjusting internal numbers, we say it is descending in Function Space.
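The same loop can be sketched for boosting. This is a simplified illustration under stated assumptions, not how XGBoost is actually implemented: each "tree" is a one-split stump on a single feature, the loss is squared error, and the helper names (`fit_stump`, `predict_stump`, `boost`), the learning rate, and the toy data are invented for the example. Notice that nothing already in `stumps` is ever edited; every step appends a new function.

```python
import numpy as np

def fit_stump(x, target):
    """Find the one-split rule "x <= t" that best fits target (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:   # skip the max so the right side is never empty
        left = x <= t
        lv, rv = target[left].mean(), target[~left].mean()
        sse = ((target - np.where(left, lv, rv)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]               # (threshold, left value, right value)

def predict_stump(stump, x):
    t, lv, rv = stump
    return np.where(x <= t, lv, rv)

def boost(x, y, n_trees=50, lr=0.3):
    base = y.mean()                    # the base prediction
    pred = np.full_like(y, base)
    stumps, losses = [], []
    for _ in range(n_trees):
        residual = y - pred            # which direction is downhill
        stump = fit_stump(x, residual) # a brand new tree aimed at the residual
        stumps.append(stump)           # earlier stumps are never modified
        pred = pred + lr * predict_stump(stump, x)
        losses.append(np.mean((y - pred) ** 2))
    return base, stumps, losses

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

base, stumps, losses = boost(x, y)
print(len(stumps), losses[0], losses[-1])
```

Here the model grows with every step: after 50 rounds it is the base value plus 50 separate stumps, and the training error shrinks as each one is added.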
3. The Mathematical Equivalence
Here is where the magic happens and why they are mathematically equivalent.
In standard gradient descent (as in a neural network), if you want to lower your error, you calculate your gradient (the slope) and subtract a small multiple of it, set by the learning rate, from your current weights. You always move in the opposite direction of the gradient to go downhill.
In Gradient Boosting, remember how we calculated the "Residual" by doing Actual minus Predicted?
In calculus terms, for squared-error loss that residual is the negative gradient of the loss with respect to the current prediction (up to a constant factor). By building a new tree that specifically predicts the residual, and adding that tree to our model, we are taking the same kind of step as a neural network subtracting its gradient!
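- Neural Network Update: $w_{\text{new}} = w_{\text{old}} - \eta \, \nabla_w L$ (step the weights against the gradient of the loss $L$, scaled by the learning rate $\eta$)
- XGBoost Update: $F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$, where the new tree $h_m$ is fit to approximate the negative gradient $-\partial L / \partial F_{m-1}(x)$ (the residual, for squared error)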
They are the exact same mechanism in disguise. One turns a dial to subtract the error; the other builds a wooden tree to subtract the error.
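The claim that the residual is the negative gradient can be checked numerically. The sketch below assumes the loss is half the squared error, $L = \tfrac{1}{2}\sum_i (y_i - F_i)^2$, which makes the match exact rather than off by a factor of 2. The numbers in `y` and `pred` are hypothetical, and `numerical_gradient` is a small finite-difference helper written for this example.

```python
import numpy as np

def half_squared_error(y, pred):
    """L = 1/2 * sum of squared errors."""
    return 0.5 * np.sum((y - pred) ** 2)

def numerical_gradient(loss, y, pred, eps=1e-6):
    """Central-difference estimate of dL/d(pred_i) for every prediction."""
    grad = np.zeros_like(pred)
    for i in range(pred.size):
        step = np.zeros_like(pred)
        step[i] = eps
        grad[i] = (loss(y, pred + step) - loss(y, pred - step)) / (2 * eps)
    return grad

# Hypothetical actual values and current model predictions.
y = np.array([50.0, 62.0, 71.0])
pred = np.array([55.0, 60.0, 75.0])

residual = y - pred                                  # Actual minus Predicted
grad = numerical_gradient(half_squared_error, y, pred)

print(residual)
print(-grad)
```

The two printed arrays agree up to floating-point noise. With a different loss (absolute error, log loss) the negative gradient is no longer the plain residual, which is why boosting libraries talk about fitting trees to gradients rather than to residuals.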