Gradient Descent and Backpropagation
What is Gradient Descent?
Each layer in the neural network computes
> output = relu(dot(W, input) + b)
Updating W and b is what constitutes the learning of the network.
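As a rough sketch (NumPy only; the helper names `relu` and `dense_forward` and the shapes are illustrative assumptions, not a framework's API), this per-layer computation looks like:

```python
import numpy as np

def relu(x):
    # Element-wise rectified linear unit: max(x, 0).
    return np.maximum(x, 0.0)

def dense_forward(W, b, inputs):
    # One dense layer: output = relu(dot(W, input) + b).
    # Assumed shapes: W is (output_dim, input_dim), b is (output_dim,),
    # inputs is (input_dim,).
    return relu(np.dot(W, inputs) + b)

W = np.random.randn(4, 3) * 0.1   # randomly initialized weights
b = np.zeros(4)                   # biases, initialized to zero
x = np.random.randn(3)            # one input sample
print(dense_forward(W, b, x))     # layer output, shape (4,)
```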
Training loop
- Draw a batch of training samples x and corresponding targets y.
- Run the network on x (a step called the forward pass) to obtain predictions y_pred.
- Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
- Update all weights of the network in a way that slightly reduces the loss on this batch (the gradient needed for this update comes from the backward pass); see the sketch after this list.
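A minimal sketch of one pass through these four steps, assuming a toy linear model, a squared-error loss, and a fixed step size (none of which are specified above):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=3)                       # the model's weights

# 1. Draw a batch of training samples x and corresponding targets y
#    (here: a random toy batch standing in for real data).
x = rng.normal(size=(8, 3))
y = x @ np.array([1.0, -2.0, 0.5])

# 2. Forward pass: run the network on x to obtain predictions y_pred.
y_pred = x @ W

# 3. Compute the loss, a measure of the mismatch between y_pred and y.
loss_value = np.mean((y_pred - y) ** 2)

# 4. Update the weights to slightly reduce the loss on this batch,
#    using the gradient computed in the backward pass.
gradient = 2 * x.T @ (y_pred - y) / len(x)
W = W - 0.1 * gradient
```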
Training loop
> y_pred = dot(W, x)
> loss_value = loss(y_pred, y)
Holding x and y fixed, the loss can be viewed as a function of W alone:
> loss_value = f(W)
We can then take the derivative of f with respect to W, which is called computing the gradient.
Updating W in the direction opposite to the gradient decreases the loss.
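For concreteness, assuming a single output and a squared-error loss (neither is fixed above), the function, its gradient, and the update step are:

\[
f(W) = \bigl(\operatorname{dot}(W, x) - y\bigr)^2, \qquad
\nabla_W f(W) = 2\,\bigl(\operatorname{dot}(W, x) - y\bigr)\, x, \qquad
W \leftarrow W - \text{step} \cdot \nabla_W f(W)
\]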
What is Stochastic Gradient Descent (SGD)?
It is the same training loop, but run on small batches of randomly drawn samples rather than on the full dataset at once.
- Draw a batch of training samples x and corresponding targets y. (mini-batches)
- Run the network on x to obtain predictions y_pred.
- Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
- Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
- Move the parameters a little in the opposite direction from the gradient, for example, \(W = W - (step * gradient)\).
Step 5, moving the parameters in the opposite direction from the gradient, is what reduces the loss on the batch a bit.
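Putting the five steps together, a minimal mini-batch SGD sketch on a toy linear model (the data, batch size, and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(size=(256, 3))           # 256 samples, 3 features
true_W = np.array([1.0, -2.0, 0.5])
y_train = x_train @ true_W                    # targets generated from a known W

W = np.zeros(3)                               # initial weights
step, batch_size = 0.1, 32

for iteration in range(200):
    # 1. Draw a random mini-batch of samples x and targets y.
    idx = rng.choice(len(x_train), size=batch_size, replace=False)
    x, y = x_train[idx], y_train[idx]
    # 2. Run the network on x to obtain predictions y_pred.
    y_pred = x @ W
    # 3. Compute the loss (here: mean squared error) on the batch.
    loss_value = np.mean((y_pred - y) ** 2)
    # 4. Backward pass: gradient of the loss with regard to W.
    gradient = 2 * x.T @ (y_pred - y) / batch_size
    # 5. Move W a little in the opposite direction from the gradient.
    W = W - step * gradient

print(W)   # after training, W should be close to true_W
```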
What is Backpropagation?
If the neural network has three layers, its output can be written as a composition of functions:
\[f(W_1, W_2, W_3) = a(W_1, b(W_2, c(W_3)))\]
Then we can use the chain rule from calculus to take the derivative with respect to each layer's weights, working backward from the output; this is backpropagation.
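A minimal scalar sketch of that chain rule in action (the concrete choices of a, b, and c below are illustrative assumptions, not from the text):

```python
# Three simple functions playing the roles of a, b, and c in
# f(W1, W2, W3) = a(W1, b(W2, c(W3))).
def c(w3):    return w3 ** 2        # c(w3) = w3^2
def b(w2, u): return w2 * u         # b(w2, u) = w2 * u
def a(w1, v): return w1 + v         # a(w1, v) = w1 + v

w1, w2, w3 = 1.5, -2.0, 0.5

# Forward pass: compute and remember the intermediate values.
u = c(w3)                # u = w3^2
v = b(w2, u)             # v = w2 * u
f = a(w1, v)             # f = w1 + v

# Backward pass: apply the chain rule from the output back to each parameter.
df_dw1 = 1.0             # df/dw1 = da/dw1
df_dv  = 1.0             # df/dv  = da/dv
df_dw2 = df_dv * u       # df/dw2 = df/dv * db/dw2
df_du  = df_dv * w2      # df/du  = df/dv * db/du
df_dw3 = df_du * 2 * w3  # df/dw3 = df/du * dc/dw3

print(f, df_dw1, df_dw2, df_dw3)   # 1.0, 1.0, 0.25, -2.0
```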
Optimizer
The optimizer specifies the exact way in which the gradient of the loss will be used to update the network's parameters; plain SGD, SGD with momentum, RMSprop, and Adam are common variants that apply different update rules to the same gradient.
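As a rough illustration (the step and momentum values are assumptions, and these helpers are not any particular library's API), two such update rules applied to the same gradient:

```python
import numpy as np

def sgd_update(W, gradient, step=0.01):
    # Vanilla gradient descent: move straight against the gradient.
    return W - step * gradient

def momentum_update(W, gradient, velocity, step=0.01, momentum=0.9):
    # SGD with momentum: accumulate a velocity that smooths successive updates.
    velocity = momentum * velocity - step * gradient
    return W + velocity, velocity

W = np.array([0.5, -1.0])
g = np.array([0.2, -0.4])
v = np.zeros_like(W)

print(sgd_update(W, g))              # plain SGD step
W_new, v = momentum_update(W, g, v)
print(W_new, v)                      # momentum step and the updated velocity
```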