Gradient Descent and Backpropagation
What is Gradient Descent?
Each layer in the neural network computes
> output = relu(dot(W, input) + b)
Updating \(W\) and \(b\) is what the learning of the network consists of.
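As a concrete illustration, here is a minimal NumPy sketch of that layer computation; the layer sizes, random weights, and sample input are arbitrary choices for the example.

```python
import numpy as np

def relu(x):
    # Element-wise rectified linear unit: max(x, 0)
    return np.maximum(x, 0.0)

# Arbitrary example sizes: 3 output units, 2 input features
W = np.random.randn(3, 2)    # weight matrix
b = np.zeros(3)              # bias vector
x = np.array([0.5, -1.2])    # one input sample

output = relu(np.dot(W, x) + b)   # the layer's computation
```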
Training loop
- Draw a batch of training samples \(x\) and corresponding targets \(y\).
- Run the network on each row of \(x\) (a step called the forward pass) to obtain predictions \(y_{pred}\).
- Compute the loss of the network on the batch, a measure of the mismatch between \(y_{pred}\) and \(y\).
- Update all the weights of the network in a way that slightly reduces the loss on this batch.
For a single input \(x\) with target \(y\), the prediction and the loss are
> y_pred = dot(W, x)
> loss_value = loss(y_pred, y)
With \(x\) and \(y\) fixed, the loss can be viewed as a function of the weights alone:
> loss_value = f(W)
Taking the derivative of \(f\) with respect to \(W\) is what is called computing the gradient. To reduce the loss, update \(W\) by a small step in the opposite direction of the gradient, as in the sketch below.
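A minimal sketch of one such update, assuming a plain linear model y_pred = dot(W, x) and a mean-squared-error loss (both chosen only so the gradient is easy to write by hand):

```python
import numpy as np

step = 0.01                   # learning rate
W = np.random.randn(1, 3)     # weights of the linear model
x = np.random.randn(3, 8)     # 8 samples with 3 features each (one sample per column)
y = np.random.randn(1, 8)     # corresponding targets

y_pred = np.dot(W, x)                     # forward pass
loss_value = np.mean((y_pred - y) ** 2)   # loss_value = f(W), a mean-squared error

# Gradient of the mean-squared error with respect to W, derived by hand for this model
gradient = 2.0 / x.shape[1] * np.dot(y_pred - y, x.T)

W = W - step * gradient   # move W a small step against the gradient
```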
What is Stochastic Gradient Descent (SGD)?
SGD is the training loop above, run on small batches of samples drawn at random from the training data.
- Draw a batch of training samples \(x\) and corresponding targets \(y\). (mini-batches)
- Run the network on each row of \(x\) to obtain predictions \(y_{pred}\). (forward pass)
- Compute the loss of the network on the batch, a measure of the mismatch between \(y_{pred}\) and \(y\).
- Compute the gradient of the loss with regard to the network’s parameters (backward pass).
- Move the parameters a little in the opposite direction from the gradient, for example \(W = W - \text{step} \times \text{gradient}\), thus reducing the loss on the batch a bit (see the sketch after this list).
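Putting the five steps together, here is a minimal runnable sketch of mini-batch SGD, again assuming a linear model with a mean-squared-error loss; the synthetic data, batch size, learning rate, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, num_samples = 3, 1000
X = rng.normal(size=(num_features, num_samples))   # training inputs, one sample per column
true_W = np.array([[2.0, -1.0, 0.5]])
Y = np.dot(true_W, X)                              # synthetic targets for the example

W = rng.normal(size=(1, num_features))             # parameters to learn
step = 0.05                                        # learning rate
batch_size = 32

for iteration in range(500):
    # 1. Draw a random mini-batch of samples and their targets
    idx = rng.integers(0, num_samples, size=batch_size)
    x, y = X[:, idx], Y[:, idx]
    # 2. Forward pass: run the model on the batch to get predictions
    y_pred = np.dot(W, x)
    # 3. Compute the loss on the batch (mean squared error)
    loss_value = np.mean((y_pred - y) ** 2)
    # 4. Backward pass: gradient of the loss with respect to W
    gradient = 2.0 / batch_size * np.dot(y_pred - y, x.T)
    # 5. Move W a little in the opposite direction of the gradient
    W = W - step * gradient
```

Repeated over many random batches, these small steps gradually move \(W\) toward values with low loss.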
What is Backpropagation?
If the neural network has three layers, so that
\[f(W_1, W_2, W_3) = a(W_1, b(W_2, c(W_3))),\]
then we can apply the chain rule from calculus to compute the derivative of the loss with respect to each weight matrix, working backward from the output layer; this application of the chain rule to a network's chain of layers is what is called backpropagation. The step factor in the update \(W = W - \text{step} \times \text{gradient}\) is the learning rate.
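As a tiny concrete example of the chain rule at work, the sketch below runs a forward and a backward pass through a two-layer network (a relu layer followed by a linear layer) with a squared-error loss, computing each gradient by hand; the shapes and data are arbitrary, and in practice a framework derives these same expressions automatically.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Tiny two-layer network: y_pred = dot(W2, relu(dot(W1, x)))
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(3, 1))   # one input sample
y = np.array([[1.0]])         # its target

# Forward pass, keeping intermediate values for the backward pass
z1 = np.dot(W1, x)        # pre-activation of layer 1
h1 = relu(z1)             # activation of layer 1
y_pred = np.dot(W2, h1)   # output of layer 2
loss = np.sum((y_pred - y) ** 2)

# Backward pass: the chain rule applied layer by layer, from the loss back to W1
d_y_pred = 2.0 * (y_pred - y)     # d loss / d y_pred
d_W2 = np.dot(d_y_pred, h1.T)     # d loss / d W2
d_h1 = np.dot(W2.T, d_y_pred)     # d loss / d h1
d_z1 = d_h1 * (z1 > 0)            # d loss / d z1 (derivative of relu)
d_W1 = np.dot(d_z1, x.T)          # d loss / d W1

# One gradient-descent update; the step factor is the learning rate
step = 0.01
W1 = W1 - step * d_W1
W2 = W2 - step * d_W2
```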