Gradient Descent and Backpropagation
What is Gradient Descent?
Each layer in the neural network computes
> output = relu(dot(W, input) + b)
Updating \(W\) and \(b\) is what the learning of the network consists of.
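As a concrete illustration, here is a minimal NumPy sketch of that layer computation; the layer sizes, random weights, and sample input are arbitrary choices for the example.

```python
import numpy as np

def relu(x):
    # Element-wise rectified linear unit: max(x, 0)
    return np.maximum(x, 0.0)

# Arbitrary example sizes: 3 output units, 2 input features
W = np.random.randn(3, 2)    # weight matrix
b = np.zeros(3)              # bias vector
x = np.array([0.5, -1.2])    # one input sample

output = relu(np.dot(W, x) + b)   # the layer's computation
```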
Training loop
- Draw a batch of training samples \(x\) and corresponding targets \(y\).
- Run the network on each row of \(x\) (a step called the forward pass) to obtain predictions \(y_{pred}\).
- Compute the loss of the network on the batch, a measure of the mismatch between \(y_{pred}\) and \(y\).
- Update all the weights of the network in a way that slightly reduces the loss on this batch.
For a single input \(x\) with target \(y\), the prediction and the loss are
> y_pred = dot(W, x)
> loss_value = loss(y_pred, y)
With \(x\) and \(y\) fixed, the loss can be viewed as a function of the weights alone:
> loss_value = f(W)
Taking the derivative of \(f\) with respect to \(W\) is what is called computing the gradient. To reduce the loss, update \(W\) by a small step in the opposite direction of the gradient, as in the sketch below.
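A minimal sketch of one such update, assuming a plain linear model y_pred = dot(W, x) and a mean-squared-error loss (both chosen only so the gradient is easy to write by hand):

```python
import numpy as np

step = 0.01                   # learning rate
W = np.random.randn(1, 3)     # weights of the linear model
x = np.random.randn(3, 8)     # 8 samples with 3 features each (one sample per column)
y = np.random.randn(1, 8)     # corresponding targets

y_pred = np.dot(W, x)                     # forward pass
loss_value = np.mean((y_pred - y) ** 2)   # loss_value = f(W), a mean-squared error

# Gradient of the mean-squared error with respect to W, derived by hand for this model
gradient = 2.0 / x.shape[1] * np.dot(y_pred - y, x.T)

W = W - step * gradient   # move W a small step against the gradient
```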
What is Stochastic Gradient Descent (SGD)?
SGD is the training loop above, run on small batches of samples drawn at random from the training data.
- Draw a batch of training samples \(x\) and corresponding targets \(y\). (mini-batches)
- Run the network on each row of \(x\) to obtain predictions \(y_{pred}\). (forward pass)
- Compute the loss of the network on the batch, a measure of the mismatch between \(y_{pred}\) and \(y\).
- Compute the gradient of the loss with regard to the network’s parameters (backward pass).
- Move the parameters a little in the opposite direction from the gradient, for example \(W = W - \text{step} \times \text{gradient}\), thus reducing the loss on the batch a bit (see the sketch after this list).
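Putting the five steps together, here is a minimal runnable sketch of mini-batch SGD, again assuming a linear model with a mean-squared-error loss; the synthetic data, batch size, learning rate, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, num_samples = 3, 1000
X = rng.normal(size=(num_features, num_samples))   # training inputs, one sample per column
true_W = np.array([[2.0, -1.0, 0.5]])
Y = np.dot(true_W, X)                              # synthetic targets for the example

W = rng.normal(size=(1, num_features))             # parameters to learn
step = 0.05                                        # learning rate
batch_size = 32

for iteration in range(500):
    # 1. Draw a random mini-batch of samples and their targets
    idx = rng.integers(0, num_samples, size=batch_size)
    x, y = X[:, idx], Y[:, idx]
    # 2. Forward pass: run the model on the batch to get predictions
    y_pred = np.dot(W, x)
    # 3. Compute the loss on the batch (mean squared error)
    loss_value = np.mean((y_pred - y) ** 2)
    # 4. Backward pass: gradient of the loss with respect to W
    gradient = 2.0 / batch_size * np.dot(y_pred - y, x.T)
    # 5. Move W a little in the opposite direction of the gradient
    W = W - step * gradient
```

Repeated over many random batches, these small steps gradually move \(W\) toward values with low loss.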
What is Backpropagation?
If the neural network has three layers, so that
\[f(W_1, W_2, W_3) = a(W_1, b(W_2, c(W_3))),\]
then we can apply the chain rule from calculus to compute the derivative of the loss with respect to each weight matrix, working backward from the output layer; this application of the chain rule to a network's chain of layers is what is called backpropagation. The step factor in the update \(W = W - \text{step} \times \text{gradient}\) is the learning rate.
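As a tiny concrete example of the chain rule at work, the sketch below runs a forward and a backward pass through a two-layer network (a relu layer followed by a linear layer) with a squared-error loss, computing each gradient by hand; the shapes and data are arbitrary, and in practice a framework derives these same expressions automatically.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Tiny two-layer network: y_pred = dot(W2, relu(dot(W1, x)))
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(3, 1))   # one input sample
y = np.array([[1.0]])         # its target

# Forward pass, keeping intermediate values for the backward pass
z1 = np.dot(W1, x)        # pre-activation of layer 1
h1 = relu(z1)             # activation of layer 1
y_pred = np.dot(W2, h1)   # output of layer 2
loss = np.sum((y_pred - y) ** 2)

# Backward pass: the chain rule applied layer by layer, from the loss back to W1
d_y_pred = 2.0 * (y_pred - y)     # d loss / d y_pred
d_W2 = np.dot(d_y_pred, h1.T)     # d loss / d W2
d_h1 = np.dot(W2.T, d_y_pred)     # d loss / d h1
d_z1 = d_h1 * (z1 > 0)            # d loss / d z1 (derivative of relu)
d_W1 = np.dot(d_z1, x.T)          # d loss / d W1

# One gradient-descent update; the step factor is the learning rate
step = 0.01
W1 = W1 - step * d_W1
W2 = W2 - step * d_W2
```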