--- title: "Gradient Descent and Back Propogation" author: "Prof. Eric A. Suess" format: revealjs --- ## What is Gradient Descent? Each layer in the neural network computes > output = relu(dot(W, input) + b) The updating of *W* and *b* is the learning of the network. ## Training loop 1. Draw a batch of training samples x and corresponding targets y. 2. Run the network on x (a step called the *forward pass*) to obtain predictions y_pred. 3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y. 4. Update all weights of the network in a way that slightly reduces the loss on this batch (a step called the *backward pass*) ## Training loop > y_pred = dot(W, x) > loss_value = loss(y_pred, y) Define > loss_value = f(W) take derivatives or what is called *computing the gradient*. Update in the opposite direction of the gradient. ## What is Stochastic Gradient Descent (SGD)? Training loop based on random samples. - Steps ## What is Stochastic Gradient Descent (SGD)? 1. Draw a batch of training samples x and corresponding targets y. (mini-batches) 2. Run the network on x to obtain predictions y_pred. 3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y. 4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass). 5. Move the parameters a little in the opposite direction from the gradient, for example, $W = W - (step * gradient)$. ## What is Stochastic Gradient Descent (SGD)? - Step 5: Thus reducing the loss on the batch a bit. ## What is Back Propogation? If the neural network had three layers $$f(W1, W2, W3) = a(W1, b(W2, c(W3)))$$ then we can use the Chain Rule from Calculus to take derivatives. ## 3Blue1Brown The [3Blue1Brown](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw) [Neural Networks](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) series is very nice for visualizing SGD and Back Propagation. ## Optimizer The optimizer specifies the exact way in which the gradient of the loss will be used to update parameters.