Understanding the mathematics behind backpropagation requires a solid foundation in calculus and linear algebra. Here's a breakdown of the key concepts:
1. Forward Propagation:
- This phase involves passing the input data through the network layer by layer. Each layer performs a weighted sum of the previous layer's outputs and applies a non-linear activation function like sigmoid or ReLU.
- The mathematical representation involves matrices and vectors:
  - Inputs: Represented by a vector `x`.
  - Weights: Represented by a matrix `W`.
  - Biases: Represented by a vector `b`.
  - Activations: Represented by a vector `h`.
  - Output: Represented by a vector `y`.
- The forward pass equation for a single layer is: `h = f(Wx + b)`, where `f` is the activation function, applied to each component.
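The single-layer equation can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names `matvec` and `layer_forward`, the choice of sigmoid for `f`, and the example values of `W`, `b`, and `x` are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic activation, used here as one possible choice of f."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def layer_forward(W, b, x, f=sigmoid):
    """Compute h = f(Wx + b), applying f to each component."""
    z = [wx_i + b_i for wx_i, b_i in zip(matvec(W, x), b)]
    return [f(z_i) for z_i in z]

# Illustrative values (not from the text): a layer with 2 inputs and 2 units.
W = [[0.5, -0.5], [1.0, 1.0]]
b = [0.0, -1.0]
x = [1.0, 1.0]

h = layer_forward(W, b, x)
print(h)  # first component is sigmoid(0) = 0.5, second is sigmoid(1), about 0.731
```

Stacking layers amounts to feeding the returned `h` into the next call as its `x`.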
2. Loss Function:
- This function measures the difference between the network's predicted output and the actual target value.
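- Commonly used loss functions include:
  - Mean Squared Error (MSE): `MSE = (1/n) Σ_i (y_i - t_i)^2`, where `t` is the target value and `n` is the number of output components. For a single output this reduces to `(y - t)^2`.
  - Cross-Entropy: `Cross-Entropy = -t log(y) - (1-t) log(1-y)`. This is the binary form, which assumes `t` is 0 or 1 and `y` lies strictly between 0 and 1.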
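Both losses are short enough to write out directly. The sketch below assumes plain Python lists for vectors. The function names and the small clipping constant `eps` (added only to keep `log` away from zero) are illustrative choices, not part of the definitions above.

```python
import math

def mse(y, t):
    """Mean squared error between prediction vector y and target vector t."""
    return sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, t)) / len(y)

def binary_cross_entropy(y, t, eps=1e-12):
    """Binary cross-entropy for one prediction y in (0, 1) and target t in {0, 1}."""
    y = min(max(y, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
    return -t * math.log(y) - (1.0 - t) * math.log(1.0 - y)

mse_value = mse([0.8, 0.2], [1.0, 0.0])     # (0.2^2 + 0.2^2) / 2, about 0.04
bce_value = binary_cross_entropy(0.8, 1.0)  # -log(0.8), about 0.223
print(mse_value, bce_value)
```

Both values shrink toward zero as the prediction approaches the target, which is the property gradient-based training relies on.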
3. Backpropagation:
- This phase calculates the gradient of the loss function with respect to each weight and bias in the network.
- The gradient tells us how much each weight and bias contributes to the overall error and guides us in adjusting them to minimize the loss.
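- Backpropagation proceeds layer by layer, from the output back toward the input, and involves:
  - Calculating the output error: `δ = (y - t)` for the output layer. This form holds, for example, for a sigmoid output with cross-entropy loss, or a linear output with the loss `(1/2)(y - t)^2`; in general the output error also includes a factor `f'(z)`.
  - Propagating the error back through the network: `δ = (W^T δ_next) ⊙ f'(z)` for hidden layers, where `W` and `δ_next` are the weights and error of the following layer, `⊙` is element-wise multiplication, `z` is the weighted sum of inputs, and `f'` is the derivative of the activation function.
  - Updating the weights and biases: `ΔW = -η δ h^T` and `Δb = -η δ`, where `η` is the learning rate and `h` is the vector of activations feeding into the layer (the input `x` for the first layer).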
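The three steps can be checked on a tiny network. The sketch below assumes one hidden layer, sigmoid activations throughout, and cross-entropy loss, so the output error is exactly `y - t`. All names (`affine`, `outer`, `forward`, `backward`) and the numeric values are hypothetical. The last lines compare one backpropagated gradient against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute Wv + b for a matrix W given as a list of rows."""
    return [sum(w * v_j for w, v_j in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[u_i * v_j for v_j in v] for u_i in u]

def forward(W1, b1, W2, b2, x):
    h = [sigmoid(z) for z in affine(W1, b1, x)]  # hidden activations
    y = [sigmoid(z) for z in affine(W2, b2, h)]  # network output
    return h, y

def cross_entropy(y, t):
    return sum(-t_i * math.log(y_i) - (1 - t_i) * math.log(1 - y_i)
               for y_i, t_i in zip(y, t))

def backward(W2, x, h, y, t):
    """Return gradients (dW1, db1, dW2, db2) of the loss."""
    # Output error: delta = y - t (sigmoid output with cross-entropy).
    delta2 = [y_i - t_i for y_i, t_i in zip(y, t)]
    # Hidden error: (W2^T delta2) times f'(z) element-wise; for sigmoid, f'(z) = h(1 - h).
    back = [sum(W2[i][j] * delta2[i] for i in range(len(delta2)))
            for j in range(len(h))]
    delta1 = [s * h_j * (1.0 - h_j) for s, h_j in zip(back, h)]
    # Weight gradient is delta times the transposed input of that layer.
    return outer(delta1, x), delta1, outer(delta2, h), delta2

# Illustrative parameters: 2 inputs, 2 hidden units, 1 output.
W1 = [[0.1, -0.2], [0.4, 0.3]]
b1 = [0.0, 0.1]
W2 = [[0.5, -0.3]]
b2 = [0.2]
x, t = [1.0, 2.0], [1.0]

h, y = forward(W1, b1, W2, b2, x)
dW1, db1, dW2, db2 = backward(W2, x, h, y, t)

def loss_with(W1_variant):
    _, y_variant = forward(W1_variant, b1, W2, b2, x)
    return cross_entropy(y_variant, t)

# Central finite difference on a single weight, W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (loss_with(W1_plus) - loss_with(W1_minus)) / (2 * eps)
print(dW1[0][1], numeric)  # the two estimates should agree closely
```

Multiplying each gradient by `-η` gives the updates `ΔW` and `Δb` from the list above.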
4. Gradient Descent:
- This optimization algorithm uses the calculated gradients to update the weights and biases in the direction that minimizes the loss function.
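- Different variants of gradient descent exist, each with its own advantages and disadvantages. Batch, stochastic, and mini-batch gradient descent, for example, differ in how many training examples contribute to each update, trading gradient accuracy against cost per step.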
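As a rough illustration of the update loop, the sketch below runs full-batch gradient descent on a single sigmoid unit with cross-entropy loss. The toy dataset (logical OR), the learning rate of 0.5, the 200 iterations, and all function names are assumptions chosen for the example, not recommendations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)

def mean_loss(w, b, data):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for x, t in data:
        y = predict(w, b, x)
        total += -t * math.log(y) - (1 - t) * math.log(1 - y)
    return total / len(data)

def gradient_step(w, b, data, eta):
    """One full-batch update: parameters move against the averaged gradient."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, t in data:
        delta = predict(w, b, x) - t  # output error y - t
        for j, x_j in enumerate(x):
            grad_w[j] += delta * x_j / len(data)
        grad_b += delta / len(data)
    new_w = [w_j - eta * g_j for w_j, g_j in zip(w, grad_w)]
    return new_w, b - eta * grad_b

# Toy dataset (an assumption for this example): the logical OR function.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

w, b = [0.0, 0.0], 0.0
initial_loss = mean_loss(w, b, data)
for _ in range(200):
    w, b = gradient_step(w, b, data, eta=0.5)
final_loss = mean_loss(w, b, data)
print(initial_loss, final_loss)  # the loss should be lower after training
```

Replacing the loop over all of `data` with a single randomly chosen example per step would turn this into stochastic gradient descent.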
For further explanation and practical examples, you can read the book.