1.0 | Understanding the Backpropagation Algorithm: A Gentle Introduction
In the world of artificial intelligence and machine learning, neural networks are powerful tools that mimic the human brain to solve complex problems. But how do these networks actually learn? The magic happens through a process called backpropagation.
Backpropagation, short for “backward propagation of errors,” is a method used to train neural networks. It helps the network learn from its mistakes and improve over time. Imagine teaching a child to throw a ball into a basket. Every time the child misses, you give feedback, and the child adjusts their throw. Similarly, backpropagation provides feedback to the neural network, allowing it to adjust its internal parameters and get better at its task.
Backpropagation is essential because it enables neural networks to learn from data and improve their performance over time. Without it, training deep networks would be practically impossible.
Learning
Actually Mean?
In the context of machine learning, learning
refers to the process of adjusting the internal parameters of a model in order to achieve a specific task objective.
2.0 | The Three Steps of Backpropagation
2.1 | Foward Pass
The input data is passed through the network, layer by layer, until it produces an output. This output is then compared to the actual result (the correct answer), and the difference between the two is calculated. This difference is called the error (the loss function, , defines a hypersurface, the objective is to find a satisfactory local minimum, or preferably, even the global).
The loss function contains all internal model paramters. In addition, it contains as many dimensions as internal model paramters. Subsequently, the loss function defines a hypersurface. Each combination of input parameters define an output, in this context, an error. The ideal objective is to find the set of parameters that result in the lowest error. In practice, one aims to find a good enough local minimum. The methodology used to adjust the internal model parameters is described in the backward pass.
2.2 | Backward Pass
The error is then propagated back through the network. This means that the network works backwards (applies the Chain Rule iteratively) from the output layer to the input layer, updating the weights (parameters) of each neuron to minimize the error. The adjustments are made using a method (optimizer), say, gradient descent (there are several better choices, i.e., ADAM), which helps the network find the optimal set of weights to reduce the error.
The derivative, or more generally, gradient, informs us about the direction of steepest ascent. Applying the negative sign on the gradient generates the direction of steepest descent. This aids us in descending in the direction of decreased error values (defined by the loss function). Further enhancements to the gradient descent algorithm may be added inbetween each step. The gradient descent algorithm may be viewed as a special case of many other optimizers. In stochastic gradient descent, gradient descent is just fully-batched stochastic gradient descent. In the same way, it may be viewed as a special case of other more detailed optimization algorithms, such as ADAM (Adaptive Moment Estimation) or Nestov Momemtum. None the less, all of the them aim achieve the same objective, i.e., decrease the error generated by the loss function. In practice, some optimizers have been shown to do this better than others, for example, ADAM.
2.3 | Iteration Pass
This process of forward and backward passes is repeated many times with different batches of data until the network’s performance is satisfactory. The condition of satisfaction might be a lower error bound or a set number of iterations.
3.0 | The Chain Rule
3.1 | Composite functions
The network may be viewed as a nested (composite) set of functions. There is an input vector to the neural network. Then that is feed into a neuron (a linear function) and subsequently a non-linear activation function (say ReLU, Rectified Linear Unit function) is applied. This enables the model to fit non-linear targets. The larger the model, the more complexity can it capture. This procedure of applying functions on top of functions is repeated many times.
The ReLU function is a simple, effective and efficient activation that has grown popular. The basic version of it is defined in the following manner:
- \(f(x)\) = \(\left\{ \begin{array}{ll} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{array} \right.\)
or alternatively,
- \(f(x)\) = \(max(x, 0)\).
3.2 | Example 1: Application of the Chain Rule
Given a function \(y\) = \(e^{sin(x^{2})}\), find its derivative.
The function above can be decomposed as the composite of three functions:
- \(y\) = \(f(u)\) = \(e^{u}\),
- \(u\) = \(g(v)\) = \(sin(v)\),
- \(v\) = \(h(x)\) = \(x^{2}\).
The function can then be rewritten as: \(y\) = \(f(g(h(x)))\).
The derivatives of each subfunction is:
- \(\frac{dy}{du}\) = \(f'(u)\) = \(e^{u}\),
- \(\frac{du}{dv}\) = \(g'(u)\) = \(cos(v)\),
- \(\frac{dv}{dx}\) = \(h'(x)\) = \(2x\).
The Chain Rule states that the deriative of this composite function is:
- \(\frac{dy}{dx}\) = \(\frac{dy}{du}\) \(\cdot\) \(\frac{du}{dv}\) \(\cdot\) \(\frac{dv}{dx}\)
The derivative function is therefore:
- \(\frac{dy}{dx}\) = \(e^{u}\) \(\cdot\) \(cos(v)\) \(\cdot\) \(2x\) = \(e^{sin(x^{2})}\) \(\cdot\) \(cos(x^{2})\) \(\cdot\) \(2x\)
Backpropagation
In backpropogation, the precise methodology just described above is applied. The difference is that the composite function is more complicated. Complicated in the sense of that the given composite function consists of many more subfunctions than the three in the rudimentary example above. The great feature in applying the Chain Rule in order to get the gradient of all parameters in all layers and units, is that we do not have to calculate it from scratch at each step. Rather, one usually saves the gradients of previous composite functions, and then just multiplies that result by another gradient in the next layer until the input layer is reached. This is the essence of the backpropagation algorithm. Saving the old gradients up and until the current one decreases the computational needs and inceases the algorithm efficiency by a lot, especially as the models get larger. Once the gradient for a parameter has been found, it can subsequently be updated by the choosen optimizer.
4.0 | Implementation
Using dedicated python deep learning modules such as PyTorch, one can effortlessly create deep neural networks.
First one imports the necessary modules:
import torch
import torch.nn as nn
import torch.nn.functional as F
Subsequently one may with ease define a class for a neural network. The one below contains four input features and three output features. There are two hidden layers, the first one contains eight neurons and the second layer contains nine neurons. The neural network purpose is to model the iris dataset.
# Create a Model Class that inherits nn.Module
class Model(nn.Module):
# Input layer (4 input features for each flower) -->
# Hidden Layer1 (number of neurons) -->
# H2 (n) -->
# output (3 classes of iris flowers)
def __init__(self, in_features=4, h1=8, h2=9, out_features=3):
super().__init__() # instantiate our nn.Module
self.fc1 = nn.Linear(in_features, h1)
self.fc2 = nn.Linear(h1, h2)
self.out = nn.Linear(h2, out_features)
def forward(self, x):
= F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.out(x)
x
return x
4.1 | SoftMax-Loss
For the loss function, I use one called “CrossEntropyLoss” (also known as SoftMax Loss). Measures how far off the predictions are from the true labels. It’s particularly useful for classification problems with multiple categories. The reason for it sometimes being called the SoftMax-Loss, is that it takes as input the value generated by the SoftMax function. The SoftMax function takes a vector of raw scores (logits) and converts them into probabilities.
The \(z_i\) are the raw neural network outputs, i.e., logits. They are converted into raw probabilities \(\hat{y}_i\).
- \(\hat{y}_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}.\)
It does this by exponentiating each score and then normalizing by the sum of all exponentiated scores. The Cross-Entropy Loss is calculated using the probabilities generated by the SoftMax function and the true labels. Take the negative logarithm of the predicted probability corresponding to the true class. Since we use one-hot encoding for true labels (where only the true class has a value of 1 and all others have 0), this step effectively selects the log-probability of the true class.
The SoftMax-Loss (CrossEntropy Loss) measures the difference between the true labels \(y_i\) (one-hot encoded) and the predicted probabilities \(\hat{y}_i\).
- \(\text{SoftMax-Loss} = -\sum_{i} y_i \log(\hat{y}_i).\)
Observe that the logarithmic function generates a huge loss (potentially infinite) for parameters with a small probability and a loss of zero for a correct prediction.
# Set the criterion of model to measure the error, how far off the predictions are from the data
= nn.CrossEntropyLoss() criterion
4.2 | Adam Optimizer
For the optimizer, I used the Adam optimizer, which combines the benefits of AdaGrad and RMSProp to enhance the training of neural networks. Here’s how it works:
- Initialization:
- Adam initializes two sets of values: one for the moving average of gradients (momentum) and one for the moving average of squared gradients (adaptive learning rate), both starting at zero.
- Momentum Term:
- Incorporates a momentum term to average out gradients over time, allowing larger initial steps to accelerate convergence towards optimal parameters.
- Adaptive Learning Rate:
- Adjusts the learning rate for each parameter based on the average of squared gradients. Decreases the learning rate for large updates to prevent overshooting and increases it for small updates to ensure progress.
- Bias Correction:
- Applies a correction term to address initial bias in the moving averages, ensuring accurate and stable optimization from the start.
- Flat Regions and Sharp Gradients:
- Takes larger steps in flat regions to escape poor local minima and smaller steps in regions with sharp gradients to make careful adjustments.
- Parameter Update:
- Updates parameters using the corrected moving averages, balancing exploration and exploitation for faster and more reliable convergence.
Overall, the Adam optimizer effectively adapts to the loss landscape, making it a powerful tool for a wide range of machine learning tasks.
# Choose Adam Optimizer, lr = learning rate (if error doesn't go down after a bunch of iterations (epochs), lower our learning rate)
= torch.optim.Adam(model.parameters(), lr=0.01) optimizer
5.0 | Training the Neural Network
Lastly, I train the Neural Network, i.e., find an optimal set of parameters.
= 100
epochs = []
losses for i in range(epochs):
# Go forward and get a prediction
= model.forward(X_train) # Get predicted results
y_pred
# Measure the loss/error, gonna be high at first
= criterion(y_pred, y_train) # predicted values vs the y_train
loss
# Keep Track of our losses
losses.append(loss.detach().numpy())
# print every 10 epoch
if i % 10 == 0:
print(f'Epoch: {i} and loss: {loss}')
# Do some back propagation: take the error rate of forward propagation and feed it back through the network to fine tune the weights
optimizer.zero_grad()
loss.backward() optimizer.step()
Here the epochs
variable is the number of times the entire training dataset will pass through the model. Here, it is set to 100. losses
is an empty list to store the loss values for each epoch, this enables me to track how the loss changes over time. The loop is set to run as many times as there are epochs. model.forward(X_train)
calculates the predicted values (y_pred
) based on the input features (X_train
). This is the forward pass of the neural network.
criterion
is the loss function used to measure the difference between the predicted values (y_pred
) and the actual target values (y_train
). The loss value quantifies how well or poorly the model is performing. loss.detach().numpy()
converts the loss from a PyTorch tensor to a NumPy array and adds it to the losses list. This is so as to keep track of the loss value for each epoch.
Every 10 epochs, the code prints the current epoch number and the loss value. This helps monitor the training process and see how the loss is decreasing over time.
optimizer.zero_grad()
clears the gradients of all optimized parameters. Gradients are accumulated by loss.backward()
and need to be reset to zero before the next backward pass. loss.backward()
computes the gradient of the loss with respect to the model parameters (weights and biases) using backpropagation. optimizer.step()
updates the model parameters based on the computed gradients. This step adjusts the weights to minimize the loss. The objective is to observe the error decreasing every 10 epochs. Empirically this was also the case, as highlighted below.
Epoch: 10 and loss: 1.125203251838684
Epoch: 20 and loss: 1.0097211599349976
Epoch: 30 and loss: 0.816234827041626
Epoch: 40 and loss: 0.585993230342865
Epoch: 50 and loss: 0.4003390371799469
Epoch: 60 and loss: 0.2679472267627716
Epoch: 70 and loss: 0.17963498830795288
Epoch: 80 and loss: 0.12165627628564835
Epoch: 90 and loss: 0.08606518805027008 Epoch: 100 and loss: 0.06522618979215622
The loss as a function of the current epoch is visually demonstrated below.
6.0 | Final Remarks
Backpropagation might sound complex at first, but it’s the core reason neural networks can learn from data. As a student, understanding backpropagation is like unlocking a secret code to making AI models smarter. It starts with the forward pass, where the model makes predictions. Then, in the backward pass, the model learns from its mistakes by adjusting its parameters to reduce the error. Tools like PyTorch simplify this process, allowing us to focus on designing and improving our models rather than getting bogged down in the math. By using optimizers like Adam, we can make our models learn faster and more efficiently. Through consistent practice and experimentation, we can see how these concepts work together to improve our models’ performance, making backpropagation an essential part of our AI toolkit.