1.0 | Introduction & History
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture that was introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997. The primary motivation behind LSTM was to address the limitations of traditional RNNs, particularly the vanishing gradient problem. This problem made it difficult for RNNs to learn and retain information over long sequences of data.
LSTMs introduced memory cells and gating mechanisms, which allow the network to selectively remember or forget information, making them especially powerful for tasks that involve sequential data, like time series prediction, natural language processing, and more.
2.0 | Terminology & Overview
Before diving into the details, let’s clarify some key terms associated with LSTMs:
- Cell State: The memory of the network, which carries information across time steps.
- Hidden State: The output of the cell, which is passed to the next time step.
- Gates: Mechanisms that control the flow of information in and out of the memory cell. There are three types of gates: forget gate, input gate, and output gate.
In an LSTM, information is processed through a series of these gates, which decide what information should be kept, updated, or discarded.
3.0 | Cell State
The cell state is the core concept of LSTMs. It acts as a conveyor belt that runs through the entire sequence of data. The cell state is modified by adding or removing information, which is regulated by the gates. Because of this design, LSTMs are able to maintain and propagate relevant information over long sequences, solving the long-term dependency problem that traditional RNNs struggle with.
4.0 | Gates
4.1 | Forget Gate
The forget gate is responsible for deciding what information should be discarded from the cell state. It takes the hidden state from the previous time step and the current input and passes them through a sigmoid activation function.
Why Sigmoid? The sigmoid function outputs a value between 0 and 1, which can be interpreted as a probability or a soft gating mechanism. A value close to 0 means the gate will “forget” most of the information, while a value close to 1 means it will retain most of it. The sigmoid function allows the network to decide how much information should be forgotten in a smooth and differentiable manner, which is crucial for backpropagation and learning.
Mathematically, it can be represented as: \[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
- \(f_t\) is the forget gate’s output,
- \(W_f\) and \(b_f\) are the weight matrix and bias for the forget gate,
- \(\sigma\) is the sigmoid function,
- \(h_{t-1}\) is the hidden state from the previous time step,
- \(x_t\) is the input at the current time step.
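To make the notation concrete, here is a minimal NumPy sketch of the forget gate for a single time step. The function name, the use of one-dimensional vectors, and the weight shape (hidden, hidden + input) are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    # Smooth, differentiable squashing to (0, 1), as discussed above.
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    # f_t = sigma(W_f . [h_{t-1}, x_t] + b_f)
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    return sigmoid(W_f @ concat + b_f)      # one value in (0, 1) per cell unit
```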
4.2 | Input Gate
The input gate controls what new information should be added to the cell state. It involves two components: a sigmoid function that determines which candidate values will be incorporated, and a tanh function that generates these candidate values.
Generate Candidate Values: The candidate values, \(\tilde{C}_t\), are potential new information that could be added to the cell state. These values are computed by applying the tanh function to a linear combination of the current input \(x_t\) and the previous hidden state \(h_{t-1}\).
Why Tanh? The tanh function scales the candidate values between -1 and 1, allowing the network to consider both positive and negative updates to the cell state. This scaling helps control the magnitude of the new information, ensuring that updates are balanced and not excessively large.
\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \]
Filter with Input Gate: The input gate layer uses a sigmoid function to determine how much of each candidate value should be added to the cell state. This function produces values between 0 and 1, which act as weights for the candidate values.
Why Sigmoid? The sigmoid function in the input gate outputs values that regulate the extent to which each candidate value is incorporated into the cell state. A value close to 1 means the candidate value will be strongly considered, while a value close to 0 means it will have little to no impact. This gating mechanism ensures that only relevant new information is added to the cell state.
\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
Update Cell State: The final cell state \(C_t\) is updated by combining the previous cell state \(C_{t-1}\) (after applying the forget gate) with the new candidate values \(\tilde{C}_t\) (scaled by the input gate). The update equation is:
\[ C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \]
Here:
- \(f_t\) is the forget gate’s output, deciding how much of the old cell state \(C_{t-1}\) is retained,
- \(i_t\) is the input gate’s output, controlling how much of each candidate value is added to the cell state,
- \(\tilde{C}_t\) represents the new candidate values generated by the tanh layer,
- \(W_i\), \(W_C\), \(b_i\), and \(b_C\) are the weight matrices and biases for the input gate and candidate value generation.
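As a hedged sketch of the whole input-gate step, the NumPy snippet below computes the candidate values, the input gate, and the resulting cell-state update; the function name and the assumption that `f_t` has already been produced by the forget gate are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_update(h_prev, x_t, c_prev, f_t, W_i, b_i, W_C, b_C):
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    c_tilde = np.tanh(W_C @ concat + b_C)    # candidate values in (-1, 1)
    i_t = sigmoid(W_i @ concat + b_i)        # how much of each candidate to admit
    c_t = f_t * c_prev + i_t * c_tilde       # C_t = f_t * C_{t-1} + i_t * ~C_t
    return c_t
```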
4.3 | Output Gate
Compute the Output Gate: The output gate uses a sigmoid function to determine how much of the cell state should influence the new hidden state. The output is a gating value between 0 and 1.
\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \]
Calculate the New Hidden State: The new hidden state \(h_t\) is calculated by applying the tanh function to the updated cell state \(C_t\) and then scaling it by the output gate \(o_t\). This scaling controls how much of the cell state affects the hidden state.
\[ h_t = o_t \cdot \tanh(C_t) \]
- \(o_t\) is the output gate’s value, determining the influence of the cell state on the hidden state,
- \(C_t\) is the updated cell state,
- \(\sigma\) is the sigmoid function,
- \(W_o\) and \(b_o\) are the weight matrix and bias for the output gate.
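The following NumPy sketch mirrors the two output-gate equations; as before, the function name and vector shapes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(h_prev, x_t, c_t, W_o, b_o):
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    o_t = sigmoid(W_o @ concat + b_o)       # how much of the cell state to expose
    h_t = o_t * np.tanh(c_t)                # h_t = o_t * tanh(C_t)
    return h_t
```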
4.4 | LSTM: Step-by-Step
To summarize the LSTM operation, here’s a step-by-step overview:
Step 1: The LSTM receives the input vector \(x_t\) and the previous state \((h_{t-1}, C_{t-1})\).
Step 2: The forget gate \(f_t\) decides what information to discard from the cell state \(C_{t-1}\) using:
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
Step 3: The input gate \(i_t\) decides what new information to add. It consists of:
- A sigmoid layer that determines which values to update:
\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
- A tanh layer that generates candidate values \(\tilde{C}_t\):
\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \]
Step 4: Update the cell state \(C_{t-1}\) to the new cell state \(C_t\):
\[ C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \]
Step 5: Compute the output \(h_t\) based on the cell state \(C_t\). First, apply the output gate \(o_t\) to filter the cell state:
\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \]
Then, compute the new hidden state:
\[ h_t = o_t \cdot \tanh(C_t) \]
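Putting Steps 1–5 together, here is a self-contained NumPy sketch of a single LSTM forward step, followed by a tiny usage example with random weights. The parameter layout (one weight matrix and bias per gate, each matrix of shape (hidden, hidden + input)) is an assumption chosen to match the equations above, not the layout used by any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Steps 1-5 above.

    p is a dict holding W_f, b_f (forget gate), W_i, b_i (input gate),
    W_C, b_C (candidate values) and W_o, b_o (output gate).
    """
    z = np.concatenate([h_prev, x_t])            # Step 1: [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # Step 2: forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # Step 3: input gate
    c_tilde = np.tanh(p["W_C"] @ z + p["b_C"])   # Step 3: candidate values
    c_t = f_t * c_prev + i_t * c_tilde           # Step 4: new cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # Step 5: output gate
    h_t = o_t * np.tanh(c_t)                     # Step 5: new hidden state
    return h_t, c_t

# Tiny usage example: hidden size 4, input size 3, sequence of 5 steps.
rng = np.random.default_rng(0)
hidden, inp = 4, 3
p = {k: 0.1 * rng.standard_normal((hidden, hidden + inp)) for k in ("W_f", "W_i", "W_C", "W_o")}
p.update({k: np.zeros(hidden) for k in ("b_f", "b_i", "b_C", "b_o")})
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.standard_normal((5, inp)):
    h, c = lstm_step(x_t, h, c, p)
```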
5.0 | Use Cases
LSTMs have a wide range of applications, especially in areas where sequential data is prevalent:
- Natural Language Processing (NLP): LSTMs are used in tasks like language modeling, text generation, translation, and sentiment analysis.
- Time Series Prediction: LSTMs can predict future values in a sequence of time-dependent data, such as stock prices or weather forecasting.
- Speech Recognition: LSTMs help in recognizing and generating speech patterns.
- Anomaly Detection: LSTMs are used to detect unusual patterns in data sequences, which is useful in areas like fraud detection.
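As a rough illustration of the time-series use case, the sketch below wraps PyTorch's built-in `nn.LSTM` in a small module that maps a window of past values to a one-step-ahead prediction; the `Forecaster` name, the hidden size, and the tensor shapes are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    """Predict the next value of a univariate series from a window of past values."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                  # x: (batch, time, 1)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, hidden), the final hidden state
        return self.head(h_n[-1])          # one prediction per sequence

model = Forecaster()
window = torch.randn(8, 50, 1)             # 8 sequences of 50 time steps each
prediction = model(window)                 # shape: (8, 1)
```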
6.0 | Final Remarks
LSTMs represent a significant advancement in neural network architecture by enabling models to capture long-term dependencies in sequential data. Their gating mechanism largely mitigates the vanishing gradient problem that plagued earlier RNNs, opening up new possibilities in fields requiring the understanding and generation of sequences. While newer models like Transformers have gained popularity, LSTMs remain a powerful tool for many applications.
In conclusion, understanding the internal workings of LSTMs is crucial for anyone looking to work with sequential data, and their versatility ensures that they will remain a valuable part of the machine learning toolkit.
References
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.