Pioneers of Sequential Data Modeling: RNNs and LSTMs

A Recurrent Neural Network (RNN) is a neural network that processes sequential data step by step using a hidden state. The hidden state summarizes information from previous steps and is loosely analogous to the state in a hidden Markov model.

  • At each timestep $t$, the RNN takes two inputs: the current data point $x_t$ and the hidden state $h_{t-1}$ from the previous step.

  • It produces two outputs: a new hidden state $h_t$ and a prediction $y_t$.

$$
h_t = \sigma(W_{xh}\cdot x_t + W_{hh}\cdot h_{t-1} + b_h)
$$

$$
y_t = W_{hy} h_t + b_y
$$
  • $W_{xh}, W_{hh}, W_{hy}$ are weight matrices.

  • $b_h, b_y$ are bias terms.

  • At the first timestep there is no previous hidden state, so $h_{t-1}$ is filled with an initial vector $\text{IV}$ of zeros or small random values (a minimal code sketch of one RNN step follows below).
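
The following is a minimal NumPy sketch of a single vanilla-RNN step implementing the equations above. The dimensions, the $\tanh$ choice for $\sigma$, and the random initialization are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla-RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # new hidden state
    y_t = W_hy @ h_t + b_y                            # prediction at this step
    return h_t, y_t

# Illustrative sizes (assumed): 4-dim input, 8-dim hidden state, 3-dim output.
input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)                 # initial vector for h_0 (zeros)
sequence = [rng.normal(size=input_dim) for _ in range(5)]
for x in sequence:                       # process the sequence step by step
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```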

This basic architecture, known as a "vanilla RNN" (the simple recurrent network popularized by Elman, 1990, building on earlier recurrent designs such as the Hopfield network, 1982), suffers from problems such as vanishing/exploding gradients and short-term memory limitations. To address these issues, researchers have developed various RNN variants.

$$
y_t = \sigma(W_{y} h_{t} + U_{y} x_{t}^{\text{fused}})
$$

An FRNN (Fully Recurrent Neural Network) uses a "fused input" that concatenates the input and hidden state into a single vector, so that all layer states are captured in one operation. Fusing the corresponding matrix multiplications also reduces GPU kernel launches and memory-bandwidth usage, as sketched below.
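
A minimal NumPy sketch of the fused-input idea follows: the input and previous hidden state are concatenated and multiplied by one combined weight matrix instead of two separate ones. The function name, dimensions, and initialization are assumptions for illustration.

```python
import numpy as np

def fused_rnn_step(x_t, h_prev, W_fused, b):
    """One recurrent step using a fused input [x_t; h_{t-1}] and a single weight matrix."""
    x_fused = np.concatenate([x_t, h_prev])   # fused input: concatenation of input and hidden state
    return np.tanh(W_fused @ x_fused + b)     # one matmul instead of W_xh @ x_t + W_hh @ h_prev

input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
# Equivalent to stacking W_xh and W_hh side by side: shape (hidden, input + hidden).
W_fused = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
x = rng.normal(size=input_dim)
h = fused_rnn_step(x, h, W_fused, b)
```

Mathematically this is equivalent to keeping $W_{xh}$ and $W_{hh}$ separate; the practical benefit is that a single larger matrix multiplication maps onto fewer GPU kernel launches.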


RNNs were the first artificial neural networks capable of processing variable-length sequences and capturing temporal dependencies. Until 2017 they dominated machine translation, text generation, and time-series forecasting; since being superseded by Transformers, they have largely become a legacy architecture.

LSTM: Long Short-Term Memory Networks

LSTMs were introduced by Hochreiter and Schmidhuber (1997) to solve the RNN's vanishing gradient problem. They use gating mechanisms to control information flow, as captured by the equations and the code sketch below.

  • Hidden State $h_t$ is calculated by: $h_t = o_t \circ \tanh(C_t)$

    • $C_t$ is the cell state update, which combines the previous cell state (scaled by the Forget Gate) with the Candidate Memory (scaled by the Input Gate): $C_t = f_t \circ C_{t-1} + i_t \circ \bar{C}_t$

    • $\bar{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$, which is called the Candidate Memory.

  • Forget Gate determines what to remove from memory: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

  • Input Gate determines what new information to store: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

  • Output Gate determines what to output based on memory: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
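
Below is a minimal NumPy sketch of one LSTM cell step following the gate equations above; the concatenated $[h_{t-1}, x_t]$ parameterization, dimensions, and initialization are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C):
    """One LSTM step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)               # forget gate: what to drop from the cell state
    i_t = sigmoid(W_i @ z + b_i)               # input gate: what new information to store
    o_t = sigmoid(W_o @ z + b_o)               # output gate: what to expose as the hidden state
    C_bar = np.tanh(W_C @ z + b_C)             # candidate memory
    C_t = f_t * C_prev + i_t * C_bar           # cell state update
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

# Illustrative sizes (assumed): 4-dim input, 8-dim hidden/cell state.
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
shape = (hidden_dim, hidden_dim + input_dim)
W_f, W_i, W_o, W_C = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f, b_i, b_o, b_C = (np.zeros(hidden_dim) for _ in range(4))

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in [rng.normal(size=input_dim) for _ in range(5)]:
    h, C = lstm_step(x, h, C, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C)
```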

Despite these gates, LSTMs still face significant challenges in capturing sequential structure effectively. Their fixed architecture may not adapt well to all kinds of sequential dependencies, particularly long-range hierarchical patterns. They also scale poorly: adding new gates increases the parameter count without necessarily improving performance.

To address these limitations, more flexible architectures have been developed:

  • NTMs (Neural Turing Machines) and DNCs (Differentiable Neural Computers) use external memory banks with soft attention for read/write operations. However, their high computational costs make them impractical for most modern applications.

  • The Attention Mechanism and the Transformer architecture have superseded recurrence by implementing self-attention, which better captures long-range dependencies without relying on fixed gates (see the self-attention sketch after this list).

    • Some Transformer variants employ a different type of "gating" called a gating network, as used in Mixture-of-Experts layers. This differs from LSTM gating: while LSTMs gate memory content, a gating network selects which expert sub-network the model should use.

  • Highway Networks and adaptive gating use learnable gating functions to control information flow across layers, which is more flexible than the LSTM's fixed gates.

  • LTCs (Liquid Time-Constant Networks) use neural ODEs to model continuous-time dynamics, which enables input-dependent gating.

  • Memory-Augmented Neural Networks (MANNs) combine a sequence model (such as a Transformer) with dynamic memory slots for explicit pattern storage.
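
As a minimal illustration of the self-attention mechanism mentioned above, the NumPy sketch below computes single-head scaled dot-product attention over a short sequence; the projection matrices and dimensions are assumptions for the example, not a full Transformer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every position scores every other position
    weights = softmax(scores, axis=-1)         # attention weights, one row per query position
    return weights @ V                         # weighted sum of values

# Illustrative sizes (assumed): sequence length 6, model width 8.
T, d_model = 6, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)         # shape (T, d_model)
```

Because every position attends directly to every other position, long-range dependencies do not have to survive a long chain of gated recurrent updates.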

However, the concepts of "learnable gating" and "learnable long-term memory" remain relevant in some Transformer architectures, which also process sequential data.


Not only LSTMs and RNNs but almost all sequential-data models rely heavily on training remedies such as batch/layer normalization, careful weight initialization, and gradient clipping. This makes their architectures rather rigid and hard to scale across different data types.
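
As an example of one such remedy, the NumPy sketch below clips gradients by their global norm; the threshold and gradient shapes are arbitrary values chosen for illustration.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)     # shrink all gradients by the same factor
        grads = [g * scale for g in grads]
    return grads

# Example: two exploding gradient arrays are scaled back to the threshold norm.
rng = np.random.default_rng(0)
grads = [rng.normal(scale=100.0, size=(8, 4)), rng.normal(scale=100.0, size=(8,))]
clipped = clip_gradients(grads, max_norm=5.0)
```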
