Pioneers of Sequential Data Modeling: RNNs and LSTMs
A Recurrent Neural Network (RNN) is a neural network that processes sequential data step by step using a hidden state. This hidden state captures information from previous steps and plays a role loosely analogous to the state of a Markov model.
At each timestep $t$, the RNN takes two inputs: the current data point $x_t$ and the hidden state $h_{t-1}$ from the previous step.
It also produces two outputs: a new hidden state $h_t$ and a prediction $y_t$:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$y_t = W_{hy} h_t + b_y$$

$W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices.
$b_h$ and $b_y$ represent bias terms.
When $t = 0$ (the initial state of the model), $h_0$ is filled with an initial vector of zeros or random numbers.
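As a concrete illustration, here is a minimal NumPy sketch of a single vanilla RNN step. The function and variable names (`rnn_step`, `W_xh`, `W_hh`, `W_hy`) simply mirror the equations above and are illustrative, not taken from any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN step: returns the new hidden state and the prediction."""
    # New hidden state mixes the current input with the previous hidden state.
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # Prediction is a linear readout of the hidden state.
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Example: input size 3, hidden size 4, output size 2, h_0 initialized to zeros.
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b_h, b_y = np.zeros(4), np.zeros(2)
h = np.zeros(4)  # initial hidden state h_0
for x in rng.normal(size=(5, 3)):  # a toy sequence of 5 timesteps
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```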
This basic architecture, known as a "vanilla RNN" (its recurrent roots trace back to Hopfield, 1982, while the form above is the simple recurrent network of Elman, 1990), suffers from problems such as vanishing/exploding gradients and short-term memory limitations. To address these issues, researchers have developed various RNN variants.
An FRNN (Fully Recurrent Neural Network) connects the outputs of all units back to the inputs of all units, so each step operates on a "fused input" that concatenates the input and the hidden state. In practice, this fused formulation also lets implementations merge matrix multiplications, reducing GPU kernel launches and memory bandwidth usage.
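A minimal sketch of the fused-input idea, assuming a single weight matrix `W` applied to the concatenation $[x_t; h_{t-1}]$ so that one matrix multiply replaces two:

```python
import numpy as np

def fused_rnn_step(x_t, h_prev, W, b):
    """RNN step with the input and hidden state concatenated ("fused") into one
    vector, so a single matrix multiply replaces two separate ones."""
    z = np.concatenate([x_t, h_prev])   # fused input [x_t; h_{t-1}]
    return np.tanh(W @ z + b)           # W has shape (hidden, input + hidden)
```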
LSTM: Long Short-Term Memory Networks
LSTMs were introduced by Hochreiter and Schmidhuber (1997) to solve the vanishing gradient problem of RNNs. They use gating mechanisms to control information flow.
The hidden state is calculated by:

$$h_t = o_t \odot \tanh(C_t)$$

$C_t$ is the cell state update, which carries the previous cell state through the forget gate and adds newly admitted information:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

$\tilde{C}_t$ is called the candidate memory:

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

The forget gate determines what to remove from memory:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

The input gate determines what new information to store:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

The output gate determines what to output based on memory:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
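Putting the equations together, here is a minimal NumPy sketch of one LSTM step. The names (`lstm_step`, `W_f`, `W_i`, `W_c`, `W_o`) follow the equations above and are illustrative; real implementations usually fuse the four gate matrices into a single multiplication for speed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step implementing the gate equations above.
    Each weight matrix acts on the concatenation [h_{t-1}; x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to drop from memory
    i_t = sigmoid(W_i @ z + b_i)          # input gate: what new information to store
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde    # cell state update
    o_t = sigmoid(W_o @ z + b_o)          # output gate: what to expose
    h_t = o_t * np.tanh(c_t)              # hidden state
    return h_t, c_t
```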
LSTMs still face significant challenges in modeling some sequential data effectively. Their fixed gate architecture may not adapt well to varied sequential dependencies, particularly long-range hierarchical patterns. They also lack scalability: adding new gates requires more parameters, which does not necessarily improve performance.
To address these limitations, more flexible architectures have been developed:
NTMs (Neural Turing Machines) and DNCs (Differentiable Neural Computers) use external memory banks with soft attention for read/write operations. However, their high computational costs make them impractical for most modern applications.
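A highly simplified sketch of a content-based soft read, which is only one ingredient of NTM/DNC addressing (the full schemes also include location-based shifts, usage tracking, and write heads); the function name and `beta` sharpness parameter are illustrative:

```python
import numpy as np

def soft_read(memory, key, beta=1.0):
    """Content-based soft read: attend over memory rows by similarity to a key
    and return a weighted sum, so the read stays differentiable."""
    # Cosine similarity between the key and each memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    w /= w.sum()          # soft attention weights over memory locations
    return w @ memory     # differentiable read vector
```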
The attention mechanism and the Transformer architecture have largely superseded recurrence: self-attention lets every position attend directly to every other position, capturing long-range dependencies without relying on fixed gates.
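A minimal NumPy sketch of single-head, unmasked scaled dot-product self-attention; `W_q`, `W_k`, and `W_v` are assumed projection matrices:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model).
    Every position can attend to every other position, regardless of distance."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # context-mixed representations
```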
Some Transformer variants employ a different type of "gating" called a gating network, the core of Mixture-of-Experts (MoE) models. This differs from LSTM gating: while LSTMs gate what flows into memory, a gating network decides which expert sub-network the model should use for each input.
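A toy sketch of a gating network: a softmax over expert scores weights the outputs of several expert networks. (Real MoE layers typically route each token to only the top-k experts for efficiency; the names here are illustrative.)

```python
import numpy as np

def mixture_of_experts(x, W_gate, experts):
    """Minimal Mixture-of-Experts sketch: a gating network scores the experts
    and the output is a gate-weighted combination of the expert outputs."""
    logits = W_gate @ x
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()    # softmax over experts
    return sum(g * expert(x) for g, expert in zip(gate, experts))
```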
Highway Networks and adaptive gating use learnable gating functions to control information flow across layers, which is more flexible than the LSTM's fixed gates.
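A minimal sketch of a highway layer, where a learnable transform gate T(x) mixes the transformed signal H(x) with the untouched input x (weight names are illustrative):

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    """Highway layer: a learnable transform gate T decides, per dimension,
    how much of the transformed signal H(x) vs. the raw input x passes through."""
    H = np.tanh(W_h @ x + b_h)                    # candidate transformation
    T = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))    # transform gate in (0, 1)
    return T * H + (1.0 - T) * x                  # gated mix of transform and carry
```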
Liquid Time-Constant (LTC) networks use neural ODEs to model continuous-time dynamics, which enables input-dependent gating.
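A heavily simplified, forward-Euler sketch of the liquid time-constant idea, in which the nonlinearity f(x, h) makes the effective decay rate depend on the input; the actual LTC formulation (Hasani et al.) uses a more careful ODE solver and parameterization, and the names here are illustrative:

```python
import numpy as np

def ltc_euler_step(h, x, W_x, W_h, b, tau, A, dt=0.1):
    """Forward-Euler step of a liquid time-constant style update: f(x, h)
    modulates both the decay rate and the drive toward A, so the effective
    time constant of each unit depends on the current input."""
    f = np.tanh(W_x @ x + W_h @ h + b)
    dh = -(1.0 / tau + f) * h + f * A    # input-dependent continuous-time dynamics
    return h + dt * dh
```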
Memory-Augmented Neural Networks (MANNs) pair a controller network (in recent variants, often a Transformer) with dynamic memory slots for explicit pattern storage.
However, the concepts of "learnable gating" and "learnable long-term memory" for sequential data remain relevant and resurface in some Transformer architectures.