Pioneers of Sequential Data Processing: RNNs and LSTMs

A Recurrent Neural Network (RNN) is a neural network that processes sequential data step by step using a hidden state. The hidden state summarizes information from previous steps, giving the recurrence a Markov-like structure: the next state depends only on the current input and the previous hidden state.

  • At each timestep $t$, the RNN takes two inputs: the current data point $x_t$ and the hidden state $h_{t-1}$ from the previous step.

  • It produces two outputs: a new hidden state $h_t$ and a prediction $y_t$.

$$h_t = \sigma(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
  • $W_{xh}, W_{hh}, W_{hy}$ are weight matrices.

  • $b_h, b_y$ are bias terms.

  • At the first timestep there is no previous hidden state, so an initial vector (of zeros or small random values) is used in place of $h_{t-1}$.
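A minimal NumPy sketch of this forward pass, using tanh as the nonlinearity $\sigma$; the dimensions and the helper name `rnn_step` are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla RNN step: returns the new hidden state and the prediction."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # sigma = tanh here
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3          # toy sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)                             # initial hidden state (zeros)
sequence = rng.normal(size=(5, input_dim))           # toy length-5 sequence
for x in sequence:
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```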

This basic architecture, known as a "vanilla RNN" (with roots in early recurrent models such as Hopfield, 1982, and Elman, 1990), suffers from problems such as vanishing/exploding gradients and short-term memory limitations. To address these issues, researchers have developed various RNN variants.

$$y_t = \sigma(W_y h_t + U_y x_t^{\text{fused}})$$

An FRNN (Fully Recurrent Neural Network) uses a "fused input" $x_t^{\text{fused}}$ that concatenates the input and hidden states, so the output above can see all layer states. In practice, the fused computation also reduces GPU kernel launches and memory bandwidth usage.
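A rough sketch of the fused-input idea, assuming $x_t^{\text{fused}}$ is simply the concatenation of the current input and hidden state; the split weights $U_x$ and $U_h$ below are illustrative names used to show the equivalence, not part of the original formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
input_dim, hidden_dim, output_dim = 4, 8, 3

x_t = rng.normal(size=input_dim)
h_t = rng.normal(size=hidden_dim)
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
U_x = rng.normal(scale=0.1, size=(output_dim, input_dim))
U_h = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

# "Fused input": concatenate x_t and h_t once and fold U_x, U_h into a single
# matrix U_y, so one matmul replaces two (fewer kernel launches on a GPU).
x_fused = np.concatenate([x_t, h_t])
U_y = np.concatenate([U_x, U_h], axis=1)

y_fused = sigmoid(W_y @ h_t + U_y @ x_fused)          # y_t = sigma(W_y h_t + U_y x_t^fused)
y_split = sigmoid(W_y @ h_t + U_x @ x_t + U_h @ h_t)  # mathematically equivalent split form
assert np.allclose(y_fused, y_split)
```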

RNNs were the first artificial neural networks capable of processing variable-length sequences and capturing temporal dependencies. Until 2017 they dominated machine translation, text generation, and time-series forecasting; since being superseded by Transformers, they are now largely a legacy architecture.

LSTM (Long Short-Term Memory) Networks

LSTMs were introduced by Hochreiter and Schmidhuber (1997) to solve the RNN's vanishing gradient problem. They use gating mechanisms to control information flow.

  • The hidden state $h_t$ is calculated as: $h_t = o_t \circ \tanh(C_t)$

    • $C_t$ is the cell state, updated from the previous cell state through the forget and input gates: $C_t = f_t \circ C_{t-1} + i_t \circ \bar{C}_t$

    • $\bar{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ is called the candidate memory.

  • The forget gate determines what to remove from memory: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

  • The input gate determines what new information to store: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

  • The output gate determines what to output based on memory: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
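A minimal NumPy sketch of one LSTM cell step following the gate equations above; the dimensions and helper names are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step operating on the concatenated vector [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_bar = np.tanh(W_C @ z + b_C)        # candidate memory
    c_t = f_t * c_prev + i_t * c_bar      # cell state update
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

rng = np.random.default_rng(2)
input_dim, hidden_dim = 4, 8
shape = (hidden_dim, hidden_dim + input_dim)
W_f, W_i, W_C, W_o = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # toy length-5 sequence
    h, c = lstm_step(x, h, c, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
```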

LSTMs still face significant challenges in modeling sequential data effectively. Their fixed gate architecture may not adapt well to all kinds of sequential dependencies, particularly long-range hierarchical patterns. They also scale poorly: adding new gates increases the parameter count without necessarily improving performance.

To address these limitations, researchers have developed more flexible architectures:

  • NTMs (Neural Turing Machines) and DNCs (Differentiable Neural Computers) use external memory banks with soft attention for read/write operations. However, their high computational costs make them impractical for most modern applications.

  • The attention mechanism and the Transformer architecture have superseded recurrence with self-attention, which captures long-range dependencies without relying on fixed gates.

    • Some Transformers also employ a different type of "gating" called a gating network, as in Mixture of Experts (MoE). This differs from LSTM gating: LSTMs gate what enters memory, whereas an MoE gate selects which expert network should process each input.

  • Highway Networks and adaptive gating use learnable gating functions to control information flow across layers, which is more flexible than LSTM's fixed gates (see the sketch after this list).

  • LTCs (Liquid Time-Constant Networks) use neural ODEs to model continuous-time dynamics, which enables input-dependent gating.

  • Memory-Augmented Neural Networks (MANNs) combine a controller network (recurrent or, more recently, Transformer-based) with dynamic memory slots for explicit pattern storage.
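To make the "learnable gating" idea concrete, here is a rough NumPy sketch of a single Highway-style layer, where a learned transform gate $T(x)$ blends a candidate transformation with the unchanged input; all names, shapes, and initial values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Highway layer: output = T(x) * H(x) + (1 - T(x)) * x."""
    H = np.tanh(W_H @ x + b_H)        # candidate transformation
    T = sigmoid(W_T @ x + b_T)        # learnable transform gate in (0, 1)
    return T * H + (1.0 - T) * x      # gate blends new and carried-over information

rng = np.random.default_rng(3)
dim = 8
x = rng.normal(size=dim)
W_H = rng.normal(scale=0.1, size=(dim, dim))
W_T = rng.normal(scale=0.1, size=(dim, dim))
b_H = np.zeros(dim)
b_T = np.full(dim, -1.0)              # negative bias: start close to "carry the input through"
y = highway_layer(x, W_H, b_H, W_T, b_T)
```

Unlike an LSTM's fixed set of gates, the gate here is just another learned function of the layer input, so the network decides per feature how much to transform and how much to pass through unchanged.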

However, the concepts of "learnable gating" and "learnable long-term memory" for sequential data remain relevant in several Transformer architectures.

Not only LSTMs and RNNs but almost all sequential data processing models rely heavily on training fixes such as batch/layer normalization, careful weight initialization, and gradient clipping. This dependence makes their architectures rigid and hard to scale across different data types.
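As an example of one such fix, gradient clipping by global norm can be sketched as follows; this is a generic NumPy version, and the function name and threshold are illustrative rather than tied to any specific framework.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / (global_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

# Toy usage: tame exploding gradients before a parameter update.
rng = np.random.default_rng(4)
grads = [rng.normal(scale=10.0, size=(8, 4)), rng.normal(scale=10.0, size=8)]
clipped = clip_by_global_norm(grads, max_norm=5.0)
```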
