Pioneers of Sequential Data Modeling: RNNs and LSTMs

A Recurrent Neural Network (RNN) is a neural network that processes sequential data step by step using a hidden state. The hidden state summarizes information from previous steps and is loosely analogous to the state in a hidden Markov model.

  • At each timestep $t$, the RNN takes two inputs: the current data point $x_t$ and the hidden state $h_{t-1}$ from the previous step.

  • It produces two outputs: a new hidden state $h_t$ and a prediction $y_t$.

$$
h_t = \sigma(W_{xh}\cdot x_t + W_{hh}\cdot h_{t-1} + b_h)
$$

$$
y_t = W_{hy} h_t + b_y
$$
  • $W_{xh}, W_{hh}, W_{hy}$ are weight matrices.

  • $b_h, b_y$ are bias terms.

  • At the first timestep there is no previous hidden state, so $h_{t-1}$ is filled with an initial vector $\text{IV}$ of zeros or small random values (a minimal code sketch of one RNN step follows below).
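
The following is a minimal NumPy sketch of a single vanilla-RNN step implementing the equations above. The dimensions, the $\tanh$ choice for $\sigma$, and the random initialization are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One vanilla-RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # new hidden state
    y_t = W_hy @ h_t + b_y                            # prediction at this step
    return h_t, y_t

# Illustrative sizes (assumed): 4-dim input, 8-dim hidden state, 3-dim output.
input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)                 # initial vector for h_0 (zeros)
sequence = [rng.normal(size=input_dim) for _ in range(5)]
for x in sequence:                       # process the sequence step by step
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```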

This basic architecture, known as a "vanilla RNN" (the simple recurrent network popularized by Elman, 1990, building on earlier recurrent designs such as the Hopfield network, 1982), suffers from problems such as vanishing/exploding gradients and short-term memory limitations. To address these issues, researchers have developed various RNN variants.

$$
y_t = \sigma(W_{y} h_{t} + U_{y} x_{t}^{\text{fused}})
$$

An FRNN (Fully Recurrent Neural Network) uses a "fused input" that concatenates the input and hidden state into a single vector, so that all layer states are captured in one operation. Fusing the corresponding matrix multiplications also reduces GPU kernel launches and memory-bandwidth usage, as sketched below.
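
A minimal NumPy sketch of the fused-input idea follows: the input and previous hidden state are concatenated and multiplied by one combined weight matrix instead of two separate ones. The function name, dimensions, and initialization are assumptions for illustration.

```python
import numpy as np

def fused_rnn_step(x_t, h_prev, W_fused, b):
    """One recurrent step using a fused input [x_t; h_{t-1}] and a single weight matrix."""
    x_fused = np.concatenate([x_t, h_prev])   # fused input: concatenation of input and hidden state
    return np.tanh(W_fused @ x_fused + b)     # one matmul instead of W_xh @ x_t + W_hh @ h_prev

input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
# Equivalent to stacking W_xh and W_hh side by side: shape (hidden, input + hidden).
W_fused = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
x = rng.normal(size=input_dim)
h = fused_rnn_step(x, h, W_fused, b)
```

Mathematically this is equivalent to keeping $W_{xh}$ and $W_{hh}$ separate; the practical benefit is that a single larger matrix multiplication maps onto fewer GPU kernel launches.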


RNNs were the first artificial neural networks capable of processing variable-length sequences and capturing temporal dependencies. Until 2017 they dominated machine translation, text generation, and time-series forecasting; since being superseded by Transformers, they have largely become a legacy architecture.

LSTM: Long Short-Term Memory Networks

LSTMs were introduced by Hochreiter and Schmidhuber (1997) to solve the RNN's vanishing gradient problem. They use gating mechanisms to control information flow, as captured by the equations and the code sketch below.

  • Hidden State $h_t$ is calculated by: $h_t = o_t \circ \tanh(C_t)$

    • $C_t$ is the cell state update, which combines the previous cell state (scaled by the Forget Gate) with the Candidate Memory (scaled by the Input Gate): $C_t = f_t \circ C_{t-1} + i_t \circ \bar{C}_t$

    • $\bar{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$, which is called the Candidate Memory.

  • Forget Gate determines what to remove from memory: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

  • Input Gate determines what new information to store: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

  • Output Gate determines what to output based on memory: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
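
Below is a minimal NumPy sketch of one LSTM cell step following the gate equations above; the concatenated $[h_{t-1}, x_t]$ parameterization, dimensions, and initialization are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C):
    """One LSTM step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)               # forget gate: what to drop from the cell state
    i_t = sigmoid(W_i @ z + b_i)               # input gate: what new information to store
    o_t = sigmoid(W_o @ z + b_o)               # output gate: what to expose as the hidden state
    C_bar = np.tanh(W_C @ z + b_C)             # candidate memory
    C_t = f_t * C_prev + i_t * C_bar           # cell state update
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

# Illustrative sizes (assumed): 4-dim input, 8-dim hidden/cell state.
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
shape = (hidden_dim, hidden_dim + input_dim)
W_f, W_i, W_o, W_C = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f, b_i, b_o, b_C = (np.zeros(hidden_dim) for _ in range(4))

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in [rng.normal(size=input_dim) for _ in range(5)]:
    h, C = lstm_step(x, h, C, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C)
```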

Despite these gates, LSTMs still face significant challenges in capturing sequential structure effectively. Their fixed architecture may not adapt well to all kinds of sequential dependencies, particularly long-range hierarchical patterns. They also scale poorly: adding new gates increases the parameter count without necessarily improving performance.

To address these limitations, more flexible architectures have been developed:

  • NTMs (Neural Turing Machines) and DNCs (Differentiable Neural Computers) use external memory banks with soft attention for read/write operations. However, their high computational costs make them impractical for most modern applications.

  • The Attention Mechanism and the Transformer architecture have superseded recurrence by implementing self-attention, which better captures long-range dependencies without relying on fixed gates (see the self-attention sketch after this list).

    • Some Transformer variants employ a different type of "gating" called a gating network, as used in Mixture-of-Experts layers. This differs from LSTM gating: while LSTMs gate memory content, a gating network selects which expert sub-network the model should use.

  • Highway Networks and adaptive gating use learnable gating functions to control information flow across layers, which is more flexible than the LSTM's fixed gates.

  • LTCs (Liquid Time-Constant Networks) use neural ODEs to model continuous-time dynamics, which enables input-dependent gating.

  • Memory-Augmented Neural Networks (MANNs) combine a sequence model (such as a Transformer) with dynamic memory slots for explicit pattern storage.
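
As a minimal illustration of the self-attention mechanism mentioned above, the NumPy sketch below computes single-head scaled dot-product attention over a short sequence; the projection matrices and dimensions are assumptions for the example, not a full Transformer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every position scores every other position
    weights = softmax(scores, axis=-1)         # attention weights, one row per query position
    return weights @ V                         # weighted sum of values

# Illustrative sizes (assumed): sequence length 6, model width 8.
T, d_model = 6, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)         # shape (T, d_model)
```

Because every position attends directly to every other position, long-range dependencies do not have to survive a long chain of gated recurrent updates.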

However, the concepts of "learnable gating" and "learnable long-term memory" remain relevant in some Transformer architectures, which also process sequential data.


Not only LSTMs and RNNs but almost all sequential-data models rely heavily on training remedies such as batch/layer normalization, careful weight initialization, and gradient clipping. This makes their architectures rather rigid and hard to scale across different data types.
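
As an example of one such remedy, the NumPy sketch below clips gradients by their global norm; the threshold and gradient shapes are arbitrary values chosen for illustration.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)     # shrink all gradients by the same factor
        grads = [g * scale for g in grads]
    return grads

# Example: two exploding gradient arrays are scaled back to the threshold norm.
rng = np.random.default_rng(0)
grads = [rng.normal(scale=100.0, size=(8, 4)), rng.normal(scale=100.0, size=(8,))]
clipped = clip_gradients(grads, max_norm=5.0)
```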
