Block-State Transformers
SSMs have shown impressive results on tasks such as long-range dependency (LRD) modeling and long-sequence learning, but they still lag behind Transformers on language modeling tasks. This work proposes a hybrid layer named BST (Block-State Transformer), which internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences. It includes three different, fully parallelizable variants that integrate SSMs with block-wise attention.
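To make the high-level idea concrete, here is a minimal, hedged sketch of the hybrid pattern, not the paper's exact layer: an SSM-style causal convolution produces long-range context over the whole sequence, and softmax attention is then applied independently inside fixed-size blocks, so all blocks can run in parallel. The function names, shapes, toy kernel, and the way the context is fed to attention are all illustrative assumptions.

```python
import numpy as np

def ssm_context(u, kernel):
    """Causal 1-D convolution with a (toy) SSM kernel: long-range context sublayer."""
    L, D = u.shape
    out = np.zeros_like(u)
    for t in range(L):
        for j in range(t + 1):
            out[t] += kernel[j] * u[t - j]
    return out

def block_attention(x, context, block_len):
    """Softmax attention inside each block; keys/values also include the SSM context.
    Simplified: no causal mask or learned projections inside the block."""
    L, D = x.shape
    y = np.zeros_like(x)
    for start in range(0, L, block_len):
        blk = slice(start, start + block_len)
        q = x[blk]                                   # queries: tokens of the current block
        kv = np.concatenate([context[blk], x[blk]])  # long-range context + short-term tokens
        scores = q @ kv.T / np.sqrt(D)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        y[blk] = w @ kv
    return y

L, D, block_len = 64, 8, 16
rng = np.random.default_rng(0)
u = rng.standard_normal((L, D))
kernel = 0.1 * np.exp(-0.1 * np.arange(L))  # toy decaying kernel standing in for a learned SSM
y = block_attention(u, ssm_context(u, kernel), block_len)
print(y.shape)  # (64, 8)
```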
Introduction
Transformers outperform on a wide range of NLP tasks and have largely replaced RNNs. The benefits of Transformer self-attention are twofold: first, the capacity of what can be stored and directly accessed as context is drastically increased; second, training on longer sequences is more stable.
While Transformers achieve SOTA results on reasoning and question answering, the demand for deploying ever deeper and larger networks is now a great concern. Despite the several advantages of Transformers over RNNs, it is still problematic to scale the input sequence length. First, the Transformer's runtime is quadratic with respect to the input sequence length, which makes training these models increasingly expensive. Second, Transformers struggle on simple long-input classification tasks; although there are solutions for this, vanilla Transformers can be unstable when trained on long sequences, and token importance tends to concentrate in a local receptive field of around 50 tokens around the current time step.
An emerging body of research suggests that SSMs can serve as an alternative to Transformers because they are able to capture extremely long-range dependencies while being more computationally efficient and more parallelizable.
Method
State Space Preliminaries

State Spaces (structured kernels): S4, S5, S4D, and DSS follow a structured initialization of the convolutional kernel by unrolling a linear time-invariant (LTI) dynamical system of the following form (note, however, that BST also employs filter parameterizations that differ from this standard LTI form; see Parameterized Filters below):

$$x_k = \mathbf{A}x_{k-1} + \mathbf{B}u_k, \qquad y_k = \mathbf{C}x_k + \mathbf{D}u_k$$
Definition and Initialization: The system is parameterized by a state matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$, vectors $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$, and $\mathbf{D} \in \mathbb{R}^{1 \times 1}$.
The SSM maps a 1-D input signal $u_k$ to a 1-D output signal $y_k$.
Internally, it projects the input signal to an $N$-D representation state $x_k$ before mapping it back down to a scalar using the $\mathbf{C}$ matrix.
The term $\mathbf{D}u_k$ can be thought of as a skip connection (much like a gating/residual path).
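The recurrence above can be sketched directly in NumPy. The diagonal toy state matrix, the shapes, and the random initialization below are illustrative assumptions; real SSMs such as S4 use carefully structured and discretized parameters.

```python
import numpy as np

def ssm_recurrence(A, B, C, D, u):
    """x_k = A x_{k-1} + B u_k ;  y_k = C x_k + D u_k, starting from x_{-1} = 0."""
    x = np.zeros((A.shape[0], 1))
    ys = []
    for u_k in u:
        x = A @ x + B * u_k                  # project the scalar input into the N-D state
        ys.append((C @ x).item() + D * u_k)  # map the state back down to a scalar, plus skip term
    return np.array(ys)

N, L = 4, 16
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.5, 0.9, N))        # stable toy state matrix (assumption, not S4's init)
B, C = rng.standard_normal((N, 1)), rng.standard_normal((1, N))
D = rng.standard_normal()                    # scalar skip weight
u = rng.standard_normal(L)
y = ssm_recurrence(A, B, C, D, u)
print(y.shape)  # (16,)
```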
The output of the above recurrent equation, $y_k$, can be computed as a discrete convolution:

$$y_k = \sum_{j=0}^{k} \mathbf{C}\mathbf{A}^{j}\mathbf{B}\,u_{k-j} + \mathbf{D}u_k$$

The $\mathbf{C}\mathbf{A}^{k}\mathbf{B}$ entries are collected to create the SSM kernel $\bar{K} \in \mathbb{R}^{L}$, where $L$ is the sequence length.
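The equivalence between the recurrent and convolutional views can be checked numerically with the same kind of toy setup; again, the diagonal $\mathbf{A}$ and random parameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16
A = np.diag(rng.uniform(0.5, 0.9, N))   # toy diagonal state matrix (assumption)
B, C = rng.standard_normal((N, 1)), rng.standard_normal((1, N))
D = rng.standard_normal()
u = rng.standard_normal(L)

# Collect the kernel entries C A^j B for j = 0 .. L-1 into K
K, AjB = [], B
for _ in range(L):
    K.append((C @ AjB).item())
    AjB = A @ AjB
K = np.array(K)

# Convolutional form: y_k = sum_{j<=k} C A^j B u_{k-j} + D u_k
y_conv = np.array([sum(K[j] * u[k - j] for j in range(k + 1)) + D * u[k] for k in range(L)])

# Recurrent form for comparison
x, y_rec = np.zeros((N, 1)), []
for u_k in u:
    x = A @ x + B * u_k
    y_rec.append((C @ x).item() + D * u_k)

assert np.allclose(y_conv, np.array(y_rec))  # both views give the same output
```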
Parameterized Filters
Instead of a structured kernel, the convolution kernel can also be parameterized directly as trainable weights. To keep such a kernel well behaved, several smoothing techniques are used, such as an exponentially decaying window (a form of regularization) and expressing the kernel as a feed-forward function of positional encodings (PE), which can be expressed as:

$$\bar{K}_t = e^{-\alpha t} \cdot (\mathrm{FFN} \circ \mathrm{PE})(t)$$
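Below is a minimal sketch of such an explicitly parameterized filter, assuming the decaying-window form written above; the sinusoidal positional encoding, FFN width, and decay rate $\alpha$ are illustrative choices, and in practice the FFN weights would be trained rather than sampled.

```python
import numpy as np

def positional_encoding(L, dim):
    """Sinusoidal positional encodings PE(t) for t = 0 .. L-1 (illustrative choice)."""
    t = np.arange(L)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)], axis=-1)  # (L, dim)

def parameterized_kernel(L, dim=16, hidden=32, alpha=0.02, seed=0):
    """K_t = exp(-alpha * t) * FFN(PE(t)); here the FFN weights are random stand-ins."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((dim, hidden)) / np.sqrt(dim)
    W2 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
    ffn = np.tanh(positional_encoding(L, dim) @ W1) @ W2     # kernel as an implicit function of t
    window = np.exp(-alpha * np.arange(L))[:, None]          # decaying window keeps the filter smooth
    return (window * ffn).squeeze(-1)                        # length-L convolution kernel

K = parameterized_kernel(L=512)
print(K.shape)  # (512,)
```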