Attention Is All You Need
Before the Transformer architecture, most sequence transduction models relied on RNNs and CNNs with an encoder-decoder structure, some of them augmented with attention mechanisms. The Transformer, however, is based solely on attention mechanisms, eliminating the need for recurrence and convolution entirely.
Introduction
RNNs, LSTMs, and GRUs (gated recurrent units) have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems for NLP.
RNNs process input and output sequences symbol by symbol. By aligning positions with computational time steps, they generate a sequence of hidden states $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input at position $t$. However, this sequential approach prevents parallelization, which becomes critical with longer sequences, as memory constraints limit batch processing across examples.
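As a rough illustration (not taken from the paper), the toy NumPy loop below shows why this recurrence is inherently sequential: each hidden state depends on the previous one, so time steps cannot be computed in parallel. The tanh cell and the dimensions are arbitrary choices for the sketch.

```python
import numpy as np

def rnn_forward(inputs, W_h, W_x, b):
    """Toy RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b).
    The loop over time steps is strictly sequential."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:              # cannot be parallelized across t
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

# 5 time steps, 3-dimensional inputs, 4-dimensional hidden state
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
H = rnn_forward(xs, rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4))
print(H.shape)  # (5, 4)
```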
Attention mechanisms have become an integral part of sequence modeling, allowing dependencies to be modeled regardless of their distance in the input or output sequences. They are often used in conjunction with RNNs.
Model Architecture
Most transduction models have an encoder-decoder structure. The Transformer also follows this architecture: the encoder maps an input sequence $(x_1, \dots, x_n)$ to a sequence of continuous representations $z = (z_1, \dots, z_n)$, while the decoder produces an output sequence $(y_1, \dots, y_m)$ given $z$.
Encoder and Decoder Stacks
The encoder consists of six identical layers. Each layer contains two sublayers: a multi-head self-attention mechanism and a simple, fully connected feed-forward network. These sublayers are connected through residual connections and layer normalization.
Multi-Head Attention Layer: multi-head attention mechanism → residual connection → layer normalization
Feed Forward Layer(encoder): feed-forward network → residual connection → layer normalization
The decoder also consists of six identical layers, but with three sublayers each. Like the encoder, it uses residual connections and layer normalization. The decoder's attention layer is unique—it combines information from two sources: its own previous outputs through a "masked multi-head self-attention mechanism" and the encoder's outputs via an "encoder/decoder attention (or cross attention layer)".
Masked Attention Layer: masked self-attention → residual connection → layer normalization
Cross Attention Layer: multi-head attention mechanism → residual connection → layer normalization
Feed Forward Layer(decoder): feed-forward network → residual connection → layer normalization
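Below is a simplified NumPy mock-up (not the reference implementation) of how both layer types wrap each sublayer in a residual connection followed by layer normalization. The attention and feed-forward functions are passed in as placeholders, and the learned gain and bias of real layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector (learned gain/bias omitted)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def sublayer(x, fn):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + fn(x))

def encoder_layer(x, self_attention, feed_forward):
    x = sublayer(x, self_attention)   # multi-head self-attention → residual → layer norm
    x = sublayer(x, feed_forward)     # feed-forward network → residual → layer norm
    return x

def decoder_layer(x, memory, masked_self_attention, cross_attention, feed_forward):
    """`memory` is the encoder output consumed by the cross-attention sublayer."""
    x = sublayer(x, masked_self_attention)                 # masked self-attention → residual → layer norm
    x = sublayer(x, lambda q: cross_attention(q, memory))  # cross attention → residual → layer norm
    x = sublayer(x, feed_forward)                          # feed-forward network → residual → layer norm
    return x

# Shape check with identity placeholders:
rng = np.random.default_rng(0)
x, memory = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
print(encoder_layer(x, lambda t: t, lambda t: t).shape)                          # (4, 8)
print(decoder_layer(x, memory, lambda t: t, lambda q, m: q, lambda t: t).shape)  # (4, 8)
```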
Attention Calculation
Self-attention is a variant of the attention mechanism that was introduced in "A Decomposable Attention Model for Natural Language Inference" (2016) under the name "decomposable attention". Later, it became the key component of the Transformer architecture, which revolutionized modern AI systems.
The input is sequential data in which the model must learn the meaning of each component.
Self-attention consists of these three soft weights:
$W_Q$, the weights on the queries, learns information about "what to look for".
$W_K$, the weights on the keys, learns information about "what I have".
$W_V$, the weights on the values, is applied to produce the output.
Typically, the matrices $W_Q$, $W_K$, and $W_V$ share the same shape, which is determined by the embedding dimension $d_{model}$ and the number of heads $h$.
The attention layer multiplies the input $X$ with these weights to create the following three matrices: $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
Calculate the raw attention scores using the dot product of queries and keys: $QK^T$.
To keep large dot products from pushing the soft-max into regions with extremely small gradients, $\frac{1}{\sqrt{d_k}}$ is used to scale down the attention scores. This mechanism is called "scaled dot-product attention": $\frac{QK^T}{\sqrt{d_k}}$.
To transform these scores into a probability distribution, apply the soft-max function: $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$.
Finally, multiply these weights with $V$ to produce the output: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
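As a minimal NumPy sketch of these steps (the input $X$, weight shapes, and random values below are illustrative only):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # raw scores, scaled down by sqrt(d_k)
    weights = softmax(scores)         # one probability distribution per query
    return weights @ V                # weighted sum of the values

# Example: a sequence X of 4 tokens with embedding dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```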
Multi-head attention was introduced as a core component of the transformer architecture. It enhances the ability to capture relationships and differences between input components by employing multiple attention units in parallel.
The multi-head attention mechanism employs multiple units called "attention heads" to calculate various attention patterns.
The number of heads $h$ is used to index each attention head's queries, keys, and values: $Q_i$, $K_i$, $V_i$ for $i = 1, \dots, h$.
Scaled dot-product attention is applied to each head. The outputs of the heads are concatenated and passed through a final linear projection:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$, where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
$W_i^Q$, $W_i^K$, and $W_i^V$ are the soft weight matrices for each attention head, while $W^O$ is the soft weight matrix for the final output.
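A compact NumPy sketch of this computation. It assumes the per-head projections are taken as equal slices of single $d_{model} \times d_{model}$ weight matrices, which is equivalent in practice to separate $W_i^Q$, $W_i^K$, $W_i^V$ of shape $d_{model} \times d_k$; the small helpers are repeated so the snippet runs on its own.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O, h):
    """Concat(head_1, ..., head_h) W_O, each head attending over its own
    d_k = d_model / h slice of the projected queries, keys, and values."""
    d_k = W_Q.shape[1] // h
    Q, K, V = X_q @ W_Q, X_kv @ W_K, X_kv @ W_V
    heads = [attention(Q[:, i*d_k:(i+1)*d_k],
                       K[:, i*d_k:(i+1)*d_k],
                       V[:, i*d_k:(i+1)*d_k]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O   # concatenate, then project with W_O

# Example: 4 tokens, d_model = 8, h = 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V, W_O = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, X, W_Q, W_K, W_V, W_O, h=2).shape)  # (4, 8)
```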
The calculation logic of attention remains consistent, but the encoder/decoder attention in the decoder stacks operates uniquely.
The Multi-Head Self-Attention Layer in the encoder derives its queries, keys, and values from the same input $X$.
The Cross Multi-Head Attention Layer takes its queries from the previous masked multi-head self-attention layer, while obtaining its keys and values from the encoder's output.
In the Masked Multi-Head Self-Attention Layer, queries, keys, and values are all derived from the previous layer. However, the layer employs "masking" to limit attention to specific sections of the input sequence, thus ignoring future or irrelevant tokens.
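A small sketch of one common way to build such a mask: an additive matrix with $-\infty$ above the diagonal, added to the raw scores before the soft-max so masked positions receive zero weight. The comments also summarize where each attention layer takes its queries, keys, and values from.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: position i may attend only to positions j <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

# Applied before the soft-max, e.g.
#   scores = Q @ K.T / np.sqrt(d_k) + causal_mask(len(Q))
# so that masked entries get attention weight 0 after the soft-max.
print(causal_mask(4))

# Where Q, K, V come from in each attention layer:
#   encoder self-attention:            Q, K, V all from the encoder input X
#   decoder masked self-attention:     Q, K, V all from the previous decoder sublayer
#   encoder/decoder (cross) attention: Q from the decoder, K and V from the encoder output
```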
Masking is a crucial technique that determines a model's directionality and enhances its autoregressive capabilities.
BERT, which doesn't require unidirectional processing, uses masking only during pretraining. During this phase, 15% of input tokens undergo masking—80% are replaced with mask tokens, 10% with random tokens, and 10% remain unchanged.
GPT, which requires unidirectional processing, applies masking in every layer of its stack.
We will discuss this in more detail later.
Position-wise Feed-Forward Network
Each layer in both the encoder and decoder contains a position-wise, fully connected feed-forward network: two linear transformations with a ReLU activation in between, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$.
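A minimal sketch of this network in NumPy; the sizes $d_{model} = 512$ and inner dimension $d_{ff} = 2048$ are those reported in the paper, while the random weights are placeholders.

```python
import numpy as np

def feed_forward(x, W_1, b_1, W_2, b_2):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to each position independently."""
    return np.maximum(0, x @ W_1 + b_1) @ W_2 + b_2

# Example with the paper's sizes: d_model = 512, d_ff = 2048
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))                               # 4 positions
W_1, b_1 = rng.normal(size=(512, 2048)) * 0.02, np.zeros(2048)
W_2, b_2 = rng.normal(size=(2048, 512)) * 0.02, np.zeros(512)
print(feed_forward(x, W_1, b_1, W_2, b_2).shape)            # (4, 512)
```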
Embedding and Soft-max after Fully-Connected Layer
Embedding and soft-max after a fully-connected layer are positioned at the beginning and end of the architecture. Each embedding uses learned parameters to convert the input/output tokens into vectors of dimension $d_{model}$. The soft-max function converts the decoder output, after linear transformation via a fully-connected layer, into predicted next-token probabilities.
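A toy sketch of this input/output path. The vocabulary size and weights are made up; reusing the embedding matrix as the pre-soft-max projection, and scaling embeddings by $\sqrt{d_{model}}$, mirror the weight sharing described in the paper.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def embed(token_ids, E):
    """Learned embedding lookup, scaled by sqrt(d_model) as in the paper."""
    return E[token_ids] * np.sqrt(E.shape[1])

def next_token_probs(decoder_output, W_proj):
    """Fully-connected (linear) projection to vocabulary logits, then soft-max."""
    return softmax(decoder_output @ W_proj)

# Toy sizes: vocabulary of 10 tokens, d_model = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(10, 8))                 # shared embedding matrix
x = embed(np.array([3, 1, 4]), E)            # (3, 8)
probs = next_token_probs(x, E.T)             # (3, 10); E reused as output projection
print(probs.shape, probs.sum(axis=-1))       # each row sums to 1
```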
Positional Encoding
Since the model has no recurrence or convolution, positional information must be injected into the input to maintain sequence order. The positional encoding must match the dimension $d_{model}$, though there are several encoding methods to choose from.
The paper uses sinusoidal positional encodings:
$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$
$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$
where $pos$ represents the position of the token, $i$ is the dimension index, and $d_{model}$ is the total number of dimensions.
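A direct NumPy transcription of these formulas (assuming an even $d_{model}$; the sequence length is arbitrary):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# Added to the token embeddings before the first layer
pe = positional_encoding(n_positions=50, d_model=512)
print(pe.shape)  # (50, 512)
```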