Attention Is All You Need

Before the Transformer architecture, most sequence transduction models relied on RNNs and CNNs in an encoder-decoder structure, and some of them also incorporated attention mechanisms. The Transformer, however, is based solely on attention, dispensing with recurrence and convolution entirely.

Introduction

RNNs, LSTMs, and GRUs have been firmly established as state-of-the-art approaches to sequence modeling and transduction problems in NLP.

  • RNNs process input and output sequences symbol by symbol. By aligning positions with computational time steps, they generate a sequence of hidden states $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input at position $t$. However, this sequential approach prevents parallelization, which becomes critical with longer sequences, as memory constraints limit batch processing across examples.

  • Attention Mechanisms have become an integral part of sequence modeling, allowing dependencies to be modeled regardless of their distance in the input or output sequences. They are most often used in conjunction with RNNs.

The Transformer, by contrast, eschews recurrence and draws global dependencies between input and output by relying entirely on attention mechanisms.

Model Architecture

Most transduction models have an encoder-decoder structure. The Transformer also follows this architecture: the encoder maps an input sequence $(x_1, \ldots, x_n)$ to a sequence of continuous representations $z = (z_1, \ldots, z_n)$, while the decoder produces $y = (y_1, \ldots, y_m)$ given $z$.

Figure: The Transformer model architecture.

Encoder and Decoder Stacks

The encoder consists of $N = 6$ identical layers. Each layer contains two sublayers: a multi-head self-attention mechanism and a simple, fully connected feed-forward network. These sublayers are connected through residual connections and layer normalization.

  1. Multi-Head Attention Layer: multi-head attention mechanism → residual connection → layer normalization

  2. Feed Forward Layer (encoder): feed-forward network → residual connection → layer normalization

The decoder also consists of six identical layers, but with three sublayers each. Like the encoder, it wraps every sublayer with a residual connection and layer normalization (a minimal sketch of this shared wrapper follows the list below). The decoder's attention is distinctive in that it combines information from two sources: its own previous outputs, through a "masked multi-head self-attention mechanism", and the encoder's output, through "encoder-decoder attention" (also called the cross-attention layer).

  1. Masked Attention Layer: masked self-attention → residual connection → layer normalization

  2. Cross Attention Layer: multi-head attention mechanism → residual connection → layer normalization

  3. Feed Forward Layer (decoder): feed-forward network → residual connection → layer normalization
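
Every sublayer in both stacks follows the same wrapper pattern: apply the sublayer, add the input back through the residual connection, then layer-normalize. Here is a minimal NumPy sketch of that wrapper, assuming an arbitrary sublayer callable (the learnable scale and shift of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    # (The learnable gain and bias parameters are omitted for brevity.)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sublayer, followed by layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# Example: an encoder layer chains two such wrappers (attention, then FFN):
# x = add_and_norm(x, self_attention_fn)
# x = add_and_norm(x, feed_forward_fn)
```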

The Transformer architecture has many variants, such as BERT-like and GPT-like models. BERT-like models use only the encoder, while GPT-like models use only the decoder; with no encoder output to attend to, the cross-attention sublayer is dropped and each layer works only with the previous layer's output.

Attention Calculation

Self-attention is a variant of the attention mechanism; an early form was introduced in "A Decomposable Attention Model for Natural Language Inference" (2016) under the name "decomposable attention". It later became the key component of the Transformer architecture, which underlies modern AI systems.

  • Input $X$ is sequential data where the model must learn the meaning of each component.

  • Self-attention uses three learnable ("soft") weight matrices:

    • weights on the queries $W^Q$ learn information about "what to look for".

    • weights on the keys $W^K$ learn information about "what I have".

    • weights on the values $W^V$ are applied to produce the output.

  • Typically, the matrices $W^Q, W^K, W^V$ share the same shape, which is determined by the embedding dimension and the number of heads.

  1. The attention layer multiplies $X$ with $W^Q, W^K, W^V$ to create the following three matrices:

    1. $Q = X \cdot W^Q$

    2. $K = X \cdot W^K$

    3. $V = X \cdot W^V$

  2. Calculate the raw attention scores using: $A_{i,j} = Q_i \cdot K_j^{\top}$

  3. For large $d_k$, the dot products grow large in magnitude and push the soft-max into regions with extremely small gradients, so the scores are scaled down by $\sqrt{d_k}$, where $d_k = \frac{\text{embedding dimension}}{\text{number of attention heads}}$. This mechanism is called "scaled dot-product attention": $S = \frac{QK^{\top}}{\sqrt{d_k}}$

  4. To transform these scores into a probability distribution, apply the soft-max function: $A = \text{softmax}(S) = \text{softmax}\big(\frac{QK^{\top}}{\sqrt{d_k}}\big)$

  5. Finally, multiply these weights with $V$ to produce the output: $\text{Self-Attention}(X) = A \cdot V$
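
A minimal NumPy sketch of the five steps above; the function names, shapes, and random toy inputs are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X:             (seq_len, d_model) input token embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q = X @ W_q                    # step 1: queries
    K = X @ W_k                    #         keys
    V = X @ W_v                    #         values
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)     # steps 2-3: scaled dot-product scores
    A = softmax(S, axis=-1)        # step 4: attention weights (each row sums to 1)
    return A @ V                   # step 5: weighted sum of the values

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)   # shape: (seq_len, d_k)
```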

The attention mechanism works like a "learnable database" that matches queries against keys to decide which values to retrieve. Because queries and keys come from separate weight matrices, the relationship between them can be asymmetric, even though the same matrix computation is applied uniformly across all token positions.

Multi-head attention was introduced as a core component of the transformer architecture. It enhances the ability to capture relationships and differences between input components by employing multiple attention units in parallel.

$$\text{Multi-Head}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\,W^O \\ \text{where each head is: } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
  • The multi-head attention mechanism employs multiple units called "attention heads" to compute a variety of attention patterns in parallel.

    • $h$ is the number of attention heads; the subscript $i$ indexes each head's queries, keys, and values.

    • Scaled dot-product attention is applied within each head. The outputs of all heads are concatenated and then linearly projected.

  • $W_i^Q, W_i^K, W_i^V$ are the soft weight matrices for each attention head, while $W^O$ is the soft weight matrix for the final output projection.

  • This parallel mechanism lets the model jointly attend to information from different representation subspaces at different positions, which a single attention head would average away.

  • High-performance AI models like GPT, BERT in NLP, ViT in image recognition, and Whisper in speech recognition employ multi-head attention as a core component.
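
A minimal NumPy sketch of the multi-head formula above; the head count, dimensions, and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of per-head projections, each (d_model, d_k);
    W_o: (h * d_k, d_model) output projection."""
    heads = [attention(X @ wq, X @ wk, X @ wv)      # each head: (seq_len, d_k)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o     # concatenate, then project

# Toy usage with d_k = d_model / h, as described above.
rng = np.random.default_rng(1)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = ([rng.normal(size=(d_model, d_k)) for _ in range(h)] for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)   # shape: (seq_len, d_model)
```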

The calculation logic of attention remains the same throughout the model, but the three attention sublayers differ in where their queries, keys, and values come from.

  • The Multi-Head Self-Attention Layer in the encoder derives its queries, keys, and values from the input $x$.

  • The Cross-Attention (Encoder-Decoder Attention) Layer takes its queries from the previous masked multi-head self-attention sublayer, while obtaining its keys and values from the encoder's output.

  • In the Masked Multi-Head Self-Attention Layer, queries, keys, and values are all derived from the previous layer. However, the layer employs "masking" to limit attention to specific sections of the input sequence, thus ignoring future or irrelevant tokens.

Masking is a crucial technique: it determines a model's directionality and is what makes autoregressive generation possible, since a masked position can only attend to earlier positions.

  • BERT, which doesn't require unidirectional processing, uses masking only during pretraining, at the token level (masked language modeling). In this phase, 15% of input tokens are selected for masking; of those, 80% are replaced with the [MASK] token, 10% with random tokens, and 10% remain unchanged.

  • GPT, which requires unidirectional processing, applies a causal attention mask in every layer of the decoder stack.
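
A minimal NumPy sketch of the causal (look-ahead) mask used in the decoder's masked self-attention, under the assumption that future positions are blocked by adding a large negative value to their scores before the soft-max:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    # Upper-triangular mask: position i may only attend to positions <= i.
    return np.triu(np.full((seq_len, seq_len), -1e9), k=1)

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)
    S = S + causal_mask(S.shape[0])   # block attention to future tokens
    return softmax(S) @ V             # masked positions get ~0 weight

# Toy usage (illustrative shapes and random data).
rng = np.random.default_rng(3)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out = masked_self_attention(Q, K, V)  # row t only mixes values from positions <= t
```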

We will discuss this in more detail later.

Position-wise Feed-Forward Network

Each layer in both the encoder and decoder contains a position-wise fully connected feed-forward network: two linear transformations with a ReLU activation in between, applied identically to each position.

$$\text{FFN}(x) = \max(0,\, x \cdot W_1 + b_1) \cdot W_2 + b_2$$

The FFNs in the encoder and decoder occupy the same position in each layer and have the same shape, but they use different parameters from layer to layer.
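
A minimal NumPy sketch of the position-wise FFN; the inner dimension of 2048 and model dimension of 512 follow the paper's base configuration, while the random weights are illustrative:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between,
    # applied identically (and independently) at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 512, 2048, 4    # base-model sizes from the paper
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = ffn(x, W1, b1, W2, b2)             # shape: (seq_len, d_model)
```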

Embedding and Soft-max after Fully-Connected Layer

Embedding and soft-max after a fully-connected layer are positioned at the beginning and end of the architecture, respectively. Each embedding uses learned parameters to convert the input/output tokens into vectors of dimension $d_{\text{model}}$. The soft-max function converts the decoder output, after a linear transformation via a fully-connected layer, into predicted next-token probabilities.

Positional Encoding

Since the model has no recurrence or convolution, positional information must be injected into the input to maintain sequence order. The positional encoding must match the dimension $d_{\text{model}}$, though there are several encoding methods to choose from.

$$\overrightarrow{p_t} = \begin{cases} \text{PE}_{(\text{pos},\, 2i)} = \sin\big(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\big) & \text{for even dimensions} \\ \text{PE}_{(\text{pos},\, 2i+1)} = \cos\big(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\big) & \text{for odd dimensions} \end{cases}$$
  • $\text{pos}$ is the position of the token, and $i$ indexes each sine/cosine pair of dimensions: even dimensions use sine, odd dimensions use cosine.

  • $d_{\text{model}}$ is the total number of dimensions.
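
A minimal NumPy sketch of the sinusoidal encoding above, assuming an even $d_{\text{model}}$; the sequence length and dimensions are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding, shape (seq_len, d_model). Assumes even d_model."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # index of each sin/cos pair
    angles = pos / np.power(10000, 2 * i / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings:
# X = token_embeddings + positional_encoding(seq_len, d_model)
```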

Check out the code I made!
