An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Check out the original paper!
Self-attention-based architectures, in particular Transformer, have become the model of choice in NLP. The dominant approach is to pretrain on a large text corpus and then finetune on a smaller task-specific dataset. Thanks to Transformer’s computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters.
In computer vision, however, convolutional architectures remain dominant. Inspired by NLP successes, multiple works have tried combining CNN-like architectures with self-attention, and some replace convolutions entirely. Because these latter models rely on specialized attention patterns, they have not yet been scaled effectively on modern hardware accelerators. Therefore, in large-scale image recognition, classic ResNet-like architectures were still state of the art when the paper was written (today, ViTs have already surpassed ResNet-like architectures).
Introduction
The researchers experimented with applying a standard Transformer directly to images with minimal modifications. They split an image into patches and provide the sequence of linear embeddings of these patches as input to a Transformer. Image patches are treated the same way as tokens in NLP applications. The model is trained on image classification in a supervised fashion.
When trained on mid-sized datasets like ImageNet without strong regularization, these models achieve modest accuracies—a few percentage points below ResNets of comparable size. This seemingly discouraging result is expected: Transformers lack the inductive biases inherent to CNNs, such as translation equivariance and locality. As a result, they don't generalize well when trained on insufficient data.
However, the picture changes when models are trained on larger datasets (14–300M images). Large-scale training trumps inductive bias. The Vision Transformer (ViT) attains excellent results when pretrained at sufficient scale and transferred to tasks with fewer datapoints. When pretrained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state-of-the-art image recognition benchmarks.
Method
The model design follows the original Transformer as closely as possible. This intentionally simple setup offers a key advantage: scalable NLP architectures and their efficient implementations can be used almost out of the box.
Figure 1 shows an overview of the model. The standard Transformer receives a 1D sequence of token embeddings as input. To handle 2D images, we reshape the image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where:
- $C$ is the number of channels.
- $(H, W)$ is the resolution of the original image (height and width).
- $(P, P)$ is the resolution of each image patch.
- $N = HW/P^2$ is the resulting number of patches, which also serves as the input sequence length for the Transformer (e.g., a 224×224 image split into 16×16 patches gives $N = 196$).
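To make the reshape concrete, here is a minimal sketch in PyTorch (an assumption; the paper does not prescribe a framework), using hypothetical sizes H = W = 224, P = 16, C = 3:

```python
import torch

def patchify(x: torch.Tensor, P: int) -> torch.Tensor:
    """Reshape an image (C, H, W) into a sequence of N = HW/P^2 flattened patches."""
    C, H, W = x.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    # (C, H/P, P, W/P, P) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = x.reshape(C, H // P, P, W // P, P)
    patches = patches.permute(1, 3, 2, 4, 0).reshape(-1, P * P * C)
    return patches

x = torch.randn(3, 224, 224)   # one RGB image (hypothetical input)
x_p = patchify(x, P=16)        # shape (196, 768): N = 224*224/16^2 patches, each of dim P^2 * C
```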
Like BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches ($z_0^0 = \mathbf{x}_\text{class}$). Its state at the output of the Transformer encoder ($z_L^0$) serves as the image representation $y$.
$$z_0 = [\mathbf{x}_\text{class};\, \mathbf{x}_p^1 \mathbf{E};\, \mathbf{x}_p^2 \mathbf{E};\, \dots;\, \mathbf{x}_p^N \mathbf{E}] + \mathbf{E}_{pos}, \qquad \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D}$$

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad l = 1, \dots, L$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \qquad l = 1, \dots, L$$

$$y = \mathrm{LN}(z_L^0)$$
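As a rough illustration of the four equations above, the sketch below assembles the class token, the position embeddings, and the pre-norm encoder blocks in PyTorch. The class name `ViTEncoder` and the hyperparameters (D = 768, L = 12, 12 heads, MLP width 3072) are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    def __init__(self, num_patches: int, patch_dim: int, D: int = 768,
                 L: int = 12, heads: int = 12, mlp_dim: int = 3072):
        super().__init__()
        self.E = nn.Linear(patch_dim, D)                                       # patch embedding E
        self.x_class = nn.Parameter(torch.zeros(1, 1, D))                      # learnable [class] token
        self.E_pos = nn.Parameter(torch.randn(1, num_patches + 1, D) * 0.02)   # learnable E_pos
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "ln1": nn.LayerNorm(D),
                "msa": nn.MultiheadAttention(D, heads, batch_first=True),
                "ln2": nn.LayerNorm(D),
                "mlp": nn.Sequential(nn.Linear(D, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, D)),
            })
            for _ in range(L)
        ])
        self.ln = nn.LayerNorm(D)

    def forward(self, x_p: torch.Tensor) -> torch.Tensor:
        B = x_p.shape[0]
        # z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
        z = torch.cat([self.x_class.expand(B, -1, -1), self.E(x_p)], dim=1) + self.E_pos
        for blk in self.blocks:
            # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
            h = blk["ln1"](z)
            z = blk["msa"](h, h, h, need_weights=False)[0] + z
            # z_l = MLP(LN(z'_l)) + z'_l
            z = blk["mlp"](blk["ln2"](z)) + z
        # y = LN(z_L^0): the image representation taken from the [class] token
        return self.ln(z[:, 0])

# Usage: x_p has shape (batch, N, P^2 * C), e.g. (8, 196, 768) for 224x224 images with P = 16.
y = ViTEncoder(num_patches=196, patch_dim=768)(torch.randn(8, 196, 768))  # -> (8, 768)
```

A classification head (a single linear layer or a small MLP on top of $y$) would then produce the class logits, but the encoder above is the part described by the equations.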
The Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, 2D neighborhood structure and translation equivariance are built into each layer. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. Beyond that, the position embeddings carry no information about the 2D positions of the patches at initialization, so all spatial relations between the patches have to be learned from scratch.