Vision Transformer

When pre-trained on large amounts of data, Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Introduction

Split an image into patches.

Embed the content of each patch directly and feed the result into a Transformer as a sequence.

When trained on mid-sized datasets, ViT is somewhat worse than a ResNet of comparable size. When trained on large datasets (14M-300M images), ViT matches or exceeds SOTA.

Method

ViT

Reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 C)}$, where $N = HW/P^2$ is the number of patches and $P$ is the patch size (each patch is $P \times P$ pixels).
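As a rough illustration of this reshape, the sketch below splits a tiny image stored as nested Python lists into non-overlapping $P \times P$ patches and flattens each one. The function name `patchify` and the toy 4 × 4 × 1 image are assumptions made for illustration only.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Returns N = H * W / P^2 vectors, each of length P * P * C,
    ordered left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


# Toy example: 4 x 4 image with 1 channel, pixel value = row * 4 + col.
toy_image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
# N = 4 * 4 / 2^2 = 4 patches, each of length 2 * 2 * 1 = 4.
```

By the same formula, a 224 × 224 image with $P = 16$ gives $N = 224 \cdot 224 / 16^2 = 196$ patches.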

The embedding is simply a linear projection from the flattened patch size $P^2 C$ to the latent vector dimension $D$.
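A minimal sketch of that projection, written as a plain matrix product with no bias term. The names `make_projection` and `embed_patches`, the Gaussian initialization with standard deviation 0.02, and the toy sizes are all assumptions for illustration.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create a random (P^2*C) x D weight matrix; the init scale is an assumption."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def embed_patches(patches, weight):
    """Map each flattened patch of length P^2*C to a D-dimensional vector."""
    out_dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(patch)) for j in range(out_dim)]
        for patch in patches
    ]


# Four flattened patches of length P^2*C = 4, projected to D = 3.
example_patches = [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
projection = make_projection(in_dim=4, out_dim=3)
tokens = embed_patches(example_patches, projection)  # shape: N x D = 4 x 3
```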

A position embedding is added (a standard learnable 1D position embedding).
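A small sketch of what "1D" means here: each patch gets one embedding vector indexed by its position in the sequence, and that vector is added element-wise to the patch token. The function name and the toy values are assumptions; in a real model the table would be a trained parameter.

```python
def add_position_embedding(tokens, position_table):
    """Add an N x D position table to N x D patch tokens, element-wise."""
    if len(tokens) != len(position_table):
        raise ValueError("need exactly one position embedding per token")
    return [
        [t + p for t, p in zip(token, position)]
        for token, position in zip(tokens, position_table)
    ]


# Two tokens with D = 2; row i of the table belongs to sequence position i.
patch_tokens = [[1.0, 2.0], [3.0, 4.0]]
position_table = [[0.5, 0.5], [-1.0, 1.0]]
encoder_input = add_position_embedding(patch_tokens, position_table)
# encoder_input == [[1.5, 2.5], [2.0, 5.0]]
```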

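When the input resolution changes (for example, fine-tuning at a higher resolution), the sequence length changes and the pre-trained position embeddings may no longer be meaningful, so perform 2D interpolation of the pre-trained position embeddings.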
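One way to picture the 2D interpolation: lay the $N$ position embeddings back out on their square patch grid, then resize that grid to the new number of patches per side. The sketch below assumes bilinear interpolation with corner alignment and a square grid; the interpolation mode and the function name are assumptions, not details taken from these notes.

```python
def interpolate_position_grid(pos_embed, old_size, new_size):
    """Bilinearly resize old_size^2 embeddings (raster order) to new_size^2."""
    dim = len(pos_embed[0])
    grid = [pos_embed[r * old_size:(r + 1) * old_size] for r in range(old_size)]
    # Corner-aligned mapping from new grid coordinates to old grid coordinates.
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0
    resized = []
    for new_r in range(new_size):
        src_r = new_r * scale
        r0 = min(int(src_r), old_size - 1)
        r1 = min(r0 + 1, old_size - 1)
        wr = src_r - r0
        for new_c in range(new_size):
            src_c = new_c * scale
            c0 = min(int(src_c), old_size - 1)
            c1 = min(c0 + 1, old_size - 1)
            wc = src_c - c0
            resized.append([
                (1 - wr) * ((1 - wc) * grid[r0][c0][d] + wc * grid[r0][c1][d])
                + wr * ((1 - wc) * grid[r1][c0][d] + wc * grid[r1][c1][d])
                for d in range(dim)
            ])
    return resized


# Toy example: a 2 x 2 grid of 1-dimensional embeddings resized to 3 x 3.
old_embed = [[0.0], [1.0], [2.0], [3.0]]
new_embed = interpolate_position_grid(old_embed, old_size=2, new_size=3)
# Flattened values: 0.0, 0.5, 1.0, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0
```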

Hybrid Architecture

Use the feature map produced by a CNN as the input sequence instead of the raw image.
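A sketch of how a feature map could become a sequence, assuming the simplest case where every spatial location is treated as its own 1 × 1 patch. The CNN itself is not modeled; `toy_feature_map` is a hand-written placeholder and the function name is hypothetical.

```python
def feature_map_to_sequence(feature_map):
    """Flatten an H' x W' x C' feature map into H'*W' tokens of length C'."""
    return [list(cell) for row in feature_map for cell in row]


# Placeholder 2 x 2 feature map with C' = 3 channels (stand-in for CNN output).
toy_feature_map = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
sequence = feature_map_to_sequence(toy_feature_map)
# 4 tokens of length 3, in raster order; each would then be projected to D.
```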

Fine-Tuning and Higher Resolution

Pre-train ViT on large datasets and fine-tune on downstream tasks.

Remove the pre-trained prediction head and attach a $D \times K$ feedforward layer, where $K$ is the number of classes in the downstream task.
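The head swap can be sketched with a toy model stored as a dictionary. The dictionary layout, the key names `"backbone"` and `"head"`, and the choice to fill the new head with zeros are assumptions for illustration, not a real library API.

```python
def replace_head(model, num_classes):
    """Return a copy of `model` with a fresh D x K head; the backbone is reused."""
    hidden_dim = len(model["head"])  # number of rows D in the old head matrix
    new_model = dict(model)          # shallow copy: backbone weights are shared
    new_model["head"] = [[0.0] * num_classes for _ in range(hidden_dim)]
    return new_model


# Hypothetical pre-trained model: D = 4, pre-training head with 10 classes.
pretrained = {
    "backbone": {"weights": "..."},  # placeholder for the encoder parameters
    "head": [[0.1] * 10 for _ in range(4)],
}
finetune_model = replace_head(pretrained, num_classes=3)  # K = 3
```

Only the new head has a shape that depends on $K$; everything before it keeps its pre-trained values.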

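  • ViT-Base: 12 layers, hidden size D = 768, MLP size 3072, 12 heads, 86M parameters

  • ViT-Large: 24 layers, hidden size D = 1024, MLP size 4096, 16 heads, 307M parameters

  • ViT-Huge: 32 layers, hidden size D = 1280, MLP size 5120, 16 heads, 632M parameters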
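As a sanity check on these figures, the sketch below estimates the encoder parameter count, reading the five numbers per variant as layers, hidden size, MLP size, heads, and parameters. It assumes a standard Transformer encoder layer (four attention projections with biases, a two-layer MLP, two layer norms) and ignores the patch embedding, position embeddings, and classification head, so it should land slightly under the listed totals.

```python
def encoder_params(layers, hidden, mlp):
    """Approximate parameter count of the Transformer encoder stack only."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V, output projections
    mlp_block = hidden * mlp + mlp + mlp * hidden + hidden  # two dense layers
    layer_norms = 2 * 2 * hidden  # scale and shift for two layer norms
    return layers * (attention + mlp_block + layer_norms)


# (layers, hidden size D, MLP size, listed parameter count)
variants = {
    "ViT-Base": (12, 768, 3072, 86e6),
    "ViT-Large": (24, 1024, 4096, 307e6),
    "ViT-Huge": (32, 1280, 5120, 632e6),
}
estimates = {
    name: encoder_params(layers, hidden, mlp)
    for name, (layers, hidden, mlp, _) in variants.items()
}
# ViT-Base: 12 * 7,087,872 = 85,054,464, roughly 85M against the listed 86M.
```

The number of heads does not enter the count, because splitting the hidden dimension across heads leaves the projection matrices the same size.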