DETR: DEtection TRansformer

Introduction

  • predict all objects at once
  • bipartite matching between predicted and ground-truth objects
  • drops many hand-designed components of earlier detectors, such as spatial anchors and non-maximum suppression

The DETR model

  • a set prediction loss that forces unique matching between predicted and ground truth boxes
  • an architecture that predicts a set of objects and models their relation

Object detection set prediction loss

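In a single pass, DETR infers a fixed-size set of $N$ predictions.

$N$ is a preset, fairly large number, much larger than the number of objects in a typical image.

Let $y$ be the set of ground-truth objects and $\hat{y}$ the set of predictions.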

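Find the permutation $\sigma$ with the lowest matching cost, where $S_N$ is the set of permutations of $N$ elements:

$$\hat{\sigma} = \arg\min_{\sigma \in S_N} \sum_{i=1}^{N} L_{match}(y_i, \hat{y}_{\sigma(i)})$$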

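Since $\hat{y}$ contains more elements than $y$, $y$ is padded with $\emptyset$ (no object) until both sets have the same size.

The Hungarian algorithm solves this bipartite matching efficiently and yields $\hat{\sigma}$.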
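To make the matching step concrete, here is a minimal sketch that finds the lowest-cost permutation by brute force over all permutations. This is deliberately not the Hungarian algorithm, which reaches the same answer in polynomial time (in practice one would likely call something like SciPy's `linear_sum_assignment`). The function name `best_permutation` and the cost values are made up for illustration.

```python
from itertools import permutations

def best_permutation(cost):
    """Return (sigma, total) minimizing sum_i cost[i][sigma[i]].

    cost[i][j] is the matching cost between ground-truth slot i and
    prediction j. Brute force over all N! permutations: fine for a toy N,
    but the Hungarian algorithm is what makes this practical.
    """
    n = len(cost)
    best_sigma, best_total = None, float("inf")
    for sigma in permutations(range(n)):
        total = sum(cost[i][sigma[i]] for i in range(n))
        if total < best_total:
            best_sigma, best_total = sigma, total
    return best_sigma, best_total

# Two real objects plus one padded "no object" slot (row of zeros),
# matched against N = 3 predictions. All values are hypothetical.
cost = [
    [0.9, 0.1, 0.5],  # ground truth 0
    [0.2, 0.8, 0.7],  # ground truth 1
    [0.0, 0.0, 0.0],  # padded slot: costs nothing to match
]
sigma, total = best_permutation(cost)
print(sigma, total)
```

Here ground truth 0 is assigned prediction 1, ground truth 1 is assigned prediction 0, and the leftover prediction goes to the padded slot.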

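$L_{match}$: each ground truth and each prediction can be viewed as a pair $(c, b)$, where $c$ is the class and $b$ is the bounding box, a 4-tuple. With $\hat{p}_{\sigma(i)}(c_i)$ denoting the probability that prediction $\sigma(i)$ assigns to class $c_i$, $L_{match}$ is defined as:

$$L_{match}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \emptyset\}} \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}} L_{box}(b_i, \hat{b}_{\sigma(i)})$$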
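The pairwise matching cost can be sketched in a few lines, following the definition in the DETR paper: minus the predicted probability of the ground-truth class, plus a box loss, both counted only when the ground truth is a real object. As a simplifying assumption, the box loss below is a plain L1 distance, whereas the paper combines L1 with a generalized IoU term. The names `match_cost`, `l1_box_loss` and `NO_OBJECT` are hypothetical.

```python
NO_OBJECT = None  # stands in for the empty class (hypothetical encoding)

def l1_box_loss(b, b_hat):
    """L1 distance between two boxes given as 4-tuples."""
    return sum(abs(u - v) for u, v in zip(b, b_hat))

def match_cost(gt, pred):
    """Simplified L_match between one ground truth and one prediction.

    gt   = (c, b): class index (or NO_OBJECT) and box 4-tuple.
    pred = (p_hat, b_hat): list of class probabilities and box 4-tuple.
    """
    c, b = gt
    p_hat, b_hat = pred
    if c is NO_OBJECT:
        return 0.0  # padded slots contribute nothing to the matching cost
    return -p_hat[c] + l1_box_loss(b, b_hat)

gt = (1, (0.5, 0.5, 0.2, 0.2))
pred = ([0.1, 0.8, 0.1], (0.5, 0.5, 0.25, 0.2))
print(match_cost(gt, pred))  # roughly -0.8 + 0.05
```

Note that the cost uses the raw probability rather than its logarithm, so a confident, well-localized prediction gets a strongly negative cost.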

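The final loss sums over all pairs matched by $\hat{\sigma}$, using the negative log-probability of the ground-truth class together with the box loss:

$$L_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}} L_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$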
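A rough sketch of how the final loss could be evaluated once a matching is known: for every ground-truth slot, add the negative log-probability of its class under the matched prediction, and add a box loss only for real objects. Several things here are assumptions for illustration: the "no object" class is encoded as the last class index, the box loss is reduced to an L1 distance, and the down-weighting of the "no object" term used in the paper is omitted.

```python
import math

NO_OBJECT = 2  # index of the "no object" class in this toy setup (assumption)

def l1_box_loss(b, b_hat):
    """L1 distance between two boxes given as 4-tuples."""
    return sum(abs(u - v) for u, v in zip(b, b_hat))

def hungarian_loss(targets, preds, sigma):
    """Sum over i of -log p_hat_{sigma(i)}(c_i) + 1[c_i is an object] * box loss.

    targets: list of (c, b), already padded with (NO_OBJECT, None).
    preds:   list of (p_hat, b_hat), p_hat being class probabilities.
    sigma:   sigma[i] is the index of the prediction assigned to target i.
    """
    total = 0.0
    for i, (c, b) in enumerate(targets):
        p_hat, b_hat = preds[sigma[i]]
        total += -math.log(p_hat[c])
        if c != NO_OBJECT:
            total += l1_box_loss(b, b_hat)
    return total

targets = [(0, (0.5, 0.5, 0.2, 0.2)), (NO_OBJECT, None)]
preds = [
    ([0.1, 0.1, 0.8], (0.1, 0.1, 0.1, 0.1)),    # mostly "no object"
    ([0.9, 0.05, 0.05], (0.5, 0.5, 0.2, 0.2)),  # class 0 with an exact box
]
loss = hungarian_loss(targets, preds, sigma=(1, 0))
print(loss)
```

Unlike the matching cost, the padded slot is not free here: the prediction matched to it is still penalized if it does not put its probability on "no object".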

Architecture

Backbone

Starting from the input image $x \in \mathbb{R}^{3 \times H_0 \times W_0}$, a CNN backbone extracts a feature map $f \in \mathbb{R}^{C \times H \times W}$, where $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$.
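As a quick check of the shape arithmetic, the sketch below computes the feature-map shape for a given input size. The helper name `backbone_output_shape` is hypothetical, and the sketch assumes the input height and width are exact multiples of 32 so the division leaves no remainder.

```python
def backbone_output_shape(h0, w0, channels=2048, stride=32):
    """Shape (C, H, W) of the feature map for an input of shape (3, h0, w0).

    Assumes h0 and w0 are multiples of the stride, so the division is exact.
    """
    if h0 % stride or w0 % stride:
        raise ValueError("this sketch assumes sizes divisible by the stride")
    return (channels, h0 // stride, w0 // stride)

print(backbone_output_shape(800, 1216))  # (2048, 25, 38)
```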

Transformer encoder

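First, a $1 \times 1$ convolution reduces the channel dimension from $2048$ to a smaller dimension $d$, giving $z_0 \in \mathbb{R}^{d \times H \times W}$.

The spatial dimensions are then flattened into one, giving a $d \times HW$ feature map that is fed to the transformer as a sequence.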
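The two steps above can be illustrated on toy sizes with plain nested lists. A $1 \times 1$ convolution is just the same linear map applied to the channel vector at every pixel, and flattening collapses the $H \times W$ grid into a single axis. The helper names, the toy sizes, and the weight values are all hypothetical; a bias term is left out for brevity.

```python
def conv1x1(f, weight):
    """Apply a 1x1 convolution (no bias) to f of shape C x H x W.

    weight has shape d x C; every pixel's C-vector is mapped to a d-vector
    by the same linear map.
    """
    C, H, W = len(f), len(f[0]), len(f[0][0])
    return [
        [[sum(w_row[c] * f[c][i][j] for c in range(C)) for j in range(W)]
         for i in range(H)]
        for w_row in weight
    ]

def flatten_spatial(z):
    """Collapse d x H x W into d x HW (row-major over the H, W grid)."""
    return [[v for row in channel for v in row] for channel in z]

# Toy sizes: C = 3, H = W = 2, d = 2 (the notes use C = 2048).
f = [
    [[1, 2], [3, 4]],
    [[0, 1], [0, 1]],
    [[2, 2], [2, 2]],
]
weight = [[1, 0, 0], [0, 1, 1]]  # hypothetical d x C projection
z0 = conv1x1(f, weight)
seq = flatten_spatial(z0)
print(seq)  # [[1, 2, 3, 4], [2, 3, 2, 3]]
```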

Each encoder layer consists of a multi-head self-attention module and a feed-forward network (FFN).
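For intuition about the attention step, here is a heavily simplified sketch: single-head scaled dot-product self-attention with queries, keys and values all equal to the input tokens. A real encoder layer uses several heads, learned projection matrices, and positional encodings, none of which appear here. The input is treated as a sequence of $HW$ tokens of size $d$, i.e. the transpose of the $d \times HW$ map. Function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with Q = K = V = tokens.

    tokens: list of HW vectors, each of length d.
    Returns a list of the same shape; each output is a weighted average
    of all input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
print(len(attended), len(attended[0]))  # 3 2
```

Every output token mixes information from all positions, which is what lets the encoder reason about the whole image at once.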

Transformer decoder

The decoder transforms $N$ embeddings of size $d$.