DETR: DEtection TRansformer

Introduction

a set prediction loss that forces unique matching between predicted and ground truth boxes
an architecture that predicts a set of objects and models their relation

In a single pass, DETR infers a fixed-size set of $N$ predictions.

$N$ is a preset number, chosen to be much larger than the number of objects in an image.

Let $y$ be the set of ground truth objects and $\hat{y}$ the set of predictions.

Find a permutation $\sigma$ with the lowest cost:
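Written out, following the formulation in the DETR paper, the objective is the one below. Here $\mathfrak{S}_N$ denotes the set of permutations of $N$ elements (notation not otherwise used in these notes), and $L_{match}$ is the pairwise matching cost:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} L_{match}(y_i, \hat{y}_{\sigma(i)})$$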

(Since $\hat{y}$ has more elements than $y$, $y$ can be padded with $\emptyset$, meaning "no object", to the same size.)

Bipartite matching with the Hungarian algorithm finds $\hat \sigma$ efficiently.
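To make the matching step concrete, here is a minimal sketch of what is being minimized. It is not the Hungarian algorithm: it brute-forces every permutation, which is only feasible for tiny $N$. The helper names `pad_costs` and `best_permutation` and the toy cost values are made up for illustration, and padding with zero-cost rows assumes that a $\emptyset$ ground truth contributes nothing to the matching cost.

```python
from itertools import permutations


def pad_costs(cost, n):
    """Pad an M x N cost matrix (M ground truths, N predictions, M <= N)
    with zero-cost rows so that it becomes N x N. The extra rows stand in
    for the "no object" entries (assumption: they cost nothing to match)."""
    return [list(row) for row in cost] + [[0.0] * n for _ in range(n - len(cost))]


def best_permutation(cost):
    """Return the permutation sigma minimizing sum_i cost[i][sigma[i]].

    Brute force over all N! permutations; illustrative only."""
    n = len(cost[0])
    square = pad_costs(cost, n)
    return min(
        permutations(range(n)),
        key=lambda sigma: sum(square[i][sigma[i]] for i in range(n)),
    )


# Toy example: 2 ground truth objects, 3 predictions.
cost = [
    [0.9, 0.8, 0.1],  # ground truth 0 vs. predictions 0, 1, 2
    [0.2, 0.7, 0.6],  # ground truth 1 vs. predictions 0, 1, 2
]
sigma = best_permutation(cost)
print(sigma)  # (2, 0, 1): gt 0 -> pred 2, gt 1 -> pred 0, padding -> pred 1
```

In practice a polynomial-time solver is used instead of enumeration; `scipy.optimize.linear_sum_assignment`, for example, accepts a rectangular cost matrix directly, so the explicit padding above would not be needed there.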

$L_{match}$: each ground truth and each prediction can be viewed as a pair $(c, b)$, where $c$ is the class label and $b$ is the bounding box (a 4-tuple). $L_{match}$ is defined as follows:
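In the form given in the DETR paper, the matching cost reads as below. The symbols $\hat{p}_{\sigma(i)}(c_i)$ (the predicted probability of class $c_i$ for prediction $\sigma(i)$) and $L_{box}$ (the bounding-box loss) come from the paper and are not defined elsewhere in these notes:

$$L_{match}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \emptyset\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}}\, L_{box}(b_i, \hat{b}_{\sigma(i)})$$

Both terms vanish when $c_i = \emptyset$, so padded ground truth entries do not influence the matching.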

The final loss is:
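As given in the DETR paper, this is the Hungarian loss, evaluated at the optimal assignment $\hat{\sigma}$. As before, $\hat{p}_{\hat{\sigma}(i)}(c_i)$ is the predicted probability of class $c_i$ and $L_{box}$ is the bounding-box loss, both taken from the paper rather than from these notes:

$$L_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}}\, L_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

Unlike the matching cost, the classification term here uses the log-probability and is also applied to predictions matched to $\emptyset$.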

Starting from the input image $x \in R^{3 \times H_0 \times W_0}$, a CNN backbone extracts features $f \in R^{C \times H \times W}$, where $C = 2048$ and $H, W = \frac {H_0} {32}, \frac {W_0} {32}$.

First, a $1 \times 1$ convolution reduces the channel dimension from $2048$ to $d$, giving $z_0 \in R^{d \times H \times W}$.

The spatial dimensions are then flattened into one, giving a $d \times HW$ feature map that is fed into the transformer.
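The two shape changes above can be sketched in plain Python on nested lists. This is only a toy illustration with tiny dimensions: a $1 \times 1$ convolution is treated as the same linear map applied independently at every pixel (bias omitted), and the function names `conv1x1` and `flatten_spatial` are hypothetical. A real implementation would use a tensor library.

```python
def conv1x1(f, weight):
    """Apply a 1x1 convolution without bias.

    f:      C x H x W feature map as nested lists
    weight: d x C matrix
    Returns a d x H x W feature map: each output channel is a weighted
    sum of the input channels at the same pixel."""
    C, H, W = len(f), len(f[0]), len(f[0][0])
    return [
        [[sum(weight[k][c] * f[c][h][w] for c in range(C)) for w in range(W)]
         for h in range(H)]
        for k in range(len(weight))
    ]


def flatten_spatial(z):
    """Collapse d x H x W into d x (H*W), reading each channel row by row."""
    return [[v for row in channel for v in row] for channel in z]


# Toy sizes: C = 3 input channels, H = W = 2, reduced to d = 2 channels.
f = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
weight = [
    [1, 0, 0],  # output channel 0 copies input channel 0
    [1, 1, 1],  # output channel 1 sums all input channels
]
z0 = conv1x1(f, weight)
seq = flatten_spatial(z0)
print(seq)  # [[1, 2, 3, 4], [15, 18, 21, 24]]
```

Each column of `seq` is the $d$-dimensional feature of one spatial location, so the transformer sees a sequence of $HW$ tokens.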

multi-head self-attention

The decoder transforms $N$ embeddings of size $d$.