Computer Vision Algorithms: Transformer-Based Object Detection (DETR / Deformable DETR / Dynamic DETR / DETR 3D)


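DETR is short for DEtection TRansformer. The method was published at ECCV 2020 in the paper "End-to-End Object Detection with Transformers".

Traditional object detectors are proposal-based, anchor-based, or anchor-free, and they all need at least non-maximum suppression (NMS) to post-process the network output, which involves a fair amount of hyperparameter tuning. DETR instead uses a Transformer Encoder-Decoder structure together with a set prediction loss, which makes it a truly end-to-end object detector. How is the Transformer Encoder-Decoder implemented, and what is the set prediction loss? Both are covered below.

Readers who are not familiar with object detection can refer to the post "Computer Vision Algorithms: Object Detection Network Summary".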

1. DETR

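The DETR architecture is shown in the figure below. Step one: a CNN extracts features from the input image, and the feature map is flattened and fed into the Transformer Encoder-Decoder. Step two: the Transformer Encoder lets the network learn global features. Step three: the Transformer Decoder, together with the Object Queries, learns the objects to detect from those features. Step four: the Object Query outputs are matched to the ground truth by bipartite matching (Set-to-Set Loss), and the classification loss and box regression loss are computed on the matched pairs.

That is the training procedure. Inference differs only in step four: a threshold is applied to the Object Query outputs to produce the final detections, which need no further post-processing.

Below we go through the details of the Transformer Encoder-Decoder and the Set-to-Set Loss together with the code: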

1.1 Transformer Encoder-Decoder

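The structure of the Transformer Encoder-Decoder is shown in the figure below, where the red annotations give the feature size at each step for an input image of size $3\times 800\times 1066$. The forward function of the Transformer Encoder-Decoder is as follows: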

def forward(self, src, mask, query_embed, pos_embed):
    # flatten NxCxHxW to HWxNxC
    bs, c, h, w = src.shape
    src = src.flatten(2).permute(2, 0, 1)
    pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
    query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
    mask = mask.flatten(1)

    tgt = torch.zeros_like(query_embed)
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
    hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                      pos=pos_embed, query_pos=query_embed)
    return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)

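Here:

src is the feature map extracted by the Backbone; it is flattened along the spatial dimensions before entering the Encoder.

pos_embed is the positional encoding, which in DETR takes fixed (non-learned) values; see Section 1.3 below.

query_embed is a learnable positional encoding, i.e. the Object Query mentioned above. In the Decoder it repeatedly performs Cross Attention with the Encoder output features, and in the end each query yields one detection result.

mask exists because DETR accepts input images of different resolutions: the images are zero-padded to a common resolution, and since the padded region carries no information it must not take part in Attention, so the mask of the padded region is passed in as src_key_padding_mask.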
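To illustrate how a key padding mask keeps zero-padded positions out of Attention, here is a minimal pure-Python sketch. The function name masked_softmax and the toy scores are made up for illustration, and the sketch assumes at least one key is not padded; it only mirrors the idea behind the src_key_padding_mask argument used above, not the library implementation.

```python
import math

def masked_softmax(scores, key_padding_mask):
    """Softmax over keys, ignoring positions where the mask is True (padding)."""
    # Padded keys get -inf so that exp(-inf) = 0 and they receive zero weight.
    masked = [(-math.inf if pad else s) for s, pad in zip(scores, key_padding_mask)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# One query attending to four keys; the last two keys lie in the zero-padded region.
scores = [2.0, 1.0, 3.0, 0.5]
mask = [False, False, True, True]
weights = masked_softmax(scores, mask)
print(weights)
```

Even though the third key has the highest raw score, it gets no weight because it belongs to the padded area.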

The rest of the Encoder-Decoder is almost identical to "Attention is All You Need". The Encoder layer structure is shown in the figure below, and its code is as follows:

def forward_post(self,
                 src,
                 src_mask: Optional[Tensor] = None,
                 src_key_padding_mask: Optional[Tensor] = None,
                 pos: Optional[Tensor] = None):
    q = k = self.with_pos_embed(src, pos)
    src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
                          key_padding_mask=src_key_padding_mask)[0]
    src = src + self.dropout1(src2)
    src = self.norm1(src)
    src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
    src = src + self.dropout2(src2)
    src = self.norm2(src)
    return src

The Decoder structure is shown in the figure below, and its code is as follows:

def forward_post(self, tgt, memory,
                 tgt_mask: Optional[Tensor] = None,
                 memory_mask: Optional[Tensor] = None,
                 tgt_key_padding_mask: Optional[Tensor] = None,
                 memory_key_padding_mask: Optional[Tensor] = None,
                 pos: Optional[Tensor] = None,
                 query_pos: Optional[Tensor] = None):
    q = k = self.with_pos_embed(tgt, query_pos)
    tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                          key_padding_mask=tgt_key_padding_mask)[0]
    tgt = tgt + self.dropout1(tgt2)
    tgt = self.norm1(tgt)
    tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                               key=self.with_pos_embed(memory, pos),
                               value=memory, attn_mask=memory_mask,
                               key_padding_mask=memory_key_padding_mask)[0]
    tgt = tgt + self.dropout2(tgt2)
    tgt = self.norm2(tgt)
    tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
    tgt = tgt + self.dropout3(tgt2)
    tgt = self.norm3(tgt)
    return tgt

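One detail worth noting: before Cross Attention, the queries always go through Self Attention first. In the first layer tgt is all zeros, so that Self Attention step carries no query-specific content; from the second layer on, Self-Attention lets each Query learn what the other Queries have already captured.

To summarize, what does the Transformer Encoder-Decoder buy us? I think it is one of the reasons Set-to-Set prediction works. Before DETR there were already papers proposing the Set-to-Set idea, but the Backbones they used were not strong enough, so the results were not very good. The Transformer Encoder-Decoder learns global features: every feature interacts with all other features in the image, so the network knows more clearly where one object is and where another one is. Each object should then correspond to exactly one output, which fits the Set-to-Set assumption better.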

1.2 Set-to-Set Loss

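The Set-to-Set Loss adds a bipartite matching step before the loss is computed, so that each prediction is only compared against the ground truth it is matched to, as in the following formula:

$$\hat{\sigma}=\underset{\sigma \in \mathfrak{S}_{N}}{\arg \min } \sum_{i}^{N} \mathcal{L}_{\operatorname{match}}\left(y_{i}, \hat{y}_{\sigma(i)}\right)$$

where $y_{i}$ is the ground truth, $\hat{y}_{\sigma(i)}$ is the prediction, and $\mathcal{L}_{\operatorname{match}}$ is the pairwise matching cost that the bipartite matching minimizes. Readers unfamiliar with bipartite matching can refer to the introduction in the post "Visual SLAM Summary: SuperPoint / SuperGlue". The difference is that the DETR code calls the linear_sum_assignment function from scipy: given a cost matrix of size $M\times N$, it computes the matching between the $M$ rows and the $N$ columns. In DETR the cost matrix consists of a classification cost based on $\hat{p}_{\sigma(i)}\left(c_{i}\right)$ and a box cost $\mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\sigma(i)}\right)$. The classification cost is the negative of the probability after Softmax, and the box cost is made up of an L1 loss and a Generalized IoU loss, as follows: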

def forward(self, outputs, targets):
    """ Performs the matching

    Params:
        outputs: This is a dict that contains at least these entries:
             "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
             "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

        targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
             "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                       objects in the target) containing the class labels
             "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

    Returns:
        A list of size batch_size, containing tuples of (index_i, index_j) where:
            - index_i is the indices of the selected predictions (in order)
            - index_j is the indices of the corresponding selected targets (in order)
        For each batch element, it holds:
            len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
    """
    bs, num_queries = outputs["pred_logits"].shape[:2]

    # We flatten to compute the cost matrices in a batch
    out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
    out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

    # Also concat the target labels and boxes
    tgt_ids = torch.cat([v["labels"] for v in targets])
    tgt_bbox = torch.cat([v["boxes"] for v in targets])

    # Compute the classification cost. Contrary to the loss, we don't use the NLL,
    # but approximate it in 1 - proba[target class].
    # The 1 is a constant that doesn't change the matching, it can be omitted.
    cost_class = -out_prob[:, tgt_ids]

    # Compute the L1 cost between boxes
    cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

    # Compute the giou cost between boxes
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

    # Final cost matrix
    C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
    C = C.view(bs, num_queries, -1).cpu()

    sizes = [len(v["boxes"]) for v in targets]
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
    return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
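As a toy illustration of what the assignment step computes, the sketch below finds the minimum-cost one-to-one matching by brute force over permutations. This is not how DETR does it (the code above calls scipy's linear_sum_assignment, which solves the same problem efficiently); the helper name brute_force_assignment and the cost values are invented for the example.

```python
from itertools import permutations

def brute_force_assignment(cost):
    """Return (row_indices, col_indices) minimizing the total cost.

    cost[q][t] is the matching cost between prediction q and target t,
    with at least as many predictions (rows) as targets (columns).
    """
    num_queries, num_targets = len(cost), len(cost[0])
    best_rows, best_total = None, float("inf")
    # Try every ordered choice of distinct predictions, one per target.
    for rows in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(rows))
        if total < best_total:
            best_rows, best_total = rows, total
    # Sort the pairs by prediction index for a stable, readable output order.
    pairs = sorted(zip(best_rows, range(num_targets)))
    return [q for q, _ in pairs], [t for _, t in pairs]

# 3 predictions (rows) x 2 ground-truth boxes (columns)
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pred_idx, tgt_idx = brute_force_assignment(cost)
print(pred_idx, tgt_idx)
```

In this toy case one prediction stays unmatched; in DETR such predictions are supervised as "no object".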

After obtaining the matching $\hat{\sigma}$, the final loss is:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y})=\sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)+\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right)\right]$$

Slightly different from the cost used during matching, the classification term here is the standard Cross Entropy loss. The paper explains that the matching cost uses probabilities instead of log-probabilities so that the classification term is commensurable with the box term. As mentioned earlier, bipartite matching only takes place during training; at test time the network output is simply passed through a threshold to obtain the final result.

1.3 Positional Embedding

The Positional Embedding in DETR has fixed values. Its code is shown below; let us go through it briefly:

class PositionEmbeddingSine(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats
        self.temperature = temperature
        self.normalize = normalize
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi
        self.scale = scale

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        mask = tensor_list.mask
        assert mask is not None
        not_mask = ~mask
        y_embed = not_mask.cumsum(1, dtype=torch.float32)
        x_embed = not_mask.cumsum(2, dtype=torch.float32)
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)

        pos_x = x_embed[:, :, :, None] / dim_t
        pos_y = y_embed[:, :, :, None] / dim_t
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
        return pos

To let the network perceive the position of each input, the most direct approach is to assign the value $1$ to the first feature, $2$ to the second, and so on. This is unfriendly to large inputs because the values grow without bound. One remedy is to use a sine function to keep the values between $-1$ and $1$, but a sine function is periodic, so different positions may end up with the same value.

The authors therefore extend the sine function to a $d$-dimensional vector in which different channels have different wavelengths:

$$PE_{(pos, 2i)}=\sin\left(pos / 10000^{2i / d_{\text{model}}}\right)$$

$$PE_{(pos, 2i+1)}=\cos\left(pos / 10000^{2i / d_{\text{model}}}\right)$$

where $i$ indexes the sine/cosine channel pairs. For example, with $d=6$:

$$i=[0,1,2]$$

$$w_i=\left[\frac{1}{10000^{0 / 6}}, \frac{1}{10000^{2 / 6}}, \frac{1}{10000^{4 / 6}}\right]$$

When $Position=2$, we get:

$$PositionEncoding=\left[\sin \left(2 w_{0}\right), \cos \left(2 w_{0}\right), \sin \left(2 w_{1}\right), \cos \left(2 w_{1}\right), \sin \left(2 w_{2}\right), \cos \left(2 w_{2}\right)\right]$$

A multi-dimensional vector built this way is very unlikely to coincide at two different positions, which achieves the goal of encoding different positions.
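The sine/cosine formulas can be checked with a few lines of plain Python. This is a minimal 1D sketch, not the DETR implementation (which works on 2D feature maps with tensors); the function name sine_position_encoding is hypothetical, and the frequency rule follows the dim_t line of PositionEmbeddingSine, where channels 2i and 2i+1 share one frequency.

```python
import math

def sine_position_encoding(pos, d_model, temperature=10000.0):
    """Return the d_model-dimensional sinusoidal encoding of a scalar position."""
    encoding = []
    for channel in range(d_model):
        pair = channel // 2  # channels 2i and 2i+1 share one frequency
        w = 1.0 / temperature ** (2 * pair / d_model)
        # Even channels use sine, odd channels use cosine.
        value = math.sin(pos * w) if channel % 2 == 0 else math.cos(pos * w)
        encoding.append(value)
    return encoding

pe2 = sine_position_encoding(pos=2, d_model=6)
pe3 = sine_position_encoding(pos=3, d_model=6)
print([round(v, 4) for v in pe2])
print([round(v, 4) for v in pe3])
```

Every value stays within [-1, 1], and the two positions produce different vectors because the channels oscillate at different wavelengths.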

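With its simple structure, DETR quickly attracted attention after it was proposed. It also has a number of problems: training converges slowly, accuracy falls short of the state of the art, and small objects are detected poorly. Many DETR-related algorithms followed soon after, such as Deformable DETR and Anchor DETR, as well as DETR3D for autonomous driving. Here I briefly summarize some of them.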

1.4 Query Embedding

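We know that DETR obtains detection results from the features through Queries, so how are the Queries themselves set up? The source code shows that a Query consists of two parts. One part is query_embed, which represents positional information and is initialized through nn.Embedding: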

self.query_embed = nn.Embedding(num_queries, hidden_dim)

The other part is tgt, which represents content information and is initialized to all zeros. The authors explain in an Issue: "The reason why we zero it is just to avoid passing twice the same vector in the first layer, but we could have avoided that as well and it would work."

tgt = torch.zeros_like(query_embed)

In the Decoder, tgt and query_embed are passed into every Decoder Layer:

def forward(self, tgt, memory,
            tgt_mask: Optional[Tensor] = None,
            memory_mask: Optional[Tensor] = None,
            tgt_key_padding_mask: Optional[Tensor] = None,
            memory_key_padding_mask: Optional[Tensor] = None,
            pos: Optional[Tensor] = None,
            query_pos: Optional[Tensor] = None):
    output = tgt

    intermediate = []

    for layer in self.layers:
        output = layer(output, memory, tgt_mask=tgt_mask,
                       memory_mask=memory_mask,
                       tgt_key_padding_mask=tgt_key_padding_mask,
                       memory_key_padding_mask=memory_key_padding_mask,
                       pos=pos, query_pos=query_pos)
        if self.return_intermediate:
            intermediate.append(self.norm(output))

    if self.norm is not None:
        output = self.norm(output)
        if self.return_intermediate:
            intermediate.pop()
            intermediate.append(output)

    if self.return_intermediate:
        return torch.stack(intermediate)

    return output.unsqueeze(0)

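In every Decoder Layer, tgt and query_embed are added together and used as the Query for Self-Attention and Cross-Attention; see the Decoder Layer code above. The final output therefore contains information from both tgt and query_embed, and the detection head regresses the class and box location directly from output. Readers interested in a more detailed analysis of the Query can read the Conditional DETR and DAB-DETR papers.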

2. Deformable DETR

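Deformable DETR mainly addresses the slow training and the poor small-object performance of the original DETR. DETR converges slowly mainly because it takes a long time for the Attention Map to go from a nearly uniform distribution to a sparse one. Small objects are detected poorly mainly because the Backbone provides no multi-scale features; but even if it did, feeding multi-scale features into the Transformer would be impractical, because the computational complexity of the Transformer is $O(n^2)$ and high-resolution features would incur a huge memory and time cost.

To this end, Deformable DETR proposes the Deformable Attention module, which handles the issues above well.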
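A quick back-of-the-envelope calculation shows why dense attention over multi-scale features is impractical. The sketch below is only arithmetic under stated assumptions: the 800 x 1066 input from Section 1.1, feature strides of 8, 16, 32 and 64, and 4 sampling points per level; the helper num_tokens is hypothetical.

```python
import math

def num_tokens(height, width, stride):
    """Number of feature-map positions at a given downsampling stride."""
    return math.ceil(height / stride) * math.ceil(width / stride)

H, W = 800, 1066            # example input size from Section 1.1
strides = [8, 16, 32, 64]   # assumed multi-scale strides
L, K = len(strides), 4      # feature levels, sampling points per level (assumed)

n_single = num_tokens(H, W, 32)                       # single-scale token count
n_multi = sum(num_tokens(H, W, s) for s in strides)   # multi-scale token count

dense_single = n_single ** 2     # query-key pairs per head, dense attention
dense_multi = n_multi ** 2       # same, if all scales were fed to dense attention
deform_multi = n_multi * L * K   # query-sample pairs per head, deformable attention

print(n_single, n_multi)
print(dense_single, dense_multi, deform_multi)
```

Under these assumptions, dense attention over all scales needs roughly a thousand times more query-key pairs per head than sampling L x K points per query.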

2.1 Deformable Attention Module

The authors first recall the formula for multi-head attention in the original Transformer:

$$\text{MultiHeadAttn}\left(z_{q}, x\right)=\sum_{m=1}^{M} W_{m}\left[\sum_{k \in \Omega_{k}} A_{m q k} \cdot W_{m}^{\prime} x_{k}\right]$$

This is the same in principle as the formula we usually see, only written slightly differently. Here $z_{q}$ and $x$ are the two sets of vectors between which Attention is computed, $V_{m} x$ gives the Key Embedding, $U_{m} z_{q}$ gives the Query Embedding, and $A_{m q k}$ is the weight obtained by normalizing the dot product of the Query Embedding and the Key Embedding, proportional to $\exp \left\{\frac{z_{q}^{T} U_{m}^{T} V_{m} x_{k}}{\sqrt{C_{v}}}\right\}$. $W_{m}^{\prime} x_{k}$ is the Value Embedding, and $W_{m}$ aggregates the concatenated multi-head results. $U_{m}, V_{m}, W_{m}^{\prime}, W_{m}$ are all learned parameters. DETR applies this vanilla multi-head attention in the Self-Attention of the Encoder and the Cross-Attention of the Decoder.

Next the authors introduce the Deformable Attention Module, expressed as:

$$\operatorname{DeformAttn}\left(\boldsymbol{z}_{q}, \boldsymbol{p}_{q}, \boldsymbol{x}\right)=\sum_{m=1}^{M} \boldsymbol{W}_{m}\left[\sum_{k=1}^{K} A_{m q k} \cdot \boldsymbol{W}_{m}^{\prime} \boldsymbol{x}\left(\boldsymbol{p}_{q}+\Delta \boldsymbol{p}_{m q k}\right)\right]$$

where $\Delta p_{m q k}$ is the position offset obtained from the Query Embedding, and $A_{m q k}$ is no longer the weight obtained from the dot product of the Query Embedding and the Key Embedding, but a weight predicted directly from the Query Embedding. The figure below helps to understand this process. The Reference Points are generated directly with torch.meshgrid in the Encoder, and in the Decoder they are produced from the Query by a linear layer.

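The differences from the multi-head attention used in DETR are:

First, the multi-head attention in DETR uses the global features as Keys, whereas Deformable Attention picks $K$ Keys around each Query, chosen by the Query Embedding itself. Second, the multi-head attention in DETR obtains the weights from the inner product of the Key Embedding and the Query Embedding, whereas in Deformable Attention the weights come directly from the Query Embedding through a linear layer.

These two differences are what make Deformable Attention more efficient than the original Attention mechanism. As a side note, Deformable Attention also differs from Deformable Convolution: Deformable Attention predicts multiple offsets directly at the single point where the Query is located, while Deformable Convolution predicts one offset for each pixel inside the convolution kernel.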
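To make the sampling step concrete, here is a minimal single-head, single-level sketch in plain Python. It is a simplification under several assumptions: the feature map is a grid of scalars, coordinates are in pixels rather than normalized, the projections $W_{m}^{\prime}$ and $W_{m}$ are omitted, and the offsets and weight logits are fixed toy numbers instead of outputs of linear layers. All function names are hypothetical.

```python
import math

def bilinear_sample(feature, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at continuous (x, y); zero outside."""
    h, w = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feature[yi][xi]
    return value

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def deform_attn_single_head(feature, ref_point, offsets, weight_logits):
    """Weighted sum of K values sampled around one reference point."""
    weights = softmax(weight_logits)  # normalized over the K sampling points
    out = 0.0
    for (dx, dy), a in zip(offsets, weights):
        out += a * bilinear_sample(feature, ref_point[0] + dx, ref_point[1] + dy)
    return out

feature = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0]]
out = deform_attn_single_head(
    feature,
    ref_point=(1.0, 1.0),                                       # (x, y) in pixels
    offsets=[(0.0, 0.0), (0.5, 0.0), (0.0, -1.0), (1.0, 1.0)],  # K = 4 toy offsets
    weight_logits=[0.0, 0.0, 0.0, 0.0],                         # equal weights
)
print(out)
```

Only K positions are touched per query, regardless of how large the feature map is, which is where the efficiency gain comes from.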

Building on the Deformable Attention Module, the authors further propose the Multi Scale Deformable Attention Module:

$$\operatorname{MSDeformAttn}\left(z_{q}, \hat{\boldsymbol{p}}_{q},\left\{x^{l}\right\}_{l=1}^{L}\right)=\sum_{m=1}^{M} W_{m}\left[\sum_{l=1}^{L} \sum_{k=1}^{K} A_{m l q k} \cdot \boldsymbol{W}_{m}^{\prime} \boldsymbol{x}^{l}\left(\phi_{l}\left(\hat{\boldsymbol{p}}_{q}\right)+\Delta \boldsymbol{p}_{m l q k}\right)\right]$$

Compared with the Deformable Attention Module, the main difference is that the Deformable Attention Module samples $K$ positions from the current level, while the Multi Scale Deformable Attention Module samples $K$ positions from each of $L$ levels, $LK$ sampling positions in total. This lets the network fuse multi-scale features at a small cost.

class MSDeformAttn(nn.Module):
    def __init__(self, d_model=256, n_levels=4, n_heads=8, n_points=4):
        """
        Multi-Scale Deformable Attention Module
        :param d_model      hidden dimension
        :param n_levels     number of feature levels
        :param n_heads      number of attention heads
        :param n_points     number of sampling points per attention head per feature level
        """
        super().__init__()
        if d_model % n_heads != 0:
            raise ValueError('d_model must be divisible by n_heads, but got {} and {}'.format(d_model, n_heads))
        _d_per_head = d_model // n_heads
        # you'd better set _d_per_head to a power of 2 which is more efficient in our CUDA implementation
        if not _is_power_of_2(_d_per_head):
            warnings.warn("You'd better set d_model in MSDeformAttn to make the dimension of each attention head a power of 2 "
                          "which is more efficient in our CUDA implementation.")

        self.im2col_step = 64      # used by the CUDA operator
        self.d_model = d_model     # feature channels = 256
        self.n_levels = n_levels   # number of feature scales = 4
        self.n_heads = n_heads     # attention heads = 8
        self.n_points = n_points   # sampling points = 4

        # Coordinate offsets of the sampling points.
        # Each query samples 4 points per head per level, each point is a 2D coordinate:
        # n_heads * n_levels * n_points * 2 = 8 x 4 x 4 x 2 = 256
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_levels * n_points * 2)
        # Attention weights of all sampling points of a query: n_heads * n_levels * n_points = 8 x 4 x 4 = 128
        self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points)
        # Linear projection of the value
        self.value_proj = nn.Linear(d_model, d_model)
        # Linear projection of the output
        self.output_proj = nn.Linear(d_model, d_model)

        self._reset_parameters()

    def _reset_parameters(self):
        # Initialize the bias of the sampling offsets
        constant_(self.sampling_offsets.weight.data, 0.)
        thetas = torch.arange(self.n_heads, dtype=torch.float32) * (2.0 * math.pi / self.n_heads)
        grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)
        grid_init = (grid_init / grid_init.abs().max(-1, keepdim=True)[0]).view(self.n_heads, 1, 1, 2).repeat(1, self.n_levels, self.n_points, 1)
        # Different sampling points in the same level get different offsets. Geometrically, the offset
        # directions look like a 3x3 square kernel with the center removed; the center is the reference point.
        for i in range(self.n_points):
            grid_init[:, :, i, :] *= i + 1
        with torch.no_grad():
            self.sampling_offsets.bias = nn.Parameter(grid_init.view(-1))
        # Initialize the attention weights
        constant_(self.attention_weights.weight.data, 0.)
        constant_(self.attention_weights.bias.data, 0.)
        xavier_uniform_(self.value_proj.weight.data)
        constant_(self.value_proj.bias.data, 0.)
        xavier_uniform_(self.output_proj.weight.data)
        constant_(self.output_proj.bias.data, 0.)

    def forward(self, query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None):
        """
        :param query                       (N, Length_{query}, C)
        :param reference_points            (N, Length_{query}, n_levels, 2), range in [0, 1], top-left (0,0), bottom-right (1, 1), including padding area
                                        or (N, Length_{query}, n_levels, 4), add additional (w, h) to form reference boxes
        :param input_flatten               (N, \sum_{l=0}^{L-1} H_l \cdot W_l, C)
        :param input_spatial_shapes        (n_levels, 2), [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]
        :param input_level_start_index     (n_levels, ), [0, H_0*W_0, H_0*W_0+H_1*W_1, H_0*W_0+H_1*W_1+H_2*W_2, ..., H_0*W_0+H_1*W_1+...+H_{L-1}*W_{L-1}]
        :param input_padding_mask          (N, \sum_{l=0}^{L-1} H_l \cdot W_l), True for padding elements, False for non-padding elements

        :return output                     (N, Length_{query}, C)
        """
        N, Len_q, _ = query.shape
        N, Len_in, _ = input_flatten.shape
        assert (input_spatial_shapes[:, 0] * input_spatial_shapes[:, 1]).sum() == Len_in

        value = self.value_proj(input_flatten)
        # Fill the masked (padded) positions of the feature map with 0
        if input_padding_mask is not None:
            value = value.masked_fill(input_padding_mask[..., None], float(0))
        # Split value into 8 heads
        value = value.view(N, Len_in, self.n_heads, self.d_model // self.n_heads)
        # Predict the coordinate offsets of the sampling points
        sampling_offsets = self.sampling_offsets(query).view(N, Len_q, self.n_heads, self.n_levels, self.n_points, 2)
        # Predict the attention weights of the sampling points
        attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)
        # Within each head, a query samples 4 points on each of the 4 levels, 16 points in total;
        # the softmax normalizes the attention weights over these 16 points
        attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)
        # N, Len_q, n_heads, n_levels, n_points, 2
        if reference_points.shape[-1] == 2:  # one-stage mode
            offset_normalizer = torch.stack([input_spatial_shapes[..., 1], input_spatial_shapes[..., 0]], -1)
            # reference point + offset / (level width, height) = sampling location
            sampling_locations = reference_points[:, :, None, :, None, :] \
                                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
        elif reference_points.shape[-1] == 4:  # two-stage mode
            # Scale the offset by 1 / n_points and by half the box width and height, then add the
            # box center, so that the sampling locations stay around the reference box
            sampling_locations = reference_points[:, :, None, :, None, :2] \
                                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5
        else:
            raise ValueError(
                'Last dim of reference_points must be 2 or 4, but get {} instead.'.format(reference_points.shape[-1]))
        # Given the sampling locations, the attention weights and all values, pick the values at the
        # sampling locations and compute their attention-weighted sum
        output = MSDeformAttnFunction.apply(
            value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)
        # Linear projection of the output
        output = self.output_proj(output)
        return output
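The bias initialization in _reset_parameters is easier to picture with concrete numbers. The following plain-Python sketch reproduces the same arithmetic for one feature level without tensors; the function name initial_offsets is hypothetical, and the values are in feature-map pixels because the forward pass divides the offsets by the level width and height.

```python
import math

def initial_offsets(n_heads=8, n_points=4):
    """Initial sampling offsets per head and point, as (dx, dy) in feature-map pixels."""
    offsets = []
    for h in range(n_heads):
        theta = h * 2.0 * math.pi / n_heads
        c, s = math.cos(theta), math.sin(theta)
        scale = max(abs(c), abs(s))  # push the direction onto the unit square
        direction = (c / scale, s / scale)
        # The i-th point of a head is placed (i + 1) steps along that head's direction.
        offsets.append([(direction[0] * (i + 1), direction[1] * (i + 1))
                        for i in range(n_points)])
    return offsets

offsets = initial_offsets()
for h, points in enumerate(offsets):
    print(h, [(round(dx, 3), round(dy, 3)) for dx, dy in points])
```

With the default settings, each of the 8 heads starts out looking in one of the 8 neighbor directions, at distances 1 to 4 from the reference point.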

2.2 Deformable Transformer Encoder-Decoder

The structure of the Deformable Transformer Encoder-Decoder is shown in the figure below. The Deformable DETR code is as follows:

class DeformableTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8,
                 num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=1024, dropout=0.1,
                 activation="relu", return_intermediate_dec=False,
                 num_feature_levels=4, dec_n_points=4, enc_n_points=4,
                 two_stage=False, two_stage_num_proposals=300):
        super().__init__()

        self.d_model = d_model  # 256
        self.nhead = nhead  # 8
        self.two_stage = two_stage
        self.two_stage_num_proposals = two_stage_num_proposals  # 300

        encoder_layer = DeformableTransformerEncoderLayer(d_model, dim_feedforward,
                                                          dropout, activation,
                                                          num_feature_levels, nhead, enc_n_points)
        self.encoder = DeformableTransformerEncoder(encoder_layer, num_encoder_layers)

        decoder_layer = DeformableTransformerDecoderLayer(d_model, dim_feedforward,
                                                          dropout, activation,
                                                          num_feature_levels, nhead, dec_n_points)
        self.decoder = DeformableTransformerDecoder(decoder_layer, num_decoder_layers, return_intermediate_dec)

        # Deformable DETR uses multi-scale features: the backbone yields 4 feature maps of different scales.
        # The sine position embedding of DETR only encodes h and w, so within one map every position gets a
        # distinct code, but positions on different feature maps could receive the same code.
        # To solve this, a level_embed is introduced: each level has its own level_embed, which is added
        # to the position embedding of that level.
        self.level_embed = nn.Parameter(torch.Tensor(num_feature_levels, d_model))

        if two_stage:
            self.enc_output = nn.Linear(d_model, d_model)
            self.enc_output_norm = nn.LayerNorm(d_model)
            self.pos_trans = nn.Linear(d_model * 2, d_model * 2)
            self.pos_trans_norm = nn.LayerNorm(d_model * 2)
        else:
            self.reference_points = nn.Linear(d_model, 2)

        self._reset_parameters()

    def forward(self, srcs, masks, pos_embeds, query_embed=None):
        assert self.two_stage or query_embed is not None

        # prepare input for encoder
        src_flatten = []
        mask_flatten = []
        lvl_pos_embed_flatten = []
        spatial_shapes = []
        for lvl, (src, mask, pos_embed) in enumerate(zip(srcs, masks, pos_embeds)):
            bs, c, h, w = src.shape
            spatial_shape = (h, w)  # feature map shape
            spatial_shapes.append(spatial_shape)
            src = src.flatten(2).transpose(1, 2)
            mask = mask.flatten(1)
            # pos_embed: the DETR position encoding, which only distinguishes h, w positions
            pos_embed = pos_embed.flatten(2).transpose(1, 2)
            # All positions of one level get the same level_embed; different levels get different ones
            lvl_pos_embed = pos_embed + self.level_embed[lvl].view(1, 1, -1)
            lvl_pos_embed_flatten.append(lvl_pos_embed)
            src_flatten.append(src)
            mask_flatten.append(mask)
        # Concatenate src, mask and pos_embed across scales
        src_flatten = torch.cat(src_flatten, 1)
        mask_flatten = torch.cat(mask_flatten, 1)
        lvl_pos_embed_flatten = torch.cat(lvl_pos_embed_flatten, 1)
        spatial_shapes = torch.as_tensor(spatial_shapes, dtype=torch.long, device=src_flatten.device)
        # Start index of each scale along the flattened dimension
        level_start_index = torch.cat((spatial_shapes.new_zeros((1, )), spatial_shapes.prod(1).cumsum(0)[:-1]))
        # Ratio of the non-padded side length to the full side length for each scale
        valid_ratios = torch.stack([self.get_valid_ratio(m) for m in masks], 1)

        # encoder
        # Feed the flattened input to the Encoder to learn the similarity between positions and enhance the features
        memory = self.encoder(src_flatten, spatial_shapes, level_start_index, valid_ratios, lvl_pos_embed_flatten, mask_flatten)

        # prepare input for decoder
        bs, _, c = memory.shape
        if self.two_stage:  # two-stage mode
            # Process memory to obtain output_memory
            output_memory, output_proposals = self.gen_encoder_output_proposals(memory, mask_flatten, spatial_shapes)

            # hack implementation for two-stage Deformable DETR
            # Classification head; note that this directly uses the last prediction head of the decoder
            enc_outputs_class = self.decoder.class_embed[self.decoder.num_layers](output_memory)
            # Regression head
            enc_outputs_coord_unact = self.decoder.bbox_embed[self.decoder.num_layers](output_memory) + output_proposals

            topk = self.two_stage_num_proposals
            # top-k is computed from the score of the first class only, so in two-stage mode the prediction heads
            # cannot share parameters, otherwise the second-stage outputs would be biased toward the first class
            topk_proposals = torch.topk(enc_outputs_class[..., 0], topk, dim=1)[1]
            # Predicted boxes at the indices of the top 300 classification scores
            topk_coords_unact = torch.gather(enc_outputs_coord_unact, 1, topk_proposals.unsqueeze(-1).repeat(1, 1, 4))
            topk_coords_unact = topk_coords_unact.detach()  # used as priors, so gradients are cut
            reference_points = topk_coords_unact.sigmoid()  # normalized reference points used to initialize the decoder
            init_reference_out = reference_points
            # Position-encode the top-k proposal boxes
            pos_trans_out = self.pos_trans_norm(self.pos_trans(self.get_proposal_pos_embed(topk_coords_unact)))
            query_embed, tgt = torch.split(pos_trans_out, c, dim=2)
        else:  # one-stage mode
            # Randomly initialized query_embed, nn.Embedding(num_queries, hidden_dim*2)
            query_embed, tgt = torch.split(query_embed, c, dim=1)
            # Initialize query pos
            query_embed = query_embed.unsqueeze(0).expand(bs, -1, -1)
            # Initialize tgt
            tgt = tgt.unsqueeze(0).expand(bs, -1, -1)
            # Reference point centers: query pos through a linear layer, then normalized, [bs, 300, 256] -> [bs, 300, 2]
            reference_points = self.reference_points(query_embed).sigmoid()
            init_reference_out = reference_points  # initial normalized reference points

        # decoder
        hs, inter_references = self.decoder(tgt, reference_points, memory,
                                            spatial_shapes, level_start_index, valid_ratios, query_embed, mask_flatten)

        inter_references_out = inter_references
        if self.two_stage:
            return hs, init_reference_out, inter_references_out, enc_outputs_class, enc_outputs_coord_unact
        return hs, init_reference_out, inter_references_out, None, None

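In the Encoder, the authors replace every Self-Attention Module with a Deformable Attention Module. Each Encoder Layer keeps learning the relation between every position in the feature levels and its 4 sampling points (per head and per level), and the input and output of each Encoder layer are enhanced multi-scale feature maps of the same resolution. The multi-scale feature maps come from the last three stages of ResNet, plus one extra lower-resolution level obtained by a stride-2 convolution on the last stage, as shown in the figure below. In addition, on top of the Positional Embedding, a learnable Scale-Level Embedding tied to the feature level is added. The Encoder code is as follows: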

class DeformableTransformerEncoderLayer(nn.Module):
    def __init__(self,
                 d_model=256, d_ffn=1024,
                 dropout=0.1, activation="relu",
                 n_levels=4, n_heads=8, n_points=4):
        super().__init__()

        # self attention
        self.self_attn = MSDeformAttn(d_model, n_levels, n_heads, n_points)
        self.dropout1 = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(d_model)

        # ffn
        self.linear1 = nn.Linear(d_model, d_ffn)
        self.activation = _get_activation_fn(activation)
        self.dropout2 = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ffn, d_model)
        self.dropout3 = nn.Dropout(dropout)
        self.norm2 = nn.LayerNorm(d_model)

    @staticmethod
    def with_pos_embed(tensor, pos):
        return tensor if pos is None else tensor + pos

    def forward_ffn(self, src):
        src2 = self.linear2(self.dropout2(self.activation(self.linear1(src))))
        src = src + self.dropout3(src2)
        src = self.norm2(src)
        return src

    def forward(self, src, pos, reference_points, spatial_shapes, level_start_index, padding_mask=None):
        # self attention
        src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points, src, spatial_shapes, level_start_index, padding_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        # ffn
        src = self.forward_ffn(src)

        return src

class DeformableTransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers):
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers

    @staticmethod
    def get_reference_points(spatial_shapes, valid_ratios, device):
        # Generate the reference points
        reference_points_list = []
        # Loop over the shapes of the 4 feature maps
        for lvl, (H_, W_) in enumerate(spatial_shapes):
            # ref_y: [100, 150], row 1: 150 values of 0.5, row 2: 150 values of 1.5, ..., row 100: 150 values of 99.5
            # ref_x: [100, 150], row 1: 0.5 1.5 ... 149.5, identical for all 100 rows
            ref_y, ref_x = torch.meshgrid(torch.linspace(0.5, H_ - 0.5, H_, dtype=torch.float32, device=device),
                                          torch.linspace(0.5, W_ - 0.5, W_, dtype=torch.float32, device=device))
            ref_y = ref_y.reshape(-1)[None] / (valid_ratios[:, None, lvl, 1] * H_)
            ref_x = ref_x.reshape(-1)[None] / (valid_ratios[:, None, lvl, 0] * W_)
            # Each entry is (x, y)
            ref = torch.stack((ref_x, ref_y), -1)
            reference_points_list.append(ref)
        # Reference points, [bs, H/8*W/8+H/16*W/16+H/32*W/32+H/64*W/64, 2]
        reference_points = torch.cat(reference_points_list, 1)
        # Replicate 4 times so every feature point has one normalized reference point per level,
        # [bs, H/8*W/8+H/16*W/16+H/32*W/32+H/64*W/64, 4, 2]
        reference_points = reference_points[:, :, None] * valid_ratios[:, None]
        return reference_points

    def forward(self, src, spatial_shapes, level_start_index, valid_ratios, pos=None, padding_mask=None):
        # Multi-scale features (4 flattened feature maps) [bs, H/8 * W/8 + H/16 * W/16 + H/32 * W/32 + H/64 * W/64, 256]
        output = src
        # Normalized reference points of the 4 flattened feature maps; every feature point has 4 reference
        # points (x, y) [bs, H/8 * W/8 + H/16 * W/16 + H/32 * W/32 + H/64 * W/64, 4, 2]
        reference_points = self.get_reference_points(spatial_shapes, valid_ratios, device=src.device)
        for _, layer in enumerate(self.layers):
            output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
        # Features enhanced by 6 encoder layers: each layer keeps learning the relation between every position
        # and its 4 sampling points, and the final output is the enhanced feature map
        return output
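The reference-point grid built by get_reference_points can be reproduced for a tiny feature map in plain Python. This sketch assumes there is no padding (valid ratio equal to 1), in which case the computation reduces to normalized cell centers; the function name encoder_reference_points is hypothetical.

```python
def encoder_reference_points(height, width):
    """Normalized (x, y) centers of every cell of an H x W feature map, in row-major order."""
    points = []
    for row in range(height):
        for col in range(width):
            # Cell centers sit at +0.5; dividing by W and H maps them into (0, 1).
            points.append(((col + 0.5) / width, (row + 0.5) / height))
    return points

ref = encoder_reference_points(height=2, width=3)
print(ref)
```

In the Encoder every feature position is its own query, so its reference point is simply its own normalized location.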

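In the Decoder, the authors keep Self-Attention unchanged and replace Cross-Attention with Deformable Attention. For each Object Query Embedding, a linear layer followed by a Sigmoid produces its reference point, the corresponding Value Embeddings are sampled from the Encoder output features, and they are combined by a weighted sum using weights produced by a linear layer and a Softmax. When I first read this, I had a question: the Object Query Embedding is a preset value, so if both the reference point locations and the weights are inferred from the Object Query Embedding (in DETR they come from interaction with the Encoder features), how can the detected objects be guaranteed to be correct? On reflection, the reference points and weights of the first Decoder Layer may indeed be fixed or random values, but the Object Query Embedding output by the first Decoder Layer already contains information from the Encoder features, so the sampling positions and weights generated in the second Decoder Layer start to depend on the image, and this dependence grows stronger as more Decoder Layers are stacked.

Another difference from DETR is that Deformable DETR does not regress absolute Bounding Box coordinates directly from the Object Query; it regresses offsets relative to the reference point. Since the weights and the reference point locations are both inferred from the same Object Query Embedding, regressing relative to the reference point helps speed up convergence. The Decoder code is as follows: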

class DeformableTransformerDecoderLayer(nn.Module):

def __init__(self, d_model=256, d_ffn=1024,

dropout=0.1, activation="relu",

n_levels=4, n_heads=8, n_points=4):

super().__init__()

# cross attention

# Decoder Layer和原始的DETR区别并不大，主要区别就是使用了MSDeformAttn

self.cross_attn = MSDeformAttn(d_model, n_levels, n_heads, n_points)

self.dropout1 = nn.Dropout(dropout)

self.norm1 = nn.LayerNorm(d_model)

# self attention

self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)

self.dropout2 = nn.Dropout(dropout)

self.norm2 = nn.LayerNorm(d_model)

# ffn

self.linear1 = nn.Linear(d_model, d_ffn)

self.activation = _get_activation_fn(activation)

self.dropout3 = nn.Dropout(dropout)

self.linear2 = nn.Linear(d_ffn, d_model)

self.dropout4 = nn.Dropout(dropout)

self.norm3 = nn.LayerNorm(d_model)

@staticmethod

def with_pos_embed(tensor, pos):

return tensor if pos is None else tensor + pos

def forward_ffn(self, tgt):

tgt2 = self.linear2(self.dropout3(self.activation(self.linear1(tgt))))

tgt = tgt + self.dropout4(tgt2)

tgt = self.norm3(tgt)

return tgt

def forward(self, tgt, query_pos, reference_points, src, src_spatial_shapes, level_start_index, src_padding_mask=None):

# self attention

q = k = self.with_pos_embed(tgt, query_pos)

tgt2 = self.self_attn(q.transpose(0, 1), k.transpose(0, 1), tgt.transpose(0, 1))[0].transpose(0, 1)

tgt = tgt + self.dropout2(tgt2)

tgt = self.norm2(tgt)

# cross attention

tgt2 = self.cross_attn(self.with_pos_embed(tgt, query_pos),

reference_points,

src, src_spatial_shapes, level_start_index, src_padding_mask)

tgt = tgt + self.dropout1(tgt2)

tgt = self.norm1(tgt)

# ffn

tgt = self.forward_ffn(tgt)

return tgt

class DeformableTransformerDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers, return_intermediate=False):
        super().__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.return_intermediate = return_intermediate
        # hack implementation for iterative bounding box refinement and two-stage Deformable DETR
        self.bbox_embed = None
        self.class_embed = None

    def forward(self, tgt, reference_points, src, src_spatial_shapes, src_level_start_index, src_valid_ratios,
                query_pos=None, src_padding_mask=None):
        output = tgt

        intermediate = []                   # decoded outputs of the 6 layers
        intermediate_reference_points = []  # reference points of the 6 layers (refined step by step)
        for lid, layer in enumerate(self.layers):
            if reference_points.shape[-1] == 4:  # two-stage mode
                reference_points_input = reference_points[:, :, None] \
                                         * torch.cat([src_valid_ratios, src_valid_ratios], -1)[:, None]
            else:  # one-stage mode
                # in one-stage mode the reference points are 2-d tensors
                # generated from query pos by a fully connected layer
                assert reference_points.shape[-1] == 2
                reference_points_input = reference_points[:, :, None] * src_valid_ratios[:, None]

            # decoder layer
            output = layer(output, query_pos, reference_points_input, src, src_spatial_shapes, src_level_start_index, src_padding_mask)

            # hack implementation for iterative bounding box refinement
            # With iterative bounding box refinement, self.bbox_embed is not None and the
            # reference points of each layer are corrected using the previous layer's output.
            # Without it, the reference points stay unchanged.
            if self.bbox_embed is not None:
                # Feed this layer's decoded output to its own (non-shared) regression head
                # to get the offsets relative to the reference points, add them to the
                # reference points in inverse-sigmoid space, then apply sigmoid to obtain
                # the corrected, normalized reference points.
                tmp = self.bbox_embed[lid](output)
                if reference_points.shape[-1] == 4:  # two-stage mode
                    new_reference_points = tmp + inverse_sigmoid(reference_points)
                    new_reference_points = new_reference_points.sigmoid()
                else:  # one-stage mode
                    assert reference_points.shape[-1] == 2
                    new_reference_points = tmp
                    new_reference_points[..., :2] = tmp[..., :2] + inverse_sigmoid(reference_points)
                    new_reference_points = new_reference_points.sigmoid()
                reference_points = new_reference_points.detach()

            if self.return_intermediate:
                intermediate.append(output)
                intermediate_reference_points.append(reference_points)

        # By default the outputs of all 6 decoder layers are returned and used together in the loss
        if self.return_intermediate:
            # First item: [6, bs, 300, 256], outputs of the 6 decoder layers.
            # Second item: [6, bs, 300, 2], normalized center coordinates of the reference points
            # of the 6 layers. With iterative bounding box refinement they are updated at every
            # layer and generally differ; otherwise they are identical.
            return torch.stack(intermediate), torch.stack(intermediate_reference_points)

        return output, reference_points
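The broadcasting in the two `reference_points_input` lines is easy to misread, so here is a minimal shape walk-through. It is only a sketch: NumPy stands in for torch, and the sizes (2 images, 300 queries, 4 feature levels) are illustrative values I picked, not something read from a config.

```python
import numpy as np

bs, num_queries, num_levels = 2, 300, 4  # illustrative sizes (assumption)

rng = np.random.default_rng(0)
# Normalized (x, y) reference points, one per query: (bs, num_queries, 2)
reference_points = rng.random((bs, num_queries, 2))
# Fraction of each feature level that is not padding, per image: (bs, num_levels, 2)
src_valid_ratios = rng.uniform(0.5, 1.0, size=(bs, num_levels, 2))

# 2-d case: (bs, nq, 1, 2) * (bs, 1, nl, 2) -> (bs, nq, nl, 2)
reference_points_input = reference_points[:, :, None] * src_valid_ratios[:, None]

# 4-d case: boxes (cx, cy, w, h) are scaled by the ratios repeated twice
reference_boxes = rng.random((bs, num_queries, 4))
ratios4 = np.concatenate([src_valid_ratios, src_valid_ratios], axis=-1)  # (bs, nl, 4)
reference_boxes_input = reference_boxes[:, :, None] * ratios4[:, None]

print(reference_points_input.shape)  # (2, 300, 4, 2)
print(reference_boxes_input.shape)   # (2, 300, 4, 4)
```

So every query ends up with one copy of its reference point per feature level, each rescaled by that level's valid ratio.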

2.3 Additional Improvement

The appendix of the paper also introduces two improvement strategies:

2.3.1 Iterative Bounding Box Refinement

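With this strategy, after every Decoder layer runs, that layer's Output is fed to a non-shared Bounding Box Head, and the Reference Points are corrected using the currently predicted Bounding Box coordinates. The corrected Reference Points are then passed to the next Decoder layer as a prior. In other words, with Iterative Bounding Box Refinement the Reference Points differ from layer to layer, whereas in the original Deformable Decoder they are the same in every layer. Two things to note:

The detection heads of the Decoder layers do not share parameters; the gradient of the corrected Bounding Box is blocked (detached) and does not propagate across layers.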
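The refinement step itself is small: add the predicted offset in logit (inverse-sigmoid) space and squash back to [0, 1]. Below is a minimal NumPy sketch of that update. The `refine` helper and the hand-written deltas are hypothetical stand-ins for the per-layer `bbox_embed[lid](output)`, and the detach is only mentioned in a comment because NumPy has no autograd.

```python
import numpy as np

def inverse_sigmoid(x, eps=1e-5):
    """Logit with clipping, i.e. the inverse of the sigmoid."""
    x = np.clip(x, 0.0, 1.0)
    x1 = np.clip(x, eps, None)
    x2 = np.clip(1.0 - x, eps, None)
    return np.log(x1 / x2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(reference_points, layer_deltas):
    """Apply one predicted delta per decoder layer (hypothetical helper)."""
    history = []
    for delta in layer_deltas:
        # Add the delta in logit space, then map back to normalized coordinates.
        reference_points = sigmoid(inverse_sigmoid(reference_points) + delta)
        # In the real model the result is detached here, so gradients
        # do not flow from one layer's box into the previous layer.
        history.append(reference_points.copy())
    return history

ref = np.array([[0.5, 0.5, 0.2, 0.2]])          # one normalized box (cx, cy, w, h)
deltas = [np.array([[0.4, -0.2, 0.1, 0.0]]),    # stand-in for layer 0's head output
          np.zeros((1, 4))]                     # a zero delta leaves the box unchanged
history = refine(ref, deltas)
```

Working in logit space keeps the refined coordinates inside (0, 1) no matter how large the predicted delta is.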

2.3.2 Two Stage

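This strategy uses a classification head and a detection head, which do not share parameters with the other heads, to make predictions on the features output by the Encoder. This yields the first-stage Proposals, and the Top-K highest-scoring Proposals are selected as the Reference Points of the Decoder. In addition, the Decoder's Object Query and Query Pos are both generated from the Positional Embedding of the Top-K Reference Points through an MLP + LN.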
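The Top-K selection can be sketched in a few lines. This is a simplified NumPy illustration, not the repository's code: `select_topk_proposals` is a hypothetical name, and I assume one objectness score per encoder token.

```python
import numpy as np

def select_topk_proposals(class_logits, boxes, k):
    """Keep the k highest-scoring encoder proposals (hypothetical helper).

    class_logits: (num_tokens,)   one score per encoder token (assumption)
    boxes:        (num_tokens, 4) normalized (cx, cy, w, h) per token
    """
    topk_idx = np.argsort(-class_logits)[:k]  # indices sorted by descending score
    return topk_idx, boxes[topk_idx]

logits = np.array([0.1, 2.0, -1.0, 0.7, 1.5])
boxes = np.arange(20, dtype=float).reshape(5, 4) / 20.0
idx, reference_boxes = select_topk_proposals(logits, boxes, k=3)
print(idx)  # [1 4 3]
```

The selected boxes then play the role of the 4-d `reference_points` consumed by the decoder above.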

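Both strategies essentially improve detection by improving the quality of the Decoder's reference points. The results are as follows: Iterative Bounding Box Refinement brings a gain of 1.6 points, and Two Stage brings a gain of 0.8 points.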

2.4 Conclusion

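There is much more to Deformable DETR than the two points summarized above. The paper also introduces other network structures such as Two-Stage Deformable DETR, which are worth digging into. Here we compare Deformable DETR with DETR.

The first comparison is training speed: Deformable DETR converges noticeably faster.

The table below shows that Deformable DETR also improves detection accuracy on small objects considerably.

DETR attracted attention mainly for its end-to-end network structure, but its performance was arguably not yet SOTA at the time. With Deformable DETR, this family of methods can compete with SOTA methods. The comparison is as follows: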

3. Dynamic DETR

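Dynamic DETR has a similar goal to Deformable DETR: to address low feature resolution and slow training. Its solution is different, and it achieves faster training and better results, as shown in the figure below:

The network structure of Dynamic DETR is shown in the figure below. It consists of two main parts, the Dynamic Encoder and the Dynamic Decoder.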

3.1 Dynamic Encoder

The input to the Dynamic Encoder is the set of Pyramid Features extracted by the Backbone, defined as $P=\left\{P_1, \ldots, P_k\right\}$. The Dynamic Encoder uses Deformable Convolution, SE Attention and Dynamic ReLU. First, Deformable Convolution extracts multi-scale features:

$$P_i^{+}=\left\{\operatorname{Upsample}\left(\operatorname{DeformConv}\left(P_{i-1}, s_i\right)\right),\ \operatorname{DeformConv}\left(P_i, s_i\right),\ \operatorname{Downsample}\left(\operatorname{DeformConv}\left(P_{i+1}, s_i\right)\right)\right\}$$

where the offsets $s_i$ used at the different scales are all computed from the same middle-level feature:

$$s_i=\operatorname{Offset}\left(P_i\right)$$

This avoids the inconsistency of having different offsets at different scales. An SE module then performs the Attention operation:

$$w^{P_i}=\operatorname{SE}\left(P_i^{+}\right)$$

Finally, Dynamic ReLU performs the fusion:

$$\operatorname{DyReLU}\left(x_c\right)=\max \left(a_c^1 x_c+b_c^1,\ a_c^2 x_c+b_c^2\right)$$

$$a_c^1, b_c^1, a_c^2, b_c^2=\operatorname{Delta}\left(x_c\right)$$

So the whole Dynamic Encoder can be written as:

$$\operatorname{MultiScaleSelfAttn}(P)=\underset{i=1 \ldots k}{\operatorname{Concat}}\left(\operatorname{DyReLU}\left(w^{P_i} P_i^{+}\right)\right)$$

According to the paper, the Dynamic Encoder approximates the effect of Attention across multiple scales and channels better than Deformable DETR does.
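To make the DyReLU formula concrete, here is a small NumPy sketch. `dyrelu` implements the channel-wise max of two linear functions exactly as written above. The `delta` hyper-function (global average pooling plus one linear layer with random weights) is a simplification I am assuming for illustration, not the exact design from the paper.

```python
import numpy as np

def dyrelu(x, a1, b1, a2, b2):
    """Channel-wise DyReLU: max(a1*x + b1, a2*x + b2).

    x: (C, H, W); each coefficient: (C,)
    """
    shape = (-1, 1, 1)  # broadcast per-channel coefficients over H and W
    return np.maximum(a1.reshape(shape) * x + b1.reshape(shape),
                      a2.reshape(shape) * x + b2.reshape(shape))

def delta(x, weight, bias):
    """Hypothetical hyper-function: global average pool, then one linear layer."""
    pooled = x.mean(axis=(1, 2))       # (C,)
    coeffs = weight @ pooled + bias    # (4*C,)
    return np.split(coeffs, 4)         # a1, b1, a2, b2, each of shape (C,)

C, H, W = 3, 2, 2
x = np.arange(-6, 6, dtype=float).reshape(C, H, W)

# With a1=1 and b1=a2=b2=0, DyReLU reduces to the ordinary ReLU.
ones, zeros = np.ones(C), np.zeros(C)
relu_like = dyrelu(x, ones, zeros, zeros, zeros)

# Input-dependent coefficients from the (randomly initialized) hyper-function.
rng = np.random.default_rng(0)
a1, b1, a2, b2 = delta(x, rng.normal(size=(4 * C, C)), np.zeros(4 * C))
y = dyrelu(x, a1, b1, a2, b2)
```

The point of the sketch is that the activation's slopes and biases depend on the input itself, which is what makes it "dynamic".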

3.2 Dynamic Decoder

In the Decoder, Dynamic Convolution replaces the Cross Attention Layer. First, the paper uses a Box Encoding $B \in \mathbb{R}^{q \times 4}$ in place of the Position Encoding. The Box Encoding is initialized to the full image size, and ROI Pool then extracts ROI features from the features output by the Encoder:

$$F=\operatorname{RoIPool}\left(P_{\text{enc}}, B, r\right)$$

where $F \in \mathbb{R}^{q \times r \times r \times d}$, $r$ is the size of the ROI features and $d$ is their number of channels. The Query Embedding $Q \in \mathbb{R}^{q \times d}$ is used to query the prediction from the ROI features. The Query Embedding first passes through Self Attention and a Fully Connected Layer:

$$Q^{*}=\operatorname{MultiHeadSelfAttn}(Q, Q, Q)$$

$$W^Q=\operatorname{FC}\left(Q^{*}\right)$$

A $1\times 1$ Convolution then performs the Query operation on the ROI features:

$$Q^F=\operatorname{Conv}_{1 \times 1}\left(F, W^Q\right)$$

After an FFN module, the outputs are the next round's Query Embedding $\hat{Q}$, the box parameters $\hat{B}$ and the corresponding classes $\hat{C}$:

$$\hat{Q}=\operatorname{FFN}\left(Q^F\right)$$

$$\hat{B}=\operatorname{ReLU}(\operatorname{LN}(\operatorname{FC}(\hat{Q})))$$

$$\hat{C}=\operatorname{Softmax}(\operatorname{FC}(\hat{Q}))$$
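The key step, a $1\times 1$ convolution whose weights come from the query, can be sketched with an einsum. This is a NumPy illustration under assumptions of my own: the kernel generated for each query has shape $(d, d)$, the FC layer is a single random matrix, and the sizes are arbitrary.

```python
import numpy as np

q, r, d = 5, 7, 16  # queries, RoI size, channels (illustrative values)
rng = np.random.default_rng(0)

F = rng.normal(size=(q, r, r, d))   # RoI features: one r x r x d patch per query
Q_star = rng.normal(size=(q, d))    # query embeddings after self-attention

# FC layer mapping each query to the weights of its own 1x1 convolution.
# Assumption: each kernel has shape (d, d), so FC outputs d*d values per query.
fc_weight = rng.normal(size=(d, d * d)) / np.sqrt(d)
W_Q = (Q_star @ fc_weight).reshape(q, d, d)

# A 1x1 convolution is a per-pixel linear map over channels, so it can be
# written as an einsum in which every query applies its own kernel to its own RoI.
Q_F = np.einsum("qhwc,qcd->qhwd", F, W_Q)

print(Q_F.shape)  # (5, 7, 7, 16)
```

Unlike ordinary convolution, the kernel here is not a learned constant but a function of the query, so each query filters its RoI differently.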

Below is a comparison of various SOTA methods. Dynamic DETR outperforms DETR and Deformable DETR on all metrics:

4. DETR3D

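DETR3D applies DETR to autonomous driving, performing 3D object detection in the BEV view from multi-camera input, as shown in the figure below. Its principle is very similar to the Deformable Transformer described above. I give a brief summary here and plan to write a separate blog post on Transformers for BEV tasks later.

The network structure is shown in the figure below. The network first uses ResNet and FPN to extract multi-scale features from each camera input, which is the Image Feature Extraction part of the figure and needs no further comment. The multi-scale features are then fed into the 2D to 3D Feature Transformation for further learning. This part is described in detail below.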

4.1 2D to 3D Transformer

For each camera input we extract features at four scales, denoted $\mathcal{F}_{1}, \mathcal{F}_{2}, \mathcal{F}_{3}, \mathcal{F}_{4}$. In the paper's configuration there are six camera inputs in total:

$$\mathcal{F}_{k}=\left\{\boldsymbol{f}_{k 1}, \ldots, \boldsymbol{f}_{k 6}\right\} \subset \mathbb{R}^{H \times W \times C}$$

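DETR3D has no Transformer-based Encoder. The image features extracted above are fed directly into the Decoder. My guess is that the Encoder would simply be too expensive to compute. As in DETR and Deformable DETR, the Decoder performs Cross Attention between the Object Query Embedding and the input Features, and the class and position are then regressed from the final Object Query Embedding. The main difference in DETR3D is that the regressed position lives in 3D space while the input Features come from 2D images, which is why a 2D to 3D Transformer Decoder is needed.

The DETR3D Decoder still consists of Self-Attention and Cross-Attention. Self-Attention is essentially the same as in DETR. Its main role is to let the Queries know what the others are doing, so that they avoid extracting the same features. Cross-Attention differs considerably. The steps are as follows: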

First, a separate network $\Phi^{\text{ref}}$ regresses a 3D position from the Object Query Embedding, which is somewhat similar to what Deformable DETR does:

$$\boldsymbol{c}_{\ell i}=\Phi^{\mathrm{ref}}\left(\boldsymbol{q}_{\ell i}\right)$$

where $\boldsymbol{c}_{\ell i}$ can be regarded as the center of the $i$-th Box. Using the camera parameters, the position $\boldsymbol{c}_{\ell i}$ is projected onto the Features of each camera to obtain the Key Embedding and Value Embedding, which are used to refine the 3D position of the predicted Box:

$$\boldsymbol{c}_{\ell i}^{*}=\boldsymbol{c}_{\ell i} \oplus 1$$

$$\boldsymbol{c}_{\ell m i}=T_{m} \boldsymbol{c}_{\ell i}^{*}$$

where $T_{m}$ denotes the camera parameters of the $m$-th camera. Since each input consists of multi-scale feature maps, bilinear interpolation is used to sample the feature maps so that the different resolutions of the scales do not matter. Coordinates that fall outside the image are padded with zeros:

$$\boldsymbol{f}_{\ell k m i}=f^{\text{bilinear}}\left(\mathcal{F}_{k m}, \boldsymbol{c}_{\ell m i}\right)$$

where $\boldsymbol{f}_{\ell k m i}$ is the feature of the $i$-th point at level $k$ of camera $m$ in layer $\ell$. These features are summed and finally added to the Object Query Embedding for Refinement:

$$\boldsymbol{f}_{\ell i}=\frac{1}{\sum_{k} \sum_{m} \sigma_{\ell k m i}+\epsilon} \sum_{k} \sum_{m} \boldsymbol{f}_{\ell k m i}\, \sigma_{\ell k m i}$$

$$\boldsymbol{q}_{(\ell+1) i}=\boldsymbol{f}_{\ell i}+\boldsymbol{q}_{\ell i}$$

where $\sigma_{\ell k m i}$ is a binary value indicating whether the projected reference point falls outside the image plane (points outside the image contribute nothing to the sum).

This is essentially the weighted-sum step of Cross Attention. In DETR, this step uses the dot product of the Query Embedding and Key Embedding as weights and applies them to the Value Embedding to update the Query. Deformable DETR drops the dot product and regresses the weights directly from the Query Embedding before weighting the Value Embedding. Here, the authors in effect simply add the Value Embedding to the Query Embedding, which reduces the computation further. (This is my conclusion from the paper's description and other blog posts. It may differ from the actual code, so corrections are welcome.)

After iterating the above steps several times, the class and position are finally regressed from the Query Embedding:

$$\hat{\boldsymbol{b}}_{\ell i}=\Phi_{\ell}^{\mathrm{reg}}\left(\boldsymbol{q}_{\ell i}\right)$$

$$\hat{c}_{\ell i}=\Phi_{\ell}^{\mathrm{cls}}\left(\boldsymbol{q}_{\ell i}\right)$$

During training, DETR3D computes the loss on the predictions of every layer, $\hat{\mathcal{B}}_{\ell}=\left\{\hat{\boldsymbol{b}}_{\ell 1}, \ldots, \hat{\boldsymbol{b}}_{\ell j}, \ldots, \hat{\boldsymbol{b}}_{\ell M^{*}}\right\} \subset \mathbb{R}^9$ and $\hat{\mathcal{C}}_{\ell}=\left\{\hat{c}_{\ell 1}, \ldots, \hat{c}_{\ell j}, \ldots, \hat{c}_{\ell M^{*}}\right\} \subset \mathbb{Z}$. At inference time only the output of the last layer is used.
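The project, sample and fuse steps of this Cross-Attention can be sketched as below. Like my reading of the paper above, this is an approximation under stated assumptions rather than the official implementation. NumPy replaces torch, `sample_and_fuse` is a hypothetical helper, each `projections` entry is a 3x4 matrix already scaled to the matching feature map, a pinhole-style perspective divide is assumed, and (camera, level) pairs are flattened into one list.

```python
import numpy as np

def bilinear_sample(feat, u, v):
    """Bilinearly sample feat (H, W, C) at pixel (u, v); the point must lie inside."""
    H, W, _ = feat.shape
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * feat[v0, u0] + du * feat[v0, u1]
    bottom = (1 - du) * feat[v1, u0] + du * feat[v1, u1]
    return (1 - dv) * top + dv * bottom

def sample_and_fuse(ref_points, queries, feats, projections, eps=1e-5):
    """ref_points (N, 3), queries (N, C), feats: list of (H, W, C), projections: list of (3, 4)."""
    fused = np.zeros_like(queries)
    valid_count = np.zeros(len(queries))
    for feat, T in zip(feats, projections):
        H, W, _ = feat.shape
        for i, c in enumerate(ref_points):
            x, y, depth = T @ np.append(c, 1.0)  # c* = [c, 1], then c_m = T_m c*
            if depth <= eps:
                continue                          # behind the camera: sigma = 0
            u, v = x / depth, y / depth           # perspective divide (assumed pinhole model)
            if not (0 <= u <= W - 1 and 0 <= v <= H - 1):
                continue                          # outside the image: sigma = 0
            fused[i] += bilinear_sample(feat, u, v)
            valid_count[i] += 1
    fused /= (valid_count[:, None] + eps)         # divide by (sum of sigma + eps)
    return queries + fused                        # q_{l+1} = f_l + q_l

# Toy setup: one 4x4 single-channel feature map with feat[v, u] = u + 10 * v.
vv, uu = np.mgrid[0:4, 0:4]
feat = (uu + 10 * vv)[..., None].astype(float)
T = np.hstack([np.eye(3), np.zeros((3, 1))])      # toy projection matrix
ref_points = np.array([[3.0, 2.0, 2.0],           # projects to (u, v) = (1.5, 1.0)
                       [0.0, 0.0, -1.0]])         # behind the camera
queries = np.zeros((2, 1))
updated = sample_and_fuse(ref_points, queries, [feat], [T])
```

In the toy run the first query picks up roughly 11.5 (the interpolated value at (1.5, 1.0)), while the second stays at zero because its reference point is not visible in any view.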

4.2 Set-to-Set Loss

DETR3D also uses a Set-to-Set Loss, defined for 3D Bounding Boxes as follows. For the prediction set $\left(\hat{\mathcal{B}}_{\ell}, \hat{\mathcal{C}}_{\ell}\right)$ and the ground-truth set $(\mathcal{B}, \mathcal{C})$, the loss has two parts: a Focal Loss on the class labels and an L1 Loss on the box parameters. The matching is given by:

$$\sigma^{*}=\underset{\sigma \in \mathcal{P}}{\operatorname{argmin}} \sum_{j=1}^{M}-1_{\left\{c_j \neq \varnothing\right\}}\, \hat{p}_{\sigma(j)}\left(c_j\right)+1_{\left\{c_j=\varnothing\right\}}\, \mathcal{L}_{\text{box}}\left(b_j, \hat{b}_{\sigma(j)}\right)$$

where $\mathcal{P}$ is the set of permutations, $\sigma$ is one such permutation, $\hat{p}_{\sigma(j)}\left(c_j\right)$ is the probability that the prediction with index $\sigma(j)$ assigns to class $c_j$, and $\mathcal{L}_{\text{box}}$ is the L1 loss on the box parameters. The formula means finding the permutation that minimizes the sum of the class cost and the box cost. Once the permutation is found, the Set-to-Set loss is computed:

$$\mathcal{L}_{\text{sup}}=\sum_{j=1}^{N}-\log \hat{p}_{\sigma^{*}(j)}\left(c_j\right)+1_{\left\{c_j=\varnothing\right\}}\, \mathcal{L}_{\text{box}}\left(\boldsymbol{b}_j, \hat{\boldsymbol{b}}_{\sigma^{*}(j)}\right)$$

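One thing puzzles me here: the box-parameter loss is restricted to $c_j=\varnothing$, which would mean it is computed only for background. In the original DETR the corresponding indicator is $c_j \neq \varnothing$, so the box loss applies only to non-background targets. Whether the condition here is intended probably needs to be confirmed in the source code.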
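To illustrate what "finding the best permutation" means, here is a tiny brute-force matcher. It is deliberately not the Hungarian algorithm, just an exhaustive search over assignments that is only practical for a handful of boxes. Given the question raised above, I follow the DETR convention and apply both cost terms to real (non-background) ground truths. The `match` helper and the 2-d toy "boxes" are assumptions for illustration; real DETR3D boxes have 9 parameters.

```python
import itertools
import numpy as np

def match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Return sigma, where sigma[j] is the prediction matched to ground truth j.

    pred_probs: (num_pred, num_classes), pred_boxes: (num_pred, D)
    gt_labels:  (num_gt,),               gt_boxes:   (num_gt, D)
    """
    num_pred, num_gt = len(pred_boxes), len(gt_boxes)
    # cost[j, i] = -p_i(c_j) + L1(b_j, b_hat_i)
    cls_cost = -pred_probs[:, gt_labels].T                           # (num_gt, num_pred)
    box_cost = np.abs(gt_boxes[:, None] - pred_boxes[None]).sum(-1)  # (num_gt, num_pred)
    cost = cls_cost + box_cost
    best, best_cost = None, np.inf
    # Exhaustive search over all ways to give each ground truth a distinct prediction.
    for sigma in itertools.permutations(range(num_pred), num_gt):
        total = sum(cost[j, sigma[j]] for j in range(num_gt))
        if total < best_cost:
            best, best_cost = sigma, total
    return best, best_cost

pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
pred_boxes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
gt_labels = np.array([1, 0])
gt_boxes = np.array([[1.0, 1.1], [0.1, 0.0]])

best, best_cost = match(pred_probs, pred_boxes, gt_labels, gt_boxes)
print(best)  # (1, 0)
```

Prediction 2 is left unmatched, which in the set-to-set formulation means it is supervised as background.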

That is all for this summary of DETR-related work for now. I will add more when I have time. Questions and discussion are welcome~
