Contents

Abstract

1. Introduction

2. Object detection with R-CNN

2.1. Module design

2.2. Test-time detection

2.3. Training

2.4. Results on PASCAL VOC 2010-12

2.5. Results on ILSVRC2013 detection

3. Visualization, ablation, and modes of error

3.1. Visualizing learned features

3.2. Ablation studies

3.3. Detection error analysis

3.4. Bounding box regression

4. Semantic segmentation

5. Conclusion

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

1. Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [26] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [12], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

[Note] PASCAL VOC: a world-class computer vision challenge.

Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [32] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%.

Fukushima's "neocognitron", a biologically inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on the work of Rumelhart et al., LeCun et al. [23] provided the missing algorithm by showing that stochastic gradient descent, via backpropagation, can train convolutional neural networks (CNNs), a class of models that extend the neocognitron.

CNNs saw heavy use in the 1990s (e.g., [24]), but then fell out of fashion, particularly in computer vision, with the rise of support vector machines. In 2012, Krizhevsky et al. [22] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun's CNN (e.g., max(x, 0) rectifying non-linearities and "dropout" regularization).

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

We answer this question decisively by bridging the chasm between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. Achieving this result required solving two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [31], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [28, 33] and pedestrians [29]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

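[Note] Approaches to localizing objects for detection:

1. Treat localization as a regression problem.

2. Use a sliding-window detector.

Both approaches were rejected.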

Instead, we solve the CNN localization problem by operating within the "recognition using regions" paradigm, as argued for by Gu et al. in [18]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region's shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.

[Note] The localization approach finally adopted is the "recognition using regions" paradigm.

Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features (cf. [32]).

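[Note] Greedy non-maximum suppression: as the name suggests, it suppresses elements that are not maxima. Among candidate boxes of the same class that overlap heavily, the boxes with lower confidence are discarded and only the one with the highest confidence is kept for the rest of the pipeline. Whether an overlap counts as high is decided by the NMS threshold.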

Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [20]. As an immediate consequence of this analysis, we demonstrate that a simple bounding box regression method significantly reduces mislocalizations, which are the dominant error mode.

Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve state-of-the-art results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.

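[Note] Average segmentation accuracy:

First understand the confusion matrix, then the semantic segmentation metrics that are built on it.

Reference: the CSDN blog post "[Semantic segmentation] Evaluation metrics: PA, CPA, MPA, IoU, MIoU, detailed summary and code implementation" by 你是猪猪猪儿虫.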

2. Object detection with R-CNN

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show detection results on PASCAL VOC 2010-12 and on ILSVRC2013.

2.1. Module design

Region proposals. A variety of recent papers offer methods for generating category-independent region proposals. Examples include: objectness [1], selective search [32], category-independent object proposals [11], constrained parametric min-cuts (CPMC) [5], multi-scale combinatorial grouping [3], and Cireşan et al. [6], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work.

Figure 2: Warped training samples from VOC 2007 train.

Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [21] implementation of the CNN described by Krizhevsky et al. [22] (translator's note: AlexNet). Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [21, 22] for more network architecture details.

In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions. The supplementary material discusses alternatives to warping.
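The dilation step above can be made concrete with a little arithmetic. If the warped image is 227 pixels wide and p = 16 pixels on each side are context, the original box must land on the central 227 − 2·16 = 195 pixels, so the box is enlarged by a factor of 227/195 about its center. The sketch below is a minimal illustration under that reading; the function name `dilate_box`, the continuous (x1, y1, x2, y2) coordinate convention, and the absence of clipping to the image border are assumptions, not details given in the text.

```python
def dilate_box(x1, y1, x2, y2, out_size=227, p=16):
    """Expand a tight box so that, after warping to out_size x out_size,
    the original box is surrounded by p pixels of context on each side."""
    # After warping, the original box should span (out_size - 2p) pixels,
    # so the dilated box is larger by this factor along each axis.
    scale = out_size / (out_size - 2 * p)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) / 2.0 * scale
    half_h = (y2 - y1) / 2.0 * scale
    # Hypothetical simplification: no clipping to the image boundary.
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# A 195 x 195 box needs no rescaling, so it gains 16 pixels per side.
box = dilate_box(100.0, 50.0, 295.0, 245.0)
```

For a box that is not 195 pixels wide, the padding in original-image pixels scales with the box, which is why the guarantee is stated at the warped size.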

2.2. Test-time detection

At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search's "fast mode" in all experiments). We warp each proposal and forward propagate it through the CNN in order to read off features from the desired layer. Then, for each class, we score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.

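[Greedy non-maximum suppression] Consider an image in which two cars are covered by five candidate boxes. Pick the candidate box with the highest score and compute the IoU between it and every other box, IoU = |A ∩ B| / |A ∪ B|. Choose a threshold (for example 0.5; the authors note later that this threshold matters) and delete every box whose IoU with the selected box exceeds it. The candidates that overlap heavily with the selected box are thereby removed, and the procedure repeats on the boxes that remain.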
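The procedure described above can be sketched in a few lines of Python. This is an illustrative version for a single class, not the authors' implementation; the (x1, y1, x2, y2) box format, the function names `iou` and `greedy_nms`, and the default threshold of 0.5 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Reject box i if it overlaps an already selected box too much.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


# Two "cars", five candidate boxes (hypothetical numbers).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 1, 10, 11),
         (50, 50, 60, 60), (51, 50, 61, 60)]
scores = [0.9, 0.8, 0.7, 0.6, 0.95]
kept = greedy_nms(boxes, scores)
```

In the detector this would be run once per class on that class's scored regions, since suppression is applied to each class independently.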

Run-time analysis. Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [32], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional).

The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression.

In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000 × 4096 and the SVM weight matrix is 4096 × N, where N is the number of classes.
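To show what batching the dot products means, here is a tiny pure-Python sketch of the matrix-matrix product with toy dimensions (3 regions, 2 feature dimensions, 2 classes) standing in for 2000 × 4096 and 4096 × N. The function name `score_regions` and the per-class bias term are assumptions for illustration; the text only mentions dot products with the SVM weights, and a real system would use an optimized linear algebra routine.

```python
def score_regions(features, weights, biases):
    """features: R x D, weights: D x N, biases: length N -> R x N scores."""
    n_classes = len(biases)
    return [
        [sum(f * weights[d][c] for d, f in enumerate(row)) + biases[c]
         for c in range(n_classes)]
        for row in features
    ]


features = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]   # one row per region
weights = [[1.0, -1.0], [0.5, 2.0]]               # one column per class
biases = [0.0, 1.0]                               # assumed SVM bias terms
scores = score_regions(features, weights, biases)
```

Entry `scores[r][c]` is the score of region `r` under the classifier for class `c`; each column would then go through non-maximum suppression separately.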

This analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134GB of memory just to store 100k linear predictors, compared to just 1.5GB for our lower-dimensional features.
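The two memory figures quoted above can be checked with simple arithmetic: one linear predictor per class, one weight per feature dimension. The sketch assumes 4-byte (32-bit) floats and reads "GB" as 2^30 bytes; neither assumption is stated in the text, but under them the numbers match.

```python
BYTES_PER_FLOAT = 4   # assumption: 32-bit floating point weights
GIB = 2 ** 30         # assumption: "GB" in the text means 2**30 bytes


def predictor_memory_gib(n_classes, feature_dim):
    """Memory needed to store one linear predictor per class."""
    return n_classes * feature_dim * BYTES_PER_FLOAT / GIB


rcnn_gib = predictor_memory_gib(100_000, 4096)     # about 1.5
uva_gib = predictor_memory_gib(100_000, 360_000)   # about 134
```

The ratio of the two is 360,000 / 4096, roughly 88, which is the "two orders of magnitude" gap in feature dimensionality carried over to storage.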

It is also interesting to contrast R-CNN with the recent work from Dean et al. on scalable detection using DPMs and hashing [8]. They report a mAP of around 16% on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made mAP would remain at 59% (Section 3.2).

2.3. Training

Supervised pre-training. We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC 2012) with image-level annotations (i.e., no bounding box labels). Pre-training was performed using the open source Caffe CNN library [21]. In brief, our CNN nearly matches the performance of Krizhevsky et al. [22], obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC 2012 validation set. This discrepancy is due to simplifications in the training process.

Domain-specific fine-tuning. To adapt our CNN to the new task (detection) and the new domain (warped VOC windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals from VOC. Aside from replacing the CNN's ImageNet-specific 1000-way classification layer with a randomly initialized 21-way classification layer (for the 20 VOC classes plus background), the CNN architecture is unchanged. More generally the new layer is (N+1)-way, where N is the number of object classes and the extra output is background: N = 20 for VOC and N = 200 for ILSVRC2013. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box's class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.
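The labeling rule and the 32 + 96 mini-batch composition described above can be sketched as follows. This is a hedged illustration rather than the authors' training code: the names `finetune_label` and `sample_minibatch`, the convention that label 0 means background, assigning a proposal to its best-overlapping ground-truth box, and sampling with replacement are all assumptions made for the example.

```python
import random

BACKGROUND = 0  # assumption: label 0 is background, 1..20 are VOC classes


def finetune_label(overlaps, gt_classes, pos_iou=0.5):
    """overlaps[k] is the IoU between one proposal and ground-truth box k."""
    if not overlaps:
        return BACKGROUND
    # Assumption: a proposal takes the class of its best-overlapping box.
    best = max(range(len(overlaps)), key=lambda k: overlaps[k])
    return gt_classes[best] if overlaps[best] >= pos_iou else BACKGROUND


def sample_minibatch(labels, n_pos=32, n_bg=96, rng=random):
    """Pick proposal indices: n_pos positives (any class) and n_bg background."""
    pos = [i for i, y in enumerate(labels) if y != BACKGROUND]
    bg = [i for i, y in enumerate(labels) if y == BACKGROUND]
    # Sampling with replacement lets a short positive pool still fill its
    # quota (an assumption; the text only says the sampling is uniform).
    batch = rng.choices(pos, k=n_pos) + rng.choices(bg, k=n_bg)
    rng.shuffle(batch)
    return batch


labels = [1] * 5 + [BACKGROUND] * 200   # positives are rare, as in practice
batch = sample_minibatch(labels, rng=random.Random(0))
```

With these toy labels positives are about 2% of the proposals but 25% of every batch, which is the bias towards positive windows that the paragraph describes. Both pools must be non-empty for `rng.choices` to succeed.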

Object category classifiers. Consider training a binary classifier to detect cars. It's clear that an image region tightly enclosing a car should be a positive example. Similarly, it's clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, . . . , 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5, as in [32], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.

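[Note] As with the overlap threshold in greedy non-maximum suppression, the choice of this IoU threshold matters a great deal. The 0.3 here decides which regions count as negatives for SVM training; it is a separate setting from the NMS threshold in Section 2.2.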
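The SVM labeling rule above differs from the fine-tuning rule, and a small sketch makes the difference visible. Only ground-truth boxes are positives and only regions below the 0.3 IoU threshold are negatives. What happens to regions in between is not spelled out in this section; the sketch assumes they are simply left out of training, and the function name `svm_label` is hypothetical.

```python
def svm_label(is_ground_truth, max_iou_with_class, neg_iou=0.3):
    """Label one region for one class's SVM: +1, -1, or None (not used)."""
    if is_ground_truth:
        return 1          # positives are exactly the ground-truth boxes
    if max_iou_with_class < neg_iou:
        return -1         # little or no overlap with the class: negative
    return None           # assumption: partial overlaps are left out


examples = [svm_label(True, 1.0), svm_label(False, 0.1), svm_label(False, 0.45)]
```

Here `max_iou_with_class` stands for the largest IoU between the region and any ground-truth box of the class being trained.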

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [14, 30]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images. (Translator's note: hard negative mining addresses the imbalance between positive and negative examples, and the fact that scattered negatives are otherwise not representative enough.)

In supplementary material we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss why it's necessary to train detection classifiers rather than simply use outputs from the final layer (fc8) of the fine-tuned CNN.

2.4. Results on PASCAL VOC 2010-12

Following the PASCAL VOC best practices [12], we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 3.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding box regression).

[Note] BB: bounding box regression.

Table 1: Detection average precision (%) on VOC 2010 test. R-CNN is most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding box regression (BB) is described in Section 3.4. At publication time, SegDPM was the top-performer on the PASCAL VOC leaderboard. †DPM and SegDPM use context rescoring not used by the other methods.

Table 1 shows complete results on VOC 2010. We compare our method against four strong baselines, including SegDPM [15], which combines DPM detectors with the output of a semantic segmentation system [4] and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al. [32], since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGB-SIFT descriptors, each vector quantized with 4000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster (Section 2.2). Our method achieves similar performance (53.3% mAP) on VOC 2011/12 test.

2.5. Results on ILSVRC2013 detection

We ran R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that we used for PASCAL VOC. We followed the same protocol of submitting test results to the ILSVRC2013 evaluation server only twice, once with and once without bounding-box regression.

Figure 3 compares R-CNN to the entries in the ILSVRC 2013 competition and to the post-competition OverFeat result [34]. R-CNN achieves a mAP of 31.4%, which is significantly ahead of the second-best result of 24.3% from OverFeat. To give a sense of the AP distribution over classes, box plots are also presented and a table of per-class APs follows at the end of the paper in Table 8. Most of the competing submissions (OverFeat, NEC-MU, UvA-Euvision, Toronto A, and UIUC-IFP) used convolutional neural networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes. In Section 4, we give an overview of the ILSVRC2013 detection dataset and provide details about choices that we made when running R-CNN on it.

[Note] Box plot.

Figure 3: (Left) Mean average precision on the ILSVRC2013 detection test set. Methods preceded by * use outside training data (images and labels from the ILSVRC classification dataset in all cases). (Right) Box plots for the 200 average precision values per method. A box plot for the post-competition OverFeat result is not shown because per-class APs are not yet available (per-class APs for R-CNN are in Table 8 and also included in the tech report source uploaded to arXiv.org; see R-CNN-ILSVRC2013-APs.txt). The red line marks the median AP, the box bottom and top are the 25th and 75th percentiles. The whiskers extend to the min and max AP of each method. Each AP is plotted as a green dot over the whiskers (best viewed digitally with zoom).

3. Visualization, ablation, and modes of error

3.1. Visualizing learned features

First-layer filters can be visualized directly and are easy to understand [22]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [36]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit's activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit "speak for itself" by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.

We visualize units from layer pool5, which is the max-pooled output of the network's fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195 × 195 pixels in the original 227 × 227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.
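The 6 × 6 spatial size, the 195-pixel receptive field, and the 32-pixel stride quoted in this document can all be reproduced by walking through the layers. The kernel sizes, strides, and paddings below are not given in the text; they are assumed from the commonly used Krizhevsky et al. / Caffe reference architecture, and normalization layers are omitted because they do not change sizes.

```python
# (name, kernel, stride, padding); assumed from the Krizhevsky et al. / Caffe
# reference architecture, not stated in the text above.
LAYERS = [
    ("conv1", 11, 4, 0), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace(input_size=227):
    """Return (spatial size, receptive field, stride) at the last layer."""
    size, rf, jump = input_size, 1, 1
    for _name, k, s, pad in LAYERS:
        size = (size + 2 * pad - k) // s + 1   # spatial output size
        rf += (k - 1) * jump                   # receptive field in input pixels
        jump *= s                              # effective stride in input pixels
    return size, rf, jump


size, rf, stride = trace()
pool5_dim = size * size * 256   # 256 channels, as stated in the text
```

Under these assumptions the trace ends at a 6 × 6 map whose units each see 195 input pixels and sit 32 input pixels apart, matching the figures in the text.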

Figure 3: Top regions for six pool5 units. Receptive fields and activation values are drawn in white. Some units are aligned to concepts, such as people (row 1) or text (4). Other units capture texture and material properties, such as dot arrays (2) and specular reflections (6).

Table 2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007 trainval. Row 7 includes a simple bounding box regression (BB) stage that reduces localization errors (Section 3.4). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG.

Each row in Figure 3 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized (the supplementary material includes more). These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.

3.2. Ablation studies

Performance layer-by-layer, without fine-tuning. To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN's last three layers. Layer pool5 was briefly described in Section 3.1. The final two layers are summarized below.

Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4096 × 9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)).

Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4096 × 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.

We start by looking at results from the CNN without fine-tuning on PASCAL, i.e. all CNN parameters were pretrained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN’s parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.
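The "about 16.8 million" figure follows directly from the size of the fc7 weight matrix. The short check below reproduces that arithmetic; the implied total parameter count is only a back-of-the-envelope estimate obtained by taking the quoted 29% at face value, not a number stated in the text.

```python
FC_DIM = 4096

fc7_weights = FC_DIM * FC_DIM        # entries in the 4096 x 4096 weight matrix
fc7_params = fc7_weights + FC_DIM    # plus one bias per output unit

# Assumption: use the quoted 29% share to back out a rough total size.
implied_total = fc7_params / 0.29

print(f"fc7 parameters: {fc7_params / 1e6:.1f} million")
print(f"implied network size: about {implied_total / 1e6:.0f} million")
```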

Comparison to recent feature learning methods. Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [17].

The first DPM feature learning method, DPM ST [25], augments HOG features with histograms of “sketch token” probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35 × 35 pixel patches into one of 150 sketch tokens or background.

The second method, DPM HSC [27], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms.

The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2 normalized, and then power transformed (x ← sign(x)|x|^α).
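The post-processing chain for HSC features can be sketched as follows. This is a simplified illustration, not the implementation from [27]: the activations are random stand-ins, spatial pooling is reduced to a single average over one hypothetical cell, and the exponent `alpha = 0.25` is a placeholder value.

```python
import numpy as np

def rectify_three_ways(a):
    """Stack full-wave |a| and both half-waves max(0, a) and max(0, -a)."""
    return np.concatenate([np.abs(a), np.maximum(0, a), np.maximum(0, -a)], axis=-1)

def l2_normalize(x, eps=1e-12):
    """Scale x to unit Euclidean norm."""
    return x / (np.linalg.norm(x) + eps)

def power_transform(x, alpha):
    """Apply x <- sign(x) * |x|^alpha element-wise."""
    return np.sign(x) * np.abs(x) ** alpha

rng = np.random.default_rng(0)
# Stand-in sparse code activations: an 8 x 8 pixel cell with 100 dictionary atoms.
activations = rng.standard_normal((8, 8, 100))

rectified = rectify_three_ways(activations)   # shape (8, 8, 300)
pooled = rectified.mean(axis=(0, 1))          # assumption: one average-pooled cell
normalized = l2_normalize(pooled)
hsc = power_transform(normalized, alpha=0.25) # alpha is a placeholder value
print(hsc.shape)
```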

All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2% vs. 33.7%, a 61% relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by 4 mAP points (when compared internally to their private DPM baselines; both use nonpublic implementations of DPM that underperform the open source version [17]). These methods achieve mAPs of 29.1% and 34.3%, respectively.
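The distinction between the absolute gain (in percentage points) and the relative gain quoted above is easy to verify from the two mAP values in the text:

```python
rcnn_map = 54.2   # R-CNN mAP (%), from the text
dpm_map = 33.7    # HOG-based DPM mAP (%), from the text

absolute_gain = rcnn_map - dpm_map          # percentage points
relative_gain = absolute_gain / dpm_map     # fraction of the DPM baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.0%}")
```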

3.3. Detection error analysis

We applied the excellent detection analysis tool from Hoiem et al. [20] in order to reveal our method’s error modes, understand how fine-tuning changes them, and see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [20] to understand some finer details (such as “normalized AP”). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figure 4 and Figure 5.

Figure 4: Distribution of top-ranked false positive (FP) types. Each plot shows the evolving distribution of FP types as more FPs are considered in order of decreasing score. Each FP is categorized into 1 of 4 types: Loc: poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5, or a duplicate); Sim: confusion with a similar category; Oth: confusion with a dissimilar object category; BG: an FP that fired on background. Compared with DPM (see [20]), significantly more of our errors result from poor localization, rather than confusion with background or other object classes, indicating that the CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image classification. Column three shows how our simple bounding box regression method fixes many localization errors.
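The four false positive types in the caption can be illustrated with a small sketch built on intersection-over-union (IoU). Only the Loc rule (IoU with the correct class between 0.1 and 0.5, or a duplicate) comes from the caption. The 0.1 overlap threshold used here for Sim and Oth, the `similar` lookup table, and the function names are assumptions made for illustration; the tool of [20] should be consulted for the exact rules.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def categorize_fp(det_box, det_class, ground_truth, similar,
                  is_duplicate=False, min_iou=0.1):
    """Label a false positive as Loc, Sim, Oth, or BG.

    ground_truth: list of (box, class) pairs for the image.
    similar: dict mapping a class to the set of classes treated as similar
             (hypothetical grouping supplied by the caller).
    """
    best = {"same": 0.0, "sim": 0.0, "oth": 0.0}
    for box, cls in ground_truth:
        if cls == det_class:
            key = "same"
        elif cls in similar.get(det_class, set()):
            key = "sim"
        else:
            key = "oth"
        best[key] = max(best[key], iou(det_box, box))

    if is_duplicate or min_iou <= best["same"] < 0.5:
        return "Loc"
    if best["sim"] >= min_iou:   # assumed threshold
        return "Sim"
    if best["oth"] >= min_iou:   # assumed threshold
        return "Oth"
    return "BG"

similar = {"cat": {"dog"}}
det = (0, 0, 10, 10)
print(categorize_fp(det, "cat", [((5, 0, 15, 10), "cat")], similar))
print(categorize_fp(det, "cat", [((0, 0, 10, 10), "dog")], similar))
print(categorize_fp(det, "cat", [((100, 100, 110, 110), "cat")], similar))
```

The function assumes its input has already been judged a false positive, so a non-duplicate detection with IoU of 0.5 or more against the correct class never reaches it.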

3.4. Bounding box regression

Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding box regression employed in DPM [14], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in the supplementary material. Results in Table 1, Table 2, and Figure 4 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by 3 to 4 points.
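A linear model that maps proposal features to a refined window might look like the sketch below. This section does not spell out the formulation, so several choices here are assumptions: boxes are written as (center x, center y, width, height), targets are center offsets scaled by the proposal size plus log-scale width and height ratios, and the fit uses ridge regression with a placeholder regularization strength. The features are random stand-ins for pool5 features.

```python
import numpy as np

def regression_targets(proposals, ground_truth):
    """Offsets from proposal P to ground truth G, boxes as (cx, cy, w, h)."""
    px, py, pw, ph = proposals.T
    gx, gy, gw, gh = ground_truth.T
    return np.stack([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)], axis=1)

def fit_ridge(features, targets, lam):
    """Closed-form ridge regression: (X^T X + lam I)^-1 X^T T."""
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(d),
                           features.T @ targets)

def apply_regression(proposals, features, weights):
    """Turn predicted offsets back into refined boxes."""
    dx, dy, dw, dh = (features @ weights).T
    px, py, pw, ph = proposals.T
    return np.stack([pw * dx + px, ph * dy + py,
                     pw * np.exp(dw), ph * np.exp(dh)], axis=1)

# Synthetic check: generate ground truth from a known linear map and recover it.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 16))            # stand-in for pool5 features
true_weights = 0.05 * rng.standard_normal((16, 4))
proposals = np.column_stack([rng.uniform(50, 200, 200), rng.uniform(50, 200, 200),
                             rng.uniform(20, 100, 200), rng.uniform(20, 100, 200)])
ground_truth = apply_regression(proposals, features, true_weights)

weights = fit_ridge(features, regression_targets(proposals, ground_truth), lam=1e-3)
refined = apply_regression(proposals, features, weights)
print(np.abs(refined - ground_truth).max())
```

Predicting scale-normalized offsets rather than raw coordinates keeps the targets comparable across proposals of very different sizes.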

4. Semantic segmentation

Figure 5: Sensitivity to object characteristics. Each plot shows the mean (over classes) normalized AP (see [20]) for the highest and lowest performing subsets within six different object characteristics (occlusion, truncation, bounding box area, aspect ratio, viewpoint, part visibility). We show plots for our method (R-CNN) with and without fine-tuning (FT) and bounding box regression (BB) as well as for DPM voc-release5. Overall, fine-tuning does not reduce sensitivity (the difference between max and min), but does substantially improve both the highest and lowest performing subsets for nearly all characteristics. This indicates that fine-tuning does more than simply improve the lowest performing subsets for aspect ratio and bounding box area, as one might conjecture based on how we warp network inputs. Instead, fine-tuning improves robustness for all characteristics including occlusion, truncation, viewpoint, and part visibility.

Region classification is a standard technique for semantic segmentation, allowing us to easily apply R-CNN to the PASCAL VOC segmentation challenge. To facilitate a direct comparison with the current leading semantic segmentation system (called O2P for “second-order pooling”) [4], we work within their open source framework. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality of each region, for each class, using support vector regression (SVR). The high performance of their approach is due to the quality of the CPMC regions and the powerful second-order pooling of multiple feature types (enriched variants of SIFT and LBP). We also note that Farabet et al. [13] recently demonstrated good results on several dense scene labeling datasets (not including PASCAL) using a CNN as a multi-scale per-pixel classifier.

We follow [2, 4] and extend the PASCAL segmentation training set to include the extra annotations made available by Hariharan et al. [19]. Design decisions and hyperparameters were cross-validated on the VOC 2011 validation set. Final test results were evaluated only once.

CNN features for segmentation. We evaluate three strategies for computing features on CPMC regions, all of which begin by warping the rectangular window around the region to 227 × 227. The first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window, exactly as we did for detection. However, these features ignore the non-rectangular shape of the region. Two regions might have very similar bounding boxes while having very little overlap. Therefore, the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. The third strategy (full+fg) simply concatenates the full and fg features; our experiments validate their complementarity.
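The masking trick behind the fg strategy, and the concatenation used by full+fg, can be sketched as below. The feature extractor is a hypothetical stand-in (`toy_features`, a per-channel average) rather than a real CNN, and the per-channel mean values are placeholders; only the mean-fill logic follows the description in the text.

```python
import numpy as np

def fg_input(window, mask, mean_image):
    """Replace background pixels with the mean input."""
    return np.where(mask[..., None], window, mean_image)

def full_plus_fg(window, mask, mean_image, cnn_features):
    """Concatenate features of the whole window and of the masked foreground."""
    full = cnn_features(window - mean_image)
    fg = cnn_features(fg_input(window, mask, mean_image) - mean_image)
    return np.concatenate([full, fg])

def toy_features(x):
    """Hypothetical stand-in for the CNN: a 3-dim per-channel average."""
    return x.mean(axis=(0, 1))

rng = np.random.default_rng(0)
window = rng.uniform(0, 255, size=(227, 227, 3))   # warped region window
mean_image = np.array([104.0, 117.0, 123.0])       # placeholder per-channel mean
mask = np.zeros((227, 227), dtype=bool)
mask[60:160, 80:180] = True                        # toy foreground region

# Background pixels become exactly zero once the mean is subtracted.
centered_fg = fg_input(window, mask, mean_image) - mean_image
features = full_plus_fg(window, mask, mean_image, toy_features)
print(features.shape)
```

Filling with the mean, rather than with zeros in pixel space, means the network sees a neutral input wherever the region mask is off.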

Results on VOC 2011. Table 3 shows a summary of our results on the VOC 2011 validation set compared with O2P. (See supplementary material for complete per-category results.) Within each feature computation strategy, layer fc6 always outperforms fc7 and the following discussion refers to the fc6 features. The fg strategy slightly outperforms full, indicating that the masked region shape provides a stronger signal, matching our intuition. However, full+fg achieves an average accuracy of 47.9%, our best result by a margin of 4.2% (also modestly outperforming O2P), indicating that the context provided by the full features is highly informative even given the fg features. Notably, training the 20 SVRs on our full+fg features takes an hour on a single core, compared to 10+ hours for training on O2P features.

In Table 4 we present results on the VOC 2011 test set, comparing our best-performing method, fc6 (full+fg), against two strong baselines. Our method achieves the highest segmentation accuracy for 11 out of 21 categories, and the highest overall segmentation accuracy of 47.9%, averaged across categories (but likely ties with the O2P result under any reasonable margin of error). Still better performance could likely be achieved by fine-tuning.

5. Conclusion

In recent years, object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. This paper presents a simple and scalable object detection algorithm that gives a 30% relative improvement over the best previous results on PASCAL VOC 2012.

Table 4: Segmentation accuracy (%) on VOC 2011 test. We compare against two strong baselines: the “Regions and Parts” (R&P) method of [2] and the second-order pooling (O2P) method of [4]. Without any fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2P.

We achieved this performance through two insights. The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data is scarce. We show that it is highly effective to pre-train the network, with supervision, for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection). We conjecture that the “supervised pre-training/domain-specific fine-tuning” paradigm will be highly effective for a variety of data-scarce vision problems.

We conclude by noting that it is significant that we achieved these results by using a combination of classical tools from computer vision and deep learning (bottom-up region proposals and convolutional neural networks). Rather than opposing lines of scientific inquiry, the two are natural and inevitable partners.

Acknowledgments. This research was supported in part by DARPA Mind’s Eye and MSEE programs, by NSF awards IIS-0905647, IIS-1134072, and IIS-1212798, MURI N000014-10-1-0933, and by support from Toyota. The GPUs used in this research were generously donated by the NVIDIA Corporation.
