马春杰杰博客
致力于深度学习经验分享!

YOLO v4 论文中英对照翻译 | YOLO v4全文翻译

文章目录
[隐藏]

YOLO V4终于有了衣钵传人,发布第一时间就拿来品尝了~下面是全文中英对照翻译,有时间进行一下精修!

未经允许,禁止转载!

2020 – YOLOv4 Optimal Speed and Accuracy of Object Detection

论文下载:https://arxiv.org/pdf/2004.10934.pdf

论文源码:https://github.com/AlexeyAB/darknet

Abstract

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP 50 ) for the MS COCO dataset at a realtime speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.
有大量的功能,据说提高卷积神经网络(CNN)的准确性。需要对大型数据集上此类要素的组合进行实际测试,以及结果的理论论证。某些功能仅针对某些模型,仅针对某些问题,或仅适用于小规模数据集;而某些功能(如批处理规范化和剩余连接)适用于大多数模型、任务和数据集。我们假设此类通用功能包括加权剩余连接 (WRC)、跨阶段-部分连接 (CSP)、交叉小批量规范化 (CmBN)、自对抗训练 (SAT) 和”密斯激活”。我们使用新功能:WRC、CSP、CmBN、SAT、Mish 激活、马赛克数据扩增、CmBN、DropBlock 正化和 CIoU 损耗,并结合其中一些功能实现最先进的结果:43.5% AP (65.7% AP 50),实时速度为特斯拉 V100 上的 MS COCO 数据集 [65 FPS。源代码位于https://github.com/AlexeyAB/darknet。

1 Introduction

The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the real-time object detector accuracy enables using them not only for hint generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPU) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require large number of GPUs for training with a large mini-batch-size. We address such problems through creating a CNN that operates in real-time on a conventional GPU, and for which training requires only one conventional GPU.
大多数基于CNN的目标检测器大多仅适用于推荐系统。例如,通过城市摄像机搜索免费停车位是通过缓慢准确的模型执行的,而汽车碰撞警告则与快速不准确的模型有关。提高实时目标检测器精度,不仅可以将其用于提示生成建议系统,还可用于独立的过程管理和人工输入减少。传统图形处理单元 (GPU) 上的实时目标检测器操作允许以实惠的价格大规模使用。最精确的现代神经网络无法实时运行,并且需要大量的 GPU 进行大型小型批处理大小的训练。我们通过创建一个在传统 GPU 上实时运行的 CNN 来解决这些问题,为此训练只需要一个常规 GPU。

图1:对拟建的YOLOv4和其他最先进的目标检测器进行比较。YOLOv4 的速度比高效德特快两倍,具有同等的性能。YOLOv3 的 AP 和 FPS 分别提高了 10% 和 12%。图1:对拟建的YOLOv4和其他最先进的目标检测器进行比较。YOLOv4 的速度比高效德特快两倍,具有同等的性能。YOLOv3 的 AP 和 FPS 分别提高了 10% 和 12%。

The main goal of this work is designing a fast operating speed of an object detector in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We hope that the designed object can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high quality, and convincing object detection results, as the YOLOv4 results shown in Figure 1. Our contributions are summarized as follows:
这项工作的主要目标是设计生产系统中目标检测器的快速运行速度,优化并行计算,而不是低计算量理论指标 (BFLOP)。我们希望设计的对象能够轻松训练和使用。例如,任何使用传统 GPU 进行训练和测试的人都可以获得实时、高质量和令人信服的目标检测结果,如图 1 所示的 YOLOv4 结果所示。我们的贡献总结如下:

  1. We develope an efficient and powerful object detection model. It makes everyone can use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.
  2. 我们开发一种高效、强大的物体检测模型。它使每个人都可以使用1080 Ti或2080 Ti GPU来训练超快速和准确的目标检测器。
  3. We verify the influence of state-of-the-art Bag-ofFreebies and Bag-of-Specials methods of object detection during the detector training.
  4. 在检测器训练期间,我们验证最先进的免费包和特殊包检测方法的影响。
  5. We modify state-of-the-art methods and make them more effecient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc
  6. 我们修改最先进的方法,使其更高效,适合单次 GPU 训练,包括 CBN [89]、PAN [49]、SAM [85]等

Figure 2: Object detector.Figure 2: Object detector.

2 Related work

2.1 Object detection models

A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including fast R-CNN [18], faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a twostage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detector, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors are developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17].
现代检测器通常由两部分组成,一个是在 ImageNet 上预先训练的骨干网,另一个是用于预测物体的类和边界框的头部。对于在 GPU 平台上运行的检测器,其主干可以是 VGG [68]、ResNet [26]、ResNeXt [86]或密集网 [30]。对于在 CPU 平台上运行的检测器,其主干可以是挤压网 [31]、移动网络 [28、66、27、74]或 ShuffleNet [97, 53]。至于头部部分,通常分为两类,即一级目标检测器和两级目标检测器。最具代表性的两级目标检测器是R-CNN[19]系列,包括快速R-CNN[18],更快的R-CNN[64],R-FCN[9]和天秤座R-CNN[58]。也可以使两级目标检测器成为无锚目标检测器,如 RepPoints [87]。至于一级目标检测器,最具代表性的型号是YOLO[61、62、63]、SSD[50]和视网膜[45]。近年来,研制了无锚式单级目标检测器。此类检测器有 CenterNet [13]、角网 [37、 38]、FCOS [78]等。近年来开发的目标检测器通常插入骨干和头部之间的一些层,这些层通常用于收集不同阶段的要素图。我们可以称它为目标检测器的脖子。通常,颈部由几个自下而上的路径和几个自上而下的路径组成。配备此机制的网络包括特征金字塔网络 (FPN) [44]、路径聚合网络 (PAN) [49]、BiFPN [77]和 NAS-FPN [17]。

In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.
除了上述模型外,一些研究人员还强调直接构建一个新的主干(DetNet [43],DetNAS [7])或新的完整模型(SpineNet [12],HitDetector [20])用于物体检测。

To sum up, an ordinary object detector is composed of several parts:
总之,普通目标检测器由几个部分组成:

2.2 Bag of freebies

Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector receive better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost as “bag of freebies.” What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For examples, photometric distortions and geometric distortions are two commonly used data augmentation method and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.
通常,传统的目标检测器是离线训练的。因此,研究人员总是喜欢利用这一优势,开发更好的训练方法,使目标检测器在不增加推理成本的情况下获得更好的精度。我们将这些方法称为”免费赠品包”,这些方法仅改变训练策略或仅增加训练成本。目标检测方法通常采用并符合免费赠品包的定义的是数据扩充。数据扩增的目的是增加输入图像的可变性,使设计的目标检测模型对从不同环境获得的图像具有更高的鲁棒性。例如,光度失真和几何失真是两种常用的数据扩增方法,它们肯定有利于目标检测任务。在处理光度失真时,我们调整图像的亮度、对比度、色调、饱和度和噪声。对于几何失真,我们添加随机缩放、裁剪、翻转和旋转。

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] can randomly select the rectangle region in an image and fill in a random or complementary value of zero. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangle regions in an image and replace them to all zeros. If similar concepts are applied to feature maps, there are DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed the methods of using multiple images together to perform data augmentation. For example, MixUp [92] uses two images to multiply and superimpose with different coefficient ratios, and then adjusts the label with these superimposed ratios. As for CutMix [91], it is to cover the cropped image to rectangle region of other images, and adjusts the label according to the size of the mix area. In addition to the above mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.
上面提到的数据扩充方法是所有像素调整,并保留调整后区域中的所有原始像素信息。此外,一些从事数据扩增的研究人员强调模拟物体遮挡问题。他们在图像分类和物体检测方面取得了良好的效果。例如,随机擦除 [100] 和 CutOut [11] 可以随机选择图像中的矩形区域并填充零的随机或互补值。至于隐藏和查找 [69] 和网格掩码 [6],它们随机或均匀地选择图像中的多个矩形区域,并将其替换为所有零。如果类似的概念应用于要素地图,则有”退出 “[71]、DropConnect [80]和 DropBlock [16] 方法。此外,一些研究人员提出了将多个图像结合在一起执行数据扩增的方法。例如,MixUp [92] 使用两个图像以不同的系数比率倍增和叠加,然后使用这些叠加比率调整标签。至于 CutMix [91],它是将裁剪的图像覆盖到其他图像的矩形区域,并根据混合区域的大小调整标签。除上述方法外,样式传输GAN[15]还用于数据扩增,这种使用可以有效地减少CNN学到的纹理偏差。

Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is that there is a problem of data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detector. But the example mining method is not applicableto one-stage object detector, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the relationship of the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] is to convert hard label into soft label for training, which can make model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.
与上述各种方法不同,其他一些免费赠品方法专门用于解决数据集中语义分布可能存在偏差的问题。在处理语义分布偏差问题时,一个很重要的问题是,不同类之间的数据不平衡问题,这个问题往往通过硬负例挖掘[72]或两阶段目标检测器中的联机硬示例挖掘[67]来解决。但实例挖掘方法不适用于一级目标检测器,因为这种检测器属于密集预测架构。因此,Lin等人[45]提出了解决不同类别之间存在的数据不平衡问题的重点损耗问题。另一个非常重要的问题是,很难表达不同类别与一热硬表示的关系。在执行标签时,通常使用此表示方案。[73] 中建议的标签平滑是将硬标签转换为软标签进行训练,这将使模型更加坚固。为了获得更好的软标签,Islam等人引入了知识蒸馏的概念来设计标签细化网络。

The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., { x center , y center , w, h } , or the upper left point and the lower right point, i.e., { x top left , y top left , x bottom right , y bottom right } . As for anchor-based method, it is to estimate the corresponding offset, for example { x center offset , y center offset , w offset , h offset } and { x top left offset , y top left offset , x bottom right offset , y bottom right offset } . However, to directly estimate the coordinate values of each point of the BBox is to treat these points as independent variables, but in fact does not consider the integrity of the object itself. In order to make this issue processed better, some researchers recently proposed IoU loss [90], which puts the coverage of predicted BBox area and ground truth BBox area into consideration. The IoU loss computing process will trigger the calculation of the four coordinate points of the BBox by executing IoU with the ground truth, and then connecting the generated results into a whole code. Because IoU is a scale invariant representation, it can solve the problem that when traditional methods calculate the l 1 or l 2 loss of { x, y, w, h } , the loss will increase with the scale. Recently, some researchers have continued to improve IoU loss. For example, GIoU loss [65] is to include the shape and orientation of object in addition to the coverage area. They proposed to find the smallest area BBox that can simultaneously cover the predicted BBox and ground truth BBox, and use this BBox as the denominator to replace the denominator originally used in IoU loss. As for DIoU loss [99], it additionally considers the distance of the center of an object, and CIoU loss [99], on the other hand simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. CIoU can achieve better convergence speed and accuracy on the BBox regression problem.
免费赠品的最后一包是边界框 (BBox) 回归的目标函数。传统的目标检测器通常使用均值方误差 (MSE) 直接对 BBox 的中心点坐标和高度和宽度执行回归,即 {x 居中、y 中心、w、h *或左上点和右下角,即 x 左上、左上、x 右下、右下 *。至于基于锚点的方法,它是估计相应的偏移量,例如 [x 中心偏移,y 中心偏移,w偏移,h偏移量 * 和 + x 左上偏移量,y 左上偏移量,x 右下偏移量,y 右下角偏移 = 。但是,直接估计 BBox 的每个点的坐标值是将这些点视为独立的变量,但实际上不考虑对象本身的完整性。为了更好地处理这一问题,一些研究人员最近提出了IoU损耗[90],将预测的BBox区域和地面真相BBox区域的覆盖范围考虑在内。IoU 损失计算过程将通过执行带有地面真相的 IoU,然后将生成的结果连接到整个代码,触发 BBox 的四个坐标点的计算。由于 IoU 是一个比例不变表示形式,因此它可以解决当传统方法计算 l 1 或 l 2 损失的 x、y、w、h = 时,损耗会随着比例的增加而增加。最近,一些研究人员继续改善IoU损失。例如,GIoU 损耗 [65] 是除覆盖区域外,还包括对象的形状和方向。他们建议找到最小区域 BBox,可以同时覆盖预测的 BBox 和地面真相 BBox,并使用此 BBox 作为分母来替换最初用于 IoU 损耗的分母。至于DIoU损耗[99],它另外考虑物体中心的距离,另一方面,CIoU损耗[99],另一方面同时考虑重叠区域,中心点之间的距离和纵横比。CIoU 可以在 BBox 回归问题上实现更好的收敛速度和准确性。

2.3 Bag of specials

For those plugin modules and post-processing methods that only increase the inference cost by a small amount but can significantly improve the accuracy of object detection, we call them “bag of specials”. Generally speaking, these plugin modules are for enhancing certain attributes in a model, such as enlarging receptive field, introducing attention mechanism, or strengthening feature integration capability, etc., and post-processing is a method for screening model prediction results.
对于那些插件模块和后处理方法,它只增加少量的推理成本,但可以显著提高目标检测的准确性,我们称之为”特殊包”。一般来说,这些插件模块用于增强模型中的某些属性,如扩大接受场、引入关注机制、增强功能集成能力等,后处理是筛选模型预测结果的方法。

Common modules that can be used to enhance receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module was originated from Spatial Pyramid Matching (SPM) [39], and SPMs original method was to split feature map into several d × d equal blocks, where d can be { 1, 2, 3, … } , thus forming spatial pyramid, and then extracting bag-of-word features. SPP integrates SPM into CNN and use max-pooling operation instead of bag-of-word operation. Since the SPP module proposed by He et al. [25] will output one dimensional feature vector, it is infeasible to be applied in Fully Convolutional Network (FCN). Thus in the design of YOLOv3 [63], Redmon and Farhadi improve SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = { 1, 5, 9, 13 } , and stride equals to 1. Under this design, a relatively large k × k maxpooling effectively increase the receptive field of backbone feature. After adding the improved version of SPP module, YOLOv3-608 upgrades AP 50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation. The difference in operation between ASPP [5] module and improved SPP module is mainly from the original k×k kernel size, max-pooling of stride equals to 1 to several 3 × 3 kernel size, dilated ratio equals to k, and stride equals to 1 in dilated convolution operation. RFB module is to use several dilated convolutions of k×k kernel, dilated ratio equals to k, and stride equals to 1 to obtain a more comprehensive spatial coverage than ASPP. RFB [47] only costs 7% extra inference time to increase the AP 50 of SSD on MS COCO by 5.7%.
可用于增强接受性字段的常见模块有 SPP [25]、ASPP [5] 和 RFB [47]。SPP 模块源自空间金字塔匹配 (SPM) [39],SPM 的原始方法是将要素映射拆分为多个 d = d 等块,其中 d 可以是 { 1、2、3、……”,从而形成空间金字塔,然后提取单词袋要素。SPP 将 SPM 集成到 CNN 中,并使用最大池化操作,而不是字袋操作。由于He等人提出的SPP模块[25]将输出一维特征矢量,因此在全卷积网络(FCN)中应用是不可行的。因此,在 YOLOv3 [63] 的设计中,Redmon 和 Farhadi 改进了 SPP 模块与内核大小 k = k 的最大池输出的串联,其中 k = = = 1、5、9、13 = 和步长等于 1。在此设计下,相对较大的 k = k maxpool 可有效增加骨干特征的接受场。添加改进版的 SPP 模块后,YOLOv3-608 在 MS COCO 目标检测任务中将 AP 50 升级 2.7%,但额外计算成本为 0.5%。ASPP [5] 模块与改进的SPP模块之间的操作差主要从原来的k+k内核大小,最大步幅池等于1至多个3× 3内核大小,扩张比等于k,在扩张卷积操作中,步长等于1。RFB 模块是使用 k_k 内核的多个扩张卷积,扩张比等于 k,步长等于 1,以获得比 ASPP 更全面的空间覆盖。RFB [47] 只需花费 7% 的额外推理时间,就将 MS COCO 上的 AP 50 SSD 增加 5.7%。

The attention module that is often used in object detection is mainly divided into channel-wise attention and pointwise attention, and the representatives of these two attention models are Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85], respectively. Although SE module can improve the power of ResNet50 in the ImageNet image classification task 1% top-1 accuracy at the cost of only increasing the computational effort by 2%, but on a GPU usually it will increase the inference time by about 10%, so it is more appropriate to be used in mobile devices. But for SAM, it only needs to pay 0.1% extra calculation and it can improve ResNet50-SE 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the speed of inference on the GPU at all.
目标检测中常用的注意模块主要分为通道关注和点关注,这两个注意模型的代表分别是挤压和激发(SE)[29]和空间注意模块[SAM][85]。虽然SE模块可以提高ResNet50在ImageNet图像分类任务中的功率1%的前1精度,而成本只增加计算工作量2%,但在GPU上通常会增加10%左右,因此在移动设备中使用更合适。但对于 SAM,它只需要额外支付 0.1% 的计算费用,并且它可以提高 ResNet50-SE 在 ImageNet 图像分类任务中 0.5% 的前 1 精度。最重要的是,它根本不影响 GPU 上的推理速度。

In terms of feature integration, the early practice is to use skip connection [51] or hyper-column [22] to integrate low-level physical feature to high-level semantic feature. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramid have been proposed. The modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use SE module to execute channelwise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level reweighting and then adds feature maps of different scales. In BiFPN, the multi-input weighted residual connections is proposed to execute scale-wise level re-weighting, and then add feature maps of different scales.
在功能集成方面,早期做法是使用跳过连接 [51] 或超列 [22] 将低级物理功能集成到高级语义功能。由于FPN等多尺度预测方法已经普及,许多集成不同特征金字塔的轻量级模块被提出来。此类模块包括 SFAM [98]、ASFF [48] 和 BiFPN [77]。SFAM 的主要思想是使用 SE 模块在多比例串联要素映射上执行通道级重新加权。至于ASFF,它使用softmax作为点级重量化,然后添加不同比例的要素贴图。在 BiFPN 中,建议多输入加权残联执行比例级重新加权,然后添加不同比例的特征映射。

In the research of deep learning, some people put their focus on searching for good activation function. A good activation function can make the gradient more efficiently propagated, and at the same time it will not cause too much extra computational cost. In 2010, Nair and Hinton [56] propose ReLU to substantially solve the gradient vanish problem which is frequently encountered in traditional tanh and sigmoid activation function. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], etc., which are also used to solve the gradient vanish problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. As for ReLU6 and hard-Swish, they are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function is proposed to satisfy the goal. One thing to be noted is that both Swish and Mish are continuously differentiable activation function.
在深度学习的研究中,有些人把精力放在寻找良好的激活功能上。良好的激活函数可以使渐变更有效地传播,同时也不会造成过多的额外计算成本。2010年,奈尔和欣顿[56]建议ReLU从根本上解决梯度消失问题,这是常见的在传统的tanh和sigmoid激活功能。随后,提出了LReLU[54],PReLU[24],ReLU6[28],缩放指数线性单位(SELU)[35],Swish [59],硬-斯威什[27],和Mish[55]等,也用于解决梯度消失问题,也提出了。LReLU 和 PReLU 的主要目的是解决当输出小于零时 ReLU 梯度为零的问题。至于ReLU6和硬Swish,它们是专门为量化网络设计的。为了实现神经网络的自规范化,提出了符合目标的SELU激活函数。需要注意的是,Swish 和 Mish 都是连续不同的激活功能。

The post-processing method commonly used in deeplearning-based object detection is NMS, which can be used to filter those BBoxes that badly predict the same object, and only retain the candidate BBoxes with higher response. The way NMS tries to improve is consistent with the method of optimizing an objective function. The original method proposed by NMS does not consider the context information, so Girshick et al. [19] added classification confidence score in R-CNN as a reference, and according to the order of confidence score, greedy NMS was performed in the order of high score to low score. As for soft NMS [1], it considers the problem that the occlusion of an object may cause the degradation of confidence score in greedy NMS with IoU score. The DIoU NMS [99] developers way of thinking is to add the information of the center point distance to the BBox screening process on the basis of soft NMS. It is worth mentioning that, since none of above postprocessing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of an anchor-free method.
在基于深度学习的目标检测中常用的后处理方法是 NMS,它可用于筛选那些预测同一对象的 BBox,并且仅保留响应较高的候选 BBox。NMS 尝试改进的方式与优化目标函数的方法一致。NMS提出的原始方法不考虑上下文信息,因此Girshick等人[19]在R-CNN中添加了分类置信度分数作为参考,并根据置信度分数的顺序,以高分到低分的顺序进行贪婪的NMS。至于软NMS [1],它考虑的问题,对象的遮挡可能会导致在贪婪的NMS与IoU分数的置信度分数下降的问题。DIoU NMS [99] 开发人员的思维方式是在软 NMS 的基础上将中心点距离的信息添加到 BBox 筛选过程中。值得一提的是,由于上述后处理方法没有一种直接是指捕获的图像特征,因此在后续开发无锚方法时不再需要后处理。

3 Methodology

The basic aim is fast operating speed of neural network, in production systems and optimization for parallel computations, rather than the low computation volume theoretical indicator (BFLOP). We present two options of real-time neural networks:
基本目标是加快神经网络的运行速度,在生产系统中优化并行计算,而不是低计算量理论指标(BFLOP)。我们提出了两个实时神经网络选项:

  • For GPU we use a small number of groups (1 – 8) in convolutional layers: CSPResNeXt50 / CSPDarknet53
    对于 GPU,我们在卷积层中使用少量组 (1 – 8): CSPReSNeXt50 / CSPDarknet53
  • For VPU – we use grouped-convolution, but we refrain from using Squeeze-and-excitement (SE) blocks – specifically this includes the following models: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3
    对于 VPU – 我们使用分组卷积,但我们不使用挤压和兴奋 (SE) 块 – 具体来说,这包括以下型号: 高效网络精简版 / MixNet [76] / GhostNet [21] / MobileNetV3

3.1 Selection of architecture

Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number (filter size 2 * filters * channel / groups), and the number of layer outputs (filters). For instance, our numerous studies demonstrate that the CSPResNext50 is considerably better compared to CSPDarknet53 in terms of object classification on the ILSVRC2012 (ImageNet) dataset [10]. However, conversely, the CSPDarknet53 is better compared to CSPResNext50 in terms of detecting objects on the MS COCO dataset [46].
我们的目标是在输入网络分辨率、卷积层数、参数编号(滤波器大小 2 + 滤波器 + 通道/组)和图层输出数(筛选器)之间找到最佳平衡。例如,我们的大量研究表明,在 ILSVRC2012 (ImageNet) 数据集 [10] 上,CSPResNext50比 CSPDarknet53要好得多。但是,相反,在检测 MS COCO 数据集 [46] 上的对象方面,CSPDarknet53 优于 CSPResNext50。

The next objective is to select additional blocks for increasing the receptive field and the best method of parameter aggregation from different backbone levels for different detector levels: e.g. FPN, PAN, ASFF, BiFPN.
下一个目标是选择其他模块,以增加接受场和不同主干级别不同骨干级参数聚合的最佳方法:例如FPN、PAN、ASFF、BiFPN。

A reference model which is optimal for classification is not always optimal for a detector. In contrast to the classifier, the detector requires the following:
最佳分类的参考模型并不总是检测器的最佳模型。与分类器相比,检测器需要以下操作:

  • Higher input network size (resolution) – for detecting multiple small-sized objects
    更高的输入网络大小(分辨率) – 用于检测多个小型对象
  • More layers – for a higher receptive field to cover the increased size of input network
    更多层 – 更高的接受场,以覆盖输入网络增加的大小
  • More parameters – for greater capacity of a model to detect multiple objects of different sizes in a single image
    更多参数 – 提高模型的容量,以检测单个图像中的多个不同大小的对象

Hypothetically speaking, we can assume that a model with a larger receptive field size (with a larger number of convolutional layers 3 × 3) and a larger number of parameters should be selected as the backbone. Table 1 shows the information of CSPResNeXt50, CSPDarknet53, and EfficientNet B3. The CSPResNext50 contains only 16 convolutional layers 3 × 3, a 425 × 425 receptive field and 20.6 M parameters, while CSPDarknet53 contains 29 convolutional layers 3 × 3, a 725 × 725 receptive field and 27.6 M parameters. This theoretical justification, together with our numerous experiments, show that CSPDarknet53 neural network is the optimal model of the two as the backbone for a detector.
假设,我们可以假设一个接收场大小的模型(具有较多的卷积层 3 × 3)和更多的参数应作为主干。表 1 显示了 CSPReSNeXt50、CSPDarknet53 和高效网络 B3 的信息。CSPResNext50 仅包含 16 个卷积层 3 × 3、425 × 425 接收场和 20.6 M 参数,CSPDarknet53 包含 29 个卷积层 3 × 3、725 × 725 接受场和 27.6 M 参数。这一理论论证,加上我们的大量实验,表明CSPDarknet53神经网络是两者作为检测器骨干的最佳模型。

The influence of the receptive field with different sizes is summarized as follows:
不同大小的接受场的影响总结如下:

  • Up to the object size – allows viewing the entire object
    最多到对象大小 – 允许查看整个对象
  • Up to network size – allows viewing the context around the object
    最多网络大小 – 允许查看对象周围的上下文
  • Exceeding the network size – increases the number of connections between the image point and the final activation
    超过网络大小 – 增加图像点和最终激活之间的连接数

We add the SPP block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features and causes almost no reduction of the network operation speed. We use PANet as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLOv3.
我们在 CSPDarknet53 上添加 SPP 块,因为它显著增加了接受面,分离出最重要的上下文功能,并且几乎不降低网络操作速度。我们使用 PANet 作为不同检测器级别的不同骨干级参数聚合方法,而不是 YOLOv3 中使用的 FPN。

Finally, we choose CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head as the architecture of YOLOv4.
最后,我们选择 CSPDarknet53 主干、SPP 附加模块、PANet 路径聚合颈部和 YOLOv3(基于锚)头作为 YOLOv4 的体系结构。

In the future we plan to expand significantly the content of Bag of Freebies (BoF) for the detector, which theoretically can address some problems and increase the detector accuracy, and sequentially check the influence of each feature in an experimental fashion.
今后,我们计划大幅扩展检测器免费赠品包(BoF)的含量,理论上可以解决一些问题,提高检测器的精度,并按顺序以实验方式检查每个功能的影响。

We do not use Cross-GPU Batch Normalization (CGBN or SyncBN) or expensive specialized devices. This allows anyone to reproduce our state-of-the-art outcomes on a conventional graphic processor e.g. GTX 1080Ti or RTX 2080Ti.
我们不使用跨 GPU 批处理规范化 (CGBN 或 SyncBN) 或昂贵的专用设备。这允许任何人在传统的图形处理器上重现我们最先进的结果,例如 GTX 1080Ti 或 RTX 2080Ti。

3.2. Selection of BoF and BoS

For improving the object detection training, a CNN usually uses the following:
为了改进目标检测训练,CNN 通常使用以下内容:

  • Activations: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
    激活:ReLU、漏漏、参数-ReLU、ReLU6、SELU、Swish 或 Mish*
  • Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
    边界框回归损耗:MSE、IoU、GIoU、CIoU、DIoU *
  • Data augmentation: CutOut, MixUp, CutMix
    数据增强:剪切、混合、剪切混合*
  • Regularization method: DropOut, DropPath [36], Spatial DropOut [79], or DropBlock
    正化方法:退出、丢弃路径 [36]、空间退出 [79]或 DropBlock
  • Normalization of the network activations by their mean and variance: Batch Normalization (BN) [32], Cross-GPU Batch Normalization (CGBN or SyncBN) [93], Filter Response Normalization (FRN) [70], or Cross-Iteration Batch Normalization (CBN) [89]
    网络激活的规范化由其平均值和方差:批处理规范化 (BN) [32],跨 GPU 批处理规范化 (CGBN 或 SyncBN) [93],筛选器响应规范化 (FRN) [70],或跨迭代批处理规范化 (CBN) [89]
  • Skip-connections: Residual connections, Weighted residual connections, Multi-input weighted residual connections, or Cross stage partial connections (CSP)
    跳过连接:剩余连接、加权剩余连接、多输入加权剩余连接或交叉阶段部分连接 (CSP)

As for training activation function, since PReLU and SELU are more difficult to train, and ReLU6 is specifically designed for quantization network, we therefore remove the above activation functions from the candidate list. In the method of reqularization, the people who published DropBlock have compared their method with other methods in detail, and their regularization method has won a lot. Therefore, we did not hesitate to choose DropBlock as our regularization method. As for the selection of normalization method, since we focus on a training strategy that uses only one GPU, syncBN is not considered.
至于训练激活功能,由于 PReLU 和 SELU 更难训练,并且 ReLU6 是专门为量化网络设计的,因此我们从候选列表中删除了上述激活功能。在重量化方法中,发布 DropBlock 的人将方法与其他方法进行了详细的比较,其正则化方法赢得了很多。因此,我们毫不犹豫地选择 DropBlock 作为我们的规范化方法。至于规范化方法的选择,由于我们专注于只使用一个 GPU 的训练策略,因此不考虑 syncBN。

3.3. Additional improvements

In order to make the designed detector more suitable for training on single GPU, we made additional design and improvement as follows:
为了使设计的检测器更适合于单 GPU 的训练,我们进行了如下其他设计和改进:

  • We introduce a new method of data augmentation Mosaic, and Self-Adversarial Training (SAT)
    我们介绍了一种新的数据扩增马赛克和自我对抗训练(SAT)的方法
  • We select optimal hyper-parameters while applying genetic algorithms
    在应用遗传算法时,我们选择最佳的超参数

We modify some exsiting methods to make our design suitble for efficient training and detection – modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)
我们修改了一些外在方法,使我们的设计适合高效的训练和检测 – 修改的 SAM、修改的 PAN 和交叉微型批处理规范化 (CmBN)

Mosaic represents a new data augmentation method that mixes 4 training images. Thus 4 different contexts are mixed, while CutMix mixes only 2 input images. This allows detection of objects outside their normal context. In addition, batch normalization calculates activation statistics from 4 different images on each layer. This significantly reduces the need for a large mini-batch size.
马赛克是一种新的数据扩增方法,它混合了4个训练图像。因此,4个不同的上下文混合,而CutMix只混合2个输入图像。这允许检测其正常上下文之外的对象。此外,批处理规范化计算每个层上 4 个不同图像的激活统计信息。这大大减少了对大型小型批次尺寸的需求。

Figure 3: Mosaic represents a new method of data augmentation.Figure 3: Mosaic represents a new method of data augmentation.

Self-Adversarial Training (SAT) also represents a new data augmentation technique that operates in 2 forward backward stages. In the 1st stage the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that there is no desired object on the image. In the 2nd stage, the neural network is trained to detect an object on this modified image in the normal way.
自我对抗训练 (SAT) 还代表一种新的数据扩增技术,可在 2 个向前向后阶段运行。在第一阶段,神经网络更改原始图像而不是网络权重。通过这种方式,神经网络对自身执行对抗攻击,更改原始图像以创建映像上没有所需对象的欺骗。在第二阶段,神经网络被训练以正常方式检测此修改图像上的对象。

CmBN represents a CBN modified version, as shown in Figure 4, defined as Cross mini-Batch Normalization (CmBN). This collects statistics only between mini-batches within a single batch.
CmBN 表示 CBN 修改版本,如图 4 所示,定义为交叉小批量规范化 (CmBN)。这仅在单个批处理中的微型批处理之间收集统计信息。

Figure 4: Cross mini-Batch Normalization.Figure 4: Cross mini-Batch Normalization.

We modify SAM from spatial-wise attention to pointwise attention, and replace shortcut connection of PAN to concatenation, as shown in Figure 5 and Figure 6, respectively.
我们从空间上的注意力到点注意力来修改SAM,并将 PAN 的快捷方式连接改为串联,如图 5 和图 6 所示。

Figure 5: Modified SAM.Figure 5: Modified SAM.

Figure 6: Modified PAN.Figure 6: Modified PAN.

3.4 YOLO V4

In this section, we shall elaborate the details of YOLOv4.
在本节中,我们将详细阐述YOLOv4的细节。

YOLOv4 consists of:

  • Backbone: CSPDarknet53 [81]
  • Neck: SPP [25], PAN [49]
  • Head: YOLOv3 [63]

YOLOv4 包括:
·

  • 主干: CSPDarknet53 [81]
    ·
  • 颈部: SPP [25], PAN [49]
    ·
  • 头部: YOLOv3 [63]

YOLO v4 uses:

  • Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing

YOLO v4 使用: *

主干免费赠品袋 (BoF): CutMix 和马赛克数据扩增、 DropBlock 正则化、 类标签平滑 *

  • Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multiinput weighted residual connections (MiWRC)
    主干专用袋 (BoS):误线激活、跨阶段部分连接 (CSP)、多输入加权剩余连接 (MiWRC)]
  • Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for a single ground truth, Cosine annealing scheduler [52], Optimal hyperparameters, Random training shapes
    用于检测器的免费赠品袋 (BoF): CIoU 损失、 CmBN、 DropBlock 正化、 马赛克数据扩增、 自我对抗训练、 消除网格灵敏度、 使用多个锚点进行单一地面真相、 柯辛退火调度程序 [52]、最佳超参数、随机训练形状
  • Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
    用于检测器的特有包 (BoS): 误区激活、 SPP 块、 SAM 块、 PAN 路径聚合块、 DIoU-NMS

4 Experiments

We test the influence of different training improvement techniques on accuracy of the classifier on ImageNet (ILSVRC 2012 val) dataset, and then on the accuracy of the detector on MS COCO (test-dev 2017) dataset.
我们测试不同训练改进技术对 ImageNet (ILSVRC 2012 val) 数据集上的分类器准确性的影响,然后测试 MS COCO (test-dev 2017) 数据集上的检测器的准确性。

4.1. Experimental setup

In ImageNet image classification experiments, the default hyper-parameters are as follows: the training steps is 8,000,000; the batch size and the mini-batch size are 128 and 32, respectively; the polynomial decay learning rate scheduling strategy is adopted with initial learning rate 0.1; the warm-up steps is 1000; the momentum and weight decay are respectively set as 0.9 and 0.005. All of our BoS experiments use the same hyper-parameter as the default setting, and in the BoF experiments, we add an additional 50% training steps. In the BoF experiments, we verify MixUp, CutMix, Mosaic, Bluring data augmentation, and label smoothing regularization methods. In the BoS experiments, we compared the effects of LReLU, Swish, and Mish activation function. All experiments are trained with a 1080 Ti or 2080 Ti GPU.
在 ImageNet 图像分类实验中,默认的超参数如下:训练步骤为 8,000,000;批次大小和小批次大小分别为128和32;采用多项式衰减学习率调度策略,初始学习速率为0.1;预热步骤为1000;动量和重量衰减分别设置为 0.9 和 0.005。我们所有的 BoS 实验都使用与默认设置相同的超参数,在 BoF 实验中,我们增加了 50% 的训练步骤。在 BoF 实验中,我们验证 MixUp、CutMix、马赛克、模糊数据扩增和标签平滑规律化方法。在BoS实验中,我们比较了LReLU、Swish和Mish激活函数的影响。所有实验都使用 1080 Ti 或 2080 Ti GPU 进行训练。

In MS COCO object detection experiments, the default hyper-parameters are as follows: the training steps is 500,500; the step decay learning rate scheduling strategy is adopted with initial learning rate 0.01 and multiply with a factor 0.1 at the 400,000 steps and the 450,000 steps, respectively; The momentum and weight decay are respectively set as 0.9 and 0.0005. All architectures use a single GPU to execute multi-scale training in the batch size of 64 while mini-batch size is 8 or 4 depend on the architectures and GPU memory limitation. Except for using genetic algorithm for hyper-parameter search experiments, all other experiments use default setting. Genetic algorithm used YOLOv3-SPP to train with GIoU loss and search 300 epochs for min-val 5k sets. We adopt searched learning rate 0.00261, momentum 0.949, IoU threshold for assigning ground truth 0.213, and loss normalizer 0.07 for genetic algorithm experiments. We have verified a large number of BoF, including grid sensitivity elimination, mosaic data augmentation, IoU threshold, genetic algorithm, class label smoothing, cross mini-batch normalization, selfadversarial training, cosine annealing scheduler, dynamic mini-batch size, DropBlock, Optimized Anchors, different kind of IoU losses. We also conduct experiments on various BoS, including Mish, SPP, SAM, RFB, BiFPN, and Gaussian YOLO [8]. For all experiments, we only use one GPU for training, so techniques such as syncBN that optimizes multiple GPUs are not used.
在MS COCO目标检测实验中,默认超参数如下:训练步骤为500,500;采用步减学习率调度策略,初始学习速率0.01,在400,000步和45万步时乘以因子0.1;动量和重量衰减分别设置为 0.9 和 0.0005。所有体系结构都使用单个 GPU 执行批处理大小为 64 的多级训练,而小批处理大小为 8 或 4 取决于体系结构和 GPU 内存限制。除了使用遗传算法进行超参数搜索实验外,所有其他实验都使用默认设置。遗传算法使用YOLOv3-SPP训练与GIoU损失和搜索300个纪元为最小瓦尔5k集。采用搜索学习速率0.00261、动量0.949、分配地真0.213的IoU阈值和遗传算法实验的损耗规范化器0.07。我们已经验证了大量的BoF,包括网格灵敏度消除、镶嵌数据扩增、IoU阈值、遗传算法、类标签平滑、交叉小批量规范化、自对抗性训练、原蛋白退火调度器、动态小批量尺寸、DropBlock、优化锚点、不同类型的IoU损耗。我们还对各种 BoS 进行实验,包括米什、SPP、SAM、RFB、BiFPN 和高斯 YOLO [8]。对于所有实验,我们只使用一个 GPU 进行训练,因此不使用同步BN优化多个 GPU 等技术。

4.2. Influence of different features on Classifier training

First, we study the influence of different features on classifier training; specifically, the influence of Class label smoothing, the influence of different data augmentation techniques, bilateral blurring, MixUp, CutMix and Mosaic, as shown in Fugure 7, and the influence of different activations, such as Leaky-ReLU (by default), Swish, and Mish.
首先,研究了不同特点对分类器训练的影响;具体来说,类标签平滑的影响,不同数据扩增技术的影响,双边模糊,MixUp,CutMix和马赛克,如Fugure 7所示,以及不同激活的影响,如泄漏-ReLU(默认情况下),Swish和Mish。

Figure 7: Various method of data augmentation.Figure 7: Various method of data augmentation.

In our experiments, as illustrated in Table 2, the classifier’s accuracy is improved by introducing the features such as: CutMix and Mosaic data augmentation, Class label smoothing, and Mish activation. As a result, our BoFbackbone (Bag of Freebies) for classifier training includes the following: CutMix and Mosaic data augmentation and Class label smoothing. In addition we use Mish activation as a complementary option, as shown in Table 2 and Table 3.
在我们的实验中,如表 2 所示,分类器的精度通过引入以下功能得到提高,例如:CutMix 和马赛克数据扩增、类标签平滑和 Mish 激活。因此,我们的 BoF骨干(免费赠品袋)用于分类器训练包括以下内容:CutMix 和马赛克数据扩增和类标签平滑。此外,我们使用 Mish 激活作为补充选项,如表 2 和表 3 所示。

4.3 Influence of different features on Detector training

Further study concerns the influence of different Bag-ofFreebies (BoF-detector) on the detector training accuracy, as shown in Table 4. We significantly expand the BoF list through studying different features that increase the detector accuracy without affecting FPS:
进一步研究涉及不同的免费包(BoF-检测器)对检测器训练精度的影响,如表4所示。我们通过研究在不影响 FPS 的情况下提高检测器精度的不同功能,显著扩展了 BoF 列表:

  • S: Eliminate grid sensitivity the equation b x = σ(t x )+ c x , b y = σ(t y )+c y , where c x and c y are always whole numbers, is used in YOLOv3 for evaluating the object coordinates, therefore, extremely high t x absolute values are required for the b x value approaching the c x or c x + 1 values. We solve this problem through multiplying the sigmoid by a factor exceeding 1.0, so eliminating the effect of grid on which the object is undetectable.
    S: 消除网格灵敏度方程 b x = (t x ) c x, b y = (t y ) c y , 其中 c x 和 c y 始终为整数, 在 YOLOv3 中使用用于评估对象坐标, 因此,接近 c x 或 c x = 1 值的 b x 值需要极高的 t x 绝对值。我们通过将 sigmoid 乘以超过 1.0 的因子来解决此问题,从而消除了对象无法检测到的网格的影响。
  • M: Mosaic data augmentation – using the 4-image mosaic during training instead of single image
    M:马赛克数据扩增 – 在训练期间使用 4 图像镶嵌,而不是单个图像
  • IT: IoU threshold – using multiple anchors for a single ground truth IoU (truth, anchor) > IoU threshold
    IT:IoU 阈值 – 使用多个锚点进行单个接地真相 IoU(真、锚)和 IoU 阈值 |
  • GA: Genetic algorithms – using genetic algorithms for selecting the optimal hyperparameters during network training on the first 10% of time periods
    GA:遗传算法 – 在前 10% 的时间段的网络训练期间使用遗传算法选择最佳超参数*
  • LS: Class label smoothing – using class label smoothing for sigmoid activation
    LS:类标签平滑 – 使用类标签平滑进行 sigmoid 激活
  • CBN: CmBN – using Cross mini-Batch Normalization for collecting statistics inside the entire batch, instead of collecting statistics inside a single mini-batch
    CBN: CmBN – 使用交叉小批处理规范化收集整个批处理中的统计信息,而不是在单个小批处理中收集统计信息*
  • CA: Cosine annealing scheduler – altering the learning rate during sinusoid training
    CA:协和素退火调度器 – 改变正弦训练中的学习速率*
  • DM: Dynamic mini-batch size – automatic increase of mini-batch size during small resolution training by using Random training shapes
    DM:动态小批量尺寸 – 使用随机训练形状在小分辨率训练期间自动增加小批量大小
  • OA: Optimized Anchors – using the optimized anchors for training with the 512×512 network resolution
    OA:优化的锚点 – 使用优化的锚点进行 512×512 网络分辨率的训练*
  • GIoU, CIoU, DIoU, MSE – using different loss algorithms for bounded box regression
    GIoU、CIoU、DIoU、MSE – 对边界框回归使用不同的损耗算法

Further study concerns the influence of different Bagof-Specials (BoS-detector) on the detector training accuracy, including PAN, RFB, SAM, Gaussian YOLO (G), and ASFF, as shown in Table 5. In our experiments, the detector gets best performance when using SPP, PAN, and SAM.
进一步研究涉及不同的巴戈夫特辑(BoS-检测器)对检测器训练精度的影响,包括PAN、RFB、SAM、高斯YOLO(G)和ASFF,如表5所示。在我们的实验中,检测器在使用 SPP、PAN 和 SAM 时获得最佳性能。

4.4 Influence of different backbones and pretrained weightings on Detector training

Further on we study the influence of different backbone models on the detector accuracy, as shown in Table 6. We notice that the model characterized with the best classification accuracy is not always the best in terms of the detector accuracy.
进一步研究了不同骨干模型对检测器精度的影响,如表6所示。我们注意到,在检测器精度方面,具有最佳分类精度的模型并不总是最好的。

First, although classification accuracy of CSPResNeXt50 models trained with different features is higher compared to CSPDarknet53 models, the CSPDarknet53 model shows higher accuracy in terms of object detection.
首先,虽然与CSPDarknet53模型相比,训练具有不同功能的CSPReSNeXt50型号的分类精度较高,但CSPDarknet53模型在物体检测方面表现出更高的精度。

Second, using BoF and Mish for the CSPResNeXt50 classifier training increases its classification accuracy, but further application of these pre-trained weightings for detector training reduces the detector accuracy. However, using BoF and Mish for the CSPDarknet53 classifier training increases the accuracy of both the classifier and the detector which uses this classifier pre-trained weightings. The net result is that backbone CSPDarknet53 is more suitable for the detector than for CSPResNeXt50.
其次,将BoF和Mish用于CSPResNeXt50分类器训练,提高了其分类精度,但进一步将这些预先训练的权重应用于检测器训练,降低了检测器的精度。但是,将 BoF 和 Mish 用于 CSPDarknet53 分类器训练可提高分类器和检测器的准确性,后者使用此分类器预先训练的权重。最终结果是,主干 CSPDarknet53 更适合检测器,而不是 CSPResNeXt50。

We observe that the CSPDarknet53 model demonstrates a greater ability to increase the detector accuracy owing to various improvements.
我们观察到,CSPDarknet53模型表明,由于各种改进,提高了检测器精度的能力。

4.5 Influence of different mini-batch size on Detector training

Finally, we analyze the results obtained with models trained with different mini-batch sizes, and the results are shown in Table 7. From the results shown in Table 7, we found that after adding BoF and BoS training strategies, the mini-batch size has almost no effect on the detector’s performance. This result shows that after the introduction of BoF and BoS, it is no longer necessary to use expensive GPUs for training. In other words, anyone can use only a conventional GPU to train an excellent detector.
最后,分析了使用不同小批次大小的模型获得的结果,结果显示在表7中。从表7所示的结果中,我们发现,在添加了BoF和BoS训练策略后,小批量大小对检测器的性能几乎没有影响。这一结果表明,在引入BoF和BoS后,不再需要使用昂贵的 GPU 进行训练。换句话说,任何人都可以只使用传统的 GPU 来训练出色的检测器。

Figure 8: Comparison of the speed and accuracy of different object detectors. (Some articles stated the FPS of their detectors for only one of the GPUs: Maxwell/Pascal/Volta)Figure 8: Comparison of the speed and accuracy of different object detectors. (Some articles stated the FPS of their detectors for only one of the GPUs: Maxwell/Pascal/Volta)

5 Results

Comparison of the results obtained with other state-of-the-art object detectors are shown in Figure 8. Our YOLOv4 are located on the Pareto optimality curve and are superior to the fastest and most accurate detectors in terms of both speed and accuracy.
图8显示了与其他最先进的目标检测器结果的比较。我们的 YOLOv4 位于帕雷托最佳曲线上,在速度和精度方面优于最快、最精确的检测器。

Since different methods use GPUs of different architectures for inference time verification, we operate YOLOv4 on commonly adopted GPUs of Maxwell, Pascal, and Volta architectures, and compare them with other state-of-the-art methods. Table 8 lists the frame rate comparison results of using Maxwell GPU, and it can be GTX Titan X (Maxwell) or Tesla M40 GPU. Table 9 lists the frame rate comparison results of using Pascal GPU, and it can be Titan X (Pascal), Titan Xp, GTX 1080 Ti, or Tesla P100 GPU. As for Table 10, it lists the frame rate comparison results of using Volta GPU, and it can be Titan Volta or Tesla V100 GPU.
由于不同方法使用不同体系结构的 GPU 进行推理时间验证,因此我们在通常采用的 Maxwell、Pascal 和 Volta 体系结构的 GPU 上运行 YOLOv4,并将其与其他最先进的方法进行比较。表 8 列出了使用 Maxwell GPU 的帧速率比较结果,它可以是 GTX 泰坦 X (Maxwell) 或特斯拉 M40 GPU。表 9 列出了使用 Pascal GPU 的帧速率比较结果,它可以是泰坦 X (帕斯卡尔)、泰坦 Xp、GTX 1080 Ti 或特斯拉 P100 GPU。至于表10,它列出了使用Volta GPU的帧速率比较结果,它可以是泰坦伏尔塔或特斯拉V100 GPU。

6 Conclusions

We offer a state-of-the-art detector which is faster (FPS) and more accurate (MS COCO AP 50…95 and AP 50 ) than all available alternative detectors. The detector described can be trained and used on a conventional GPU with 8-16 GB-VRAM this makes its broad use possible. The original concept of one-stage anchor-based detectors has proven its viability. We have verified a large number of features, and selected for use such of them for improving the accuracy of both the classifier and the detector. These features can be used as best-practice for future studies and developments.
我们提供最先进的检测器,它比所有可用的替代检测器更快(FPS)和更准确(MS COCO AP 50…95 和 AP 50)。所述检测器可在具有 8-16 GB-VRAM 的传统 GPU 上进行训练和使用,这使得其广泛使用成为可能。一级锚式检测器的最初概念已证明其可行性。我们已经验证了大量的功能,并选择用于这些功能,以提高分类器和检测器的准确性。这些功能可用作未来研究和发展的最佳实践。

7 Acknowledgements

The authors wish to thank Glenn Jocher for the ideas of Mosaic data augmentation, the selection of hyper-parameters by using genetic algorithms and solving the grid sensitivity problem https://github.com/ ultralytics/yolov3.
作者希望感谢格伦·乔彻的想法,马赛克数据扩增,选择超参数通过使用遗传算法和解决网格敏感性问题https://github.com/超溶质/yolov3。

参考文献

References

[1] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5561–5569, 2017. 4

[2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6154–6162, 2018. 12

[3] Jiale Cao, Yanwei Pang, Jungong Han, and Xuelong Li. Hierarchical shot detector. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9705–9714, 2019. 12

[4] Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. HarDNet: A low memory traffic network. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019. 13

[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2017. 2, 4

[6] Pengguang Chen. GridMask data augmentation. arXiv preprint arXiv:2001.04086, 2020. 3

[7] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. DetNAS: Backbone search for object detection. In Advances in Neural Information Processing Systems (NeurIPS), pages 6638–6648, 2019. 2

[8] Jiwoong Choi, Dayoung Chun, Hyun Kim, and Hyuk-Jae Lee. Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 502–511, 2019. 7

[9] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems (NIPS), pages 379–387, 2016. 2

[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. 5

[11] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with CutOut. arXiv preprint arXiv:1708.04552, 2017. 3

[12] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recognition and localization. arXiv preprint arXiv:1912.05027, 2019. 2

[13] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6569–6578, 2019. 2, 12

[14] Cheng-Yang Fu, Mykhailo Shvets, and Alexander C Berg. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353, 2019. 12

[15] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (ICLR), 2019. 3

[16] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems (NIPS), pages 10727–10737, 2018. 3

[17] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 70367045, 2019. 2, 13

[18] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015. 2

[19] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014. 2, 4

[20] Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhaohui Yang, Han Wu, Xinghao Chen, and Chang Xu. HitDetector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2

[21] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 5

[22] Bharath Hariharan, Pablo Arbel´aez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456, 2015. 4

[23] Kaiming He, Georgia Gkioxari, Piotr Doll´ar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017. 2

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015. 4

[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1904–1916, 2015. 2, 4, 7

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.

Deep residual learning for image recognition. In Proceed-ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. 2

[27] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019. 2, 4

[28] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 2, 4

[29] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 71327141, 2018. 4

[30] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 47004708, 2017. 2

[31] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and¡ 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016. 2

[32] Sergey Ioffe and Christian Szegedy. Batch normalization:

Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 6

[33] Md Amirul Islam, Shujon Naha, Mrigank Rochan, Neil Bruce, and Yang Wang. Label refinement network for coarse-to-fine semantic segmentation. arXiv preprint arXiv:1703.00551, 2017. 3

[34] Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyramid network for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–250, 2018. 11

[35] G¨unter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 971–980, 2017. 4

[36] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016. 6

[37] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018. 2, 11

[38] Hei Law, Yun Teng, Olga Russakovsky, and Jia Deng.

CornerNet-Lite: Efficient keypoint based object detection. arXiv preprint arXiv:1904.08900, 2019. 2

[39] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2169–2178. IEEE, 2006. 4

[40] Youngwan Lee and Jongyoul Park. CenterMask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 12, 13

[41] Shuai Li, Lingxiao Yang, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Dynamic anchor feature selection for single-shot object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6609–6618, 2019. 12

[42] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6054–6063, 2019. 12

[43] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 334–350, 2018. 2

[44] Tsung-Yi Lin, Piotr Doll´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. 2

[45] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 2, 3, 11, 13

[46] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755, 2014. 5

[47] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018. 2, 4, 11

[48] Songtao Liu, Di Huang, and Yunhong Wang. Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516, 2019. 2, 4, 13

[49] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8759–8768, 2018. 1, 2, 7

[50] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2016. 2, 11

[51] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015. 4

[52] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 7

[53] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNetV2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018. 2

[54] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of International Conference on Machine Learning (ICML), volume 30, page 3, 2013. 4

[55] Diganta Misra. Mish: A self regularized nonmonotonic neural activation function. arXiv preprint arXiv:1908.08681, 2019. 4

[56] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of International Conference on Machine Learning (ICML), pages 807–814, 2010. 4

[57] Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Enriched feature guided refinement network for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9537–9546, 2019. 12

[58] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 821–830, 2019. 2, 12

[59] Prajit Ramachandran, Barret Zoph, and Quoc V Le.

Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. 4

[60] Abdullah Rashwan, Agastya Kalra, and Pascal Poupart. Matrix Nets: A new deep architecture for object detection. In Proceedings of the IEEE International Conference on Computer Vision Workshop (ICCV Workshop), pages 0–0, 2019. 2

[61] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779788, 2016. 2

[62] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 72637271, 2017. 2

[63] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018. 2, 4, 7, 11

[64] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015. 2

[65] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, 2019. 3

[66] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings

of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. 2

[67] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–769, 2016. 3

[68] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2

[69] Krishna Kumar Singh, Hao Yu, Aron Sarmasi, Gautam Pradeep, and Yong Jae Lee. Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond. arXiv preprint arXiv:1811.02545, 2018. 3

[70] Saurabh Singh and Shankar Krishnan. Filter response normalization layer: Eliminating batch dependence in the training of deep neural networks. arXiv preprint arXiv:1911.09737, 2019. 6

[71] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. DropOut: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014. 3

[72] K-K Sung and Tomaso Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(1):39–51, 1998. 3

[73] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. 3

[74] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MNASnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2820–2828, 2019. 2

[75] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), 2019. 2

[76] Mingxing Tan and Quoc V Le. MixNet: Mixed depthwise convolutional kernels. In Proceedings of the British Machine Vision Conference (BMVC), 2019. 5

[77] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2, 4, 13

[78] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9627–9636, 2019. 2

[79] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015. 6

[80] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of International Conference on Machine Learning (ICML), pages 1058–1066, 2013. 3

[81] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of cnn. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPR Workshop), 2020. 2, 7

[82] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2965–2974, 2019. 12

[83] Shaoru Wang, Yongchao Gong, Junliang Xing, Lichao Huang, Chang Huang, and Weiming Hu. RDSNet: A new deep architecture for reciprocal object detection and instance segmentation. arXiv preprint arXiv:1912.05070, 2019. 13

[84] Tiancai Wang, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Learning rich features at high-speed for single-shot object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1971–1980, 2019. 11

[85] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018. 1, 2, 4

[86] Saining Xie, Ross Girshick, Piotr Doll´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1492–1500, 2017. 2

[87] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9657–9666, 2019. 2, 12

[88] Lewei Yao, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. SM-NAS: Structural-to-modular neural architecture search for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020. 13

[89] Zhuliang Yao, Yue Cao, Shuxin Zheng, Gao Huang, and Stephen Lin. Cross-iteration batch normalization. arXiv preprint arXiv:2002.05712, 2020. 1, 6

[90] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM international conference on Multimedia, pages 516–520, 2016. 3

[91] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6023–6032, 2019. 3

[92] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. MixUp: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017. 3

[93] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7151–7160, 2018. 6

[94] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 13

[95] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4203–4212, 2018. 11

[96] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In Advances in Neural Information Processing Systems (NeurIPS), 2019. 12

[97] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018. 2

[98] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 9259–9266, 2019. 2, 4, 11

[99] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU Loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020. 3, 4

[100] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017. 3

[101] Chenchen Zhu, Fangyi Chen, Zhiqiang Shen, and Marios Savvides. Soft anchor-point object detection. arXiv preprint arXiv:1911.12448, 2019. 12

[102] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 840–849, 2019. 11

如果你对这篇文章有什么疑问或建议,欢迎下面留言提出,我看到会立刻回复!

打赏
未经允许不得转载:马春杰杰 » YOLO v4 论文中英对照翻译 | YOLO v4全文翻译

留个评论吧~ 抢沙发 评论前登陆可免验证码!

私密评论

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址(选填,便于回访^_^)
切换注册

登录

忘记密码 ?

切换登录

注册