• Infrared and Laser Engineering
  • Vol. 53, Issue 9, 20240253 (2024)
Genghuan LIU1,2,3, Xiangjin ZENG1,2,3, Jiazhen DOU1,2,3, Zhenbo REN4,*, Liyun ZHONG1,2,3, Jianglei DI1,2,3 and Yuwen QIN1,2,3
Author Affiliations
  • 1School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • 2Key Laboratory of Photonic Technology for Integrated Sensing and Communication, Ministry of Education, Guangzhou 510006, China
  • 3Guangdong Provincial Key Laboratory of Information, Guangzhou 510006, China
  • 4School of Physical Science and Technology, Northwestern Polytechnical University, Xi'an 710129, China
DOI: 10.3788/IRLA20240253
Citation: Genghuan LIU, Xiangjin ZENG, Jiazhen DOU, Zhenbo REN, Liyun ZHONG, Jianglei DI, Yuwen QIN. Review of advances in small object detection technology based on deep learning (invited)[J]. Infrared and Laser Engineering, 2024, 53(9): 20240253
Fig. 1. Examples of small and tiny objects in the AI-TOD dataset (green boxes mark small objects, while red boxes mark tiny objects)[12]
Fig. 2. A complex background leads to a low signal-to-noise ratio and low detectability[6]
Fig. 3. Low tolerance of small targets to bounding box perturbations (the top-left, bottom-left, and right images show small, medium, and large targets, respectively; black indicates the ground-truth boxes, while blue and red indicate predicted boxes slightly offset along the diagonal)
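The sensitivity shown in Fig. 3 is easy to check numerically. The sketch below (Python, with box sizes chosen purely for illustration) shifts a prediction 3 pixels along the diagonal and computes its IoU with the ground truth:

```python
# Illustrative IoU check: the same 3-pixel diagonal shift hurts a small
# box far more than a large one. Boxes are (x1, y1, x2, y2) in pixels.
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

for size in (12, 32, 96):                 # tiny, small, and large targets
    gt = (0, 0, size, size)
    pred = (3, 3, size + 3, size + 3)     # prediction offset 3 px diagonally
    print(f"{size:3d} px box: IoU = {iou(gt, pred):.2f}")
```

With these sizes the IoU falls to roughly 0.39 for the 12 px box but stays near 0.88 for the 96 px box, so a fixed IoU threshold silently discards many nearly correct small-object predictions.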
Fig. 4. Four methods of multi-scale representation learning[76]. (a) Single feature map; (b) Image pyramid; (c) Pyramid feature levels; (d) Feature pyramid network
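To make the feature pyramid of Fig. 4(d) concrete, the following is a minimal top-down FPN neck in PyTorch; the channel widths and three-level setup are illustrative assumptions, not the configuration of any specific detector in this review:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal top-down feature pyramid in the style of Fig. 4(d)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each backbone stage to a common width
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth each merged map
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                      # feats: high-res -> low-res
        maps = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(maps) - 2, -1, -1):
            # upsample the coarser, semantically stronger level and add it in
            maps[i] = maps[i] + F.interpolate(maps[i + 1],
                                              size=maps[i].shape[-2:], mode="nearest")
        return [s(m) for s, m in zip(self.smooth, maps)]

feats = [torch.randn(1, c, s, s) for c, s in ((256, 64), (512, 32), (1024, 16))]
pyramid = TinyFPN()(feats)     # three 256-channel maps at 64/32/16 resolution
```

PANet (Fig. 5) builds on exactly this structure, adding a mirrored bottom-up path so that the precise localization signals of shallow layers also reach the deep levels in only a few hops.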
Fig. 5. PANet network structure[81]
Fig. 6. GCWNet network structure[114]
Fig. 7. Module structure of LSKNet[127]
Fig. 8. Detection methods of four anchor-free mechanisms. (a) CornerNet; (b) CenterNet; (c) ExtremeNet; (d) FCOS
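Of the four anchor-free schemes in Fig. 8, FCOS is the simplest to state in code: every feature-map location inside a ground-truth box regresses its distances to the four box sides. The sketch below is a deliberately reduced single-box, single-level version (multi-level assignment and the losses are omitted):

```python
import torch

def fcos_targets(points, gt_box):
    """FCOS-style targets for one ground-truth box: each location regresses
    (l, t, r, b) distances; only locations inside the box are positives."""
    x, y = points[:, 0], points[:, 1]
    x1, y1, x2, y2 = gt_box
    ltrb = torch.stack([x - x1, y - y1, x2 - x, y2 - y], dim=1)
    inside = ltrb.min(dim=1).values > 0            # positive-sample mask
    lr, tb = ltrb[:, [0, 2]], ltrb[:, [1, 3]]
    # "center-ness" in [0, 1] down-weights locations far from the box center
    centerness = ((lr.min(1).values / lr.max(1).values) *
                  (tb.min(1).values / tb.max(1).values)).clamp(min=0).sqrt()
    return ltrb, inside, centerness

pts = torch.tensor([[10., 10.], [22., 20.], [70., 70.]])   # feature locations
ltrb, pos, ctr = fcos_targets(pts, (5., 5., 40., 35.))
print(pos)     # tensor([ True,  True, False])
```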
Fig. 9. DETR network structure[150]
Fig. 10. Anchor DETR network structure[157]
Fig. 11. Four image fusion strategies. (a) Early fusion; (b) Mid-level fusion; (c) Late fusion; (d) Confidence fusion[169]
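The practical difference between the first two strategies in Fig. 11 is simply where the modalities meet in the network. The hypothetical snippet below sketches early fusion as channel concatenation before the first convolution, and mid-level fusion as merging intermediate feature maps; all layer sizes are invented for illustration:

```python
import torch
import torch.nn as nn

class EarlyFusionStem(nn.Module):
    """Early fusion (Fig. 11(a)): visible (3-ch) and infrared (1-ch) frames
    are concatenated into a 4-channel input before the backbone."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(3 + 1, out_ch, kernel_size=7, stride=2, padding=3)

    def forward(self, rgb, ir):
        return self.conv(torch.cat([rgb, ir], dim=1))

# Mid-level fusion (Fig. 11(b)): separate backbones, then merge intermediate
# feature maps, e.g. by concatenation followed by a 1x1 convolution.
fuse = nn.Conv2d(256 + 256, 256, kernel_size=1)
f_rgb, f_ir = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
f_mid = fuse(torch.cat([f_rgb, f_ir], dim=1))
```

Late and confidence fusion (Fig. 11(c)(d)) instead run two full detection branches and merge their boxes or scores, trading extra computation for modularity.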
Fig. 12. YOLOFusion network structure[182]
Fig. 13. Examples of various datasets. (a) DOTA[13]; (b) AI-TOD[12]; (c) DIOR[8]; (d) VisDrone2019[22]; (e) TT100K[218]; (f) BSTID[219]; (g) TinyPerson[14]; (h) CityPersons[25]; (i) WiderPerson[220]; (j) BIRDSAI[221]; (k) VEDAI[222]; (l) MS COCO[1]
| Number | Method | Main content | Year | Publication |
|---|---|---|---|---|
| 1 | CutOut[41] | | 2017 | arXiv |
| 2 | Adaptive Resampling[47] | | 2019 | ICCV |
| 3 | Mosaic[45] | | 2019 | arXiv |

Table 1. Data augmentation methods
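Two of the methods in Table 1 are simple enough to sketch directly. Below is a simplified CutOut and a naive Mosaic in NumPy (real implementations also remap the box annotations and randomize the tiling; the Mosaic sketch assumes each input is at least 320 px on a side):

```python
import numpy as np

rng = np.random.default_rng(0)

def cutout(img, size=32):
    """CutOut-style: zero one random square patch so the network cannot
    depend on any single local region."""
    h, w = img.shape[:2]
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.copy()
    out[y1:y2, x1:x2] = 0
    return out

def mosaic(imgs, out_size=640):
    """Mosaic-style: tile four images onto one canvas, which shrinks their
    objects and multiplies the number of small-object samples per batch."""
    s = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    for i, im in enumerate(imgs[:4]):
        y0, x0 = (i // 2) * s, (i % 2) * s
        canvas[y0:y0 + s, x0:x0 + s] = im[:s, :s]   # naive crop-and-place
    return canvas
```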
| Number | Method | Main content | Year | Publication |
|---|---|---|---|---|
| 1 | CARAFE[58] | | 2019 | CVPR |
| 2 | Perceptual GAN[68] | | 2017 | CVPR |
| 3 | MTGAN[71] | | 2020 | IJCV |

Table 2. Super-resolution methods
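As a point of reference for Table 2, the snippet below shows a generic sub-pixel (PixelShuffle) upsampling block of the kind many SR-assisted detection pipelines build on. It is a plain sketch of the shared core operation, not the CARAFE, Perceptual GAN, or MTGAN architecture:

```python
import torch
import torch.nn as nn

class SubPixelSR(nn.Module):
    """Generic x2 super-resolution block: a few convolutions predict
    scale**2 sub-pixel channels, then PixelShuffle rearranges them
    into a higher-resolution image."""
    def __init__(self, scale=2, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))      # (B, 3*s^2, H, W) -> (B, 3, H*s, W*s)

    def forward(self, x):                # x: low-resolution crop (B, 3, H, W)
        return self.body(x)

hr = SubPixelSR()(torch.randn(1, 3, 32, 32))   # -> shape (1, 3, 64, 64)
```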
| Method | Models | Advantage | Disadvantage |
|---|---|---|---|
| Data augmentation | MixUp[42], CutMix[43], Mosaic[45] | Increases small-object samples to compensate for their limited visual information | Relies heavily on the specific dataset; may introduce new noise that impairs feature extraction |
| Super resolution | CARAFE[58], Perceptual GAN[68], MTGAN[71] | Restores some small-object detail by learning the relationship between small and large targets | Faces a trade-off between computational load and accuracy; GANs may generate false artifacts |
| Multi-scale feature perception and fusion | FPN[76], PANet[78], AFF[88] | Enriches features with deep semantics while retaining the spatial detail of shallow layers | Prone to noise interference and added computational burden |
| Contextual information learning | CoupleNet[103], PyramidBox[104], GCWNet[114] | Exploits relations between the target, surrounding objects, and the environment to give the network more information | Redundant context can act as information noise |
| Large-kernel convolution | ConvNeXt[124], LSKNet[127], YOLO-MS[129] | A larger receptive field effectively captures long-range dependencies and context | Introduces heavy computational overhead, hindering real-time detection |
| Anchor-free | CenterNet[138], FCOS[141], YOLOX[143] | Avoids complex anchor-box computation | Often yields less accurate bounding boxes |
| DETR | DETR[151], CF-DETR[154], RT-DETR[19] | Avoids complex convolution-based designs and hand-crafted post-processing | Training converges slowly |
| Dual-mode | Wagner et al.[170], Liu et al.[174], YOLOFusion[182] | Improves detection performance and robustness, especially in complex environments | Increases computational cost and system complexity |

Table 3. Summary of advantages and disadvantages of small object detection methods
| Model | Backbone | AP | AP0.50 | AP0.75 | APS | APM | APL | Year |
|---|---|---|---|---|---|---|---|---|
| FPN[76] | ResNet101 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 | 2017 |
| PANet[84] | ResNeXt101 | 40.0 | 62.8 | 43.1 | 18.8 | 42.3 | 57.2 | 2018 |
| FCOS[141] | ResNet101 | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 | 2019 |
| YOLOX-L[143] | Modified CSP v5 | 50.0 | 68.5 | 54.5 | 29.8 | 54.5 | 64.4 | 2021 |
| QueryDet[209] | ResNeXt101 | 44.7 | 65.6 | 47.4 | 29.1 | 47.5 | 53.1 | 2022 |
| RTMDet-m[128] | CSPDarkNet | 49.3 | 66.9 | 53.9 | 30.5 | 53.6 | 66.1 | 2022 |
| DN-DETR[162] | ResNet101+DC5 | 47.3 | 67.5 | 50.8 | 28.6 | 51.5 | 65.0 | 2022 |
| YOLO-MS[129] | CSPDarkNet | 51.0 | 68.6 | 55.7 | 33.1 | 56.1 | 66.5 | 2023 |
| RT-DETR[19] | ResNet101 | 54.3 | 72.7 | 58.6 | 36.0 | 58.8 | 72.1 | 2023 |

Table 4. Brief performance evaluation on the MS COCO dataset
Note: In the original table, bold marks the best score for each metric, an underline the second best, and a wavy underline the third best.
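For reading the APS/APM/APL columns in Table 4: MS COCO buckets objects by pixel area, and small objects are those under 32×32 pixels:

```python
# MS COCO's size buckets, which define the AP_S / AP_M / AP_L columns above
def coco_size_bucket(area_px):
    if area_px < 32 ** 2:
        return "small"      # AP_S: area < 32^2
    if area_px < 96 ** 2:
        return "medium"     # AP_M: 32^2 <= area < 96^2
    return "large"          # AP_L: area >= 96^2
```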
| Model | Backbone | AP0.50 | Year |
|---|---|---|---|
| YOLOv2[40] | DarkNet19 | 25.4 | 2017 |
| CenterNet[138] | ResNet101 | 59.1 | 2019 |
| CADNet[106] | ResNet101 | 69.9 | 2019 |
| SLA[201] | ResNet50 | 76.3 | 2021 |
| PP-YOLOE-R[149] | CSPRepResNet | 80.7 | 2022 |
| RTMDet-L[128] | CSPDarkNet53 | 81.3 | 2022 |
| Info-FPN[98] | ResNet50 | 80.9 | 2023 |
| PCI[115] | ReResNet50 | 80.2 | 2023 |

Table 5. Brief performance evaluation on the DOTA dataset
Note: In the original table, bold marks the best score for each metric, an underline the second best, and a wavy underline the third best.
| Model | Backbone | AP | AP0.50 | AP0.75 | APvt | APt | APs | APm | Year |
|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN[17] | ResNet50 | 12.4 | 28.3 | 8.1 | 0.0 | 8.4 | 26.3 | 36.2 | 2015 |
| Cascade R-CNN[207] | ResNet50 | 14.4 | 32.7 | 10.6 | 0.0 | 9.9 | 28.3 | 39.9 | 2018 |
| FSAF[140] | ResNet50 | 14.4 | 35.3 | 8.4 | 3.4 | 14.4 | 19.9 | 24.2 | 2019 |
| TOOD[145] | ResNet50 | 18.6 | 43.0 | 12.7 | 3.2 | 16.5 | 26.9 | 39.2 | 2021 |
| M-CenterNet[13] | DLA-34 | 14.5 | 40.7 | 6.4 | 6.1 | 15.0 | 19.4 | 20.4 | 2021 |
| Faster R-CNN/NWD[199] | ResNet50 | 20.5 | 51.5 | 12.4 | 5.8 | 20.3 | 25.4 | 35.7 | 2021 |
| Faster R-CNN/RFLA[202] | ResNet50 | 21.1 | 51.6 | 13.1 | 9.5 | 21.2 | 26.1 | 31.5 | 2022 |
| FSANet[95] | ResNet50 | 16.3 | 41.4 | 9.8 | 4.4 | 14.6 | 23.4 | 33.3 | 2022 |
| Faster R-CNN/ADAS-GPM[203] | ResNet50 | 22.3 | 53.7 | 13.5 | 7.1 | 21.9 | 27.5 | 35.1 | 2023 |

Table 6. Brief performance evaluation on the AI-TOD dataset
Note: In the original table, bold marks the best score for each metric, an underline the second best, and a wavy underline the third best. APvt, APt, APs, and APm denote AP on AI-TOD's very tiny, tiny, small, and medium splits, respectively.
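Two of the stronger entries in Table 6 (NWD[199] and RFLA[202]) replace IoU with a distribution-based similarity. As a reminder of how the NWD variant works: each box (cx, cy, w, h) is modeled as a 2-D Gaussian and compared via a Wasserstein distance mapped into (0, 1],

$$
W_2^2(\mathcal{N}_a,\mathcal{N}_b)=\left\|\left[cx_a,\; cy_a,\; \tfrac{w_a}{2},\; \tfrac{h_a}{2}\right]^{\mathrm{T}}-\left[cx_b,\; cy_b,\; \tfrac{w_b}{2},\; \tfrac{h_b}{2}\right]^{\mathrm{T}}\right\|_2^2,\qquad \mathrm{NWD}=\exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a,\mathcal{N}_b)}}{C}\right),
$$

where C is a constant related to the dataset's average object size. Unlike IoU, this similarity varies smoothly even for non-overlapping boxes, which is what lifts the very-tiny-object scores (APvt) in the table.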
| Model | $\mathrm{AP}_{50}^{\mathrm{tiny}1}$ | $\mathrm{AP}_{50}^{\mathrm{tiny}2}$ | $\mathrm{AP}_{50}^{\mathrm{tiny}3}$ | $\mathrm{AP}_{50}^{\mathrm{tiny}}$ | $\mathrm{AP}_{50}^{\mathrm{all}}$ | $\mathrm{AP}_{25}^{\mathrm{tiny}}$ | $\mathrm{AP}_{75}^{\mathrm{tiny}}$ | Year |
|---|---|---|---|---|---|---|---|---|
| Cascade R-CNN[207] | 45.21 | 60.06 | 65.06 | 57.19 | 70.71 | 76.99 | 8.56 | 2018 |
| FCOS[141] | 3.39 | 12.39 | 29.25 | 16.90 | 35.75 | 40.49 | 1.45 | 2019 |
| Faster RCNN-SPPNet[90] | 47.56 | 62.36 | 66.15 | 59.13 | 71.17 | 79.47 | 8.62 | 2021 |
| FPN-SM[14] | 33.91 | 55.16 | 62.58 | 51.33 | 66.96 | 71.55 | 6.46 | 2021 |
| Faster R-CNN-RFLA[202] | 32.80 | 55.60 | 60.60 | 50.10 | 65.30 | 69.90 | 5.90 | 2022 |
| SODNet[116] | 40.53 | 59.52 | 64.62 | 55.55 | 66.22 | 75.98 | 7.61 | 2022 |
| FENet[97] | 37.02 | 55.03 | 62.44 | 51.33 | 66.92 | 72.81 | 6.20 | 2023 |

Table 7. Brief performance evaluation on the TinyPerson dataset
Note: In the original table, bold marks the best score for each metric, an underline the second best, and a wavy underline the third best.
| Model | Rec (Small) | Acc (Small) | F1 (Small) | Rec (Medium) | Acc (Medium) | F1 (Medium) | Rec (Large) | Acc (Large) | F1 (Large) | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| Perceptual GAN[68] | 89.0 | 84.0 | 86.4 | 96.0 | 91.0 | 93.4 | 89.0 | 91.0 | 89.9 | 2017 |
| FPN[76] | 86.4 | 80.1 | 83.1 | 93.9 | 94.0 | 93.3 | 92.2 | 92.2 | 92.2 | 2017 |
| Noh, et al[70] | 92.6 | 84.9 | 88.6 | 97.5 | 94.5 | 96.0 | 97.5 | 93.3 | 95.4 | 2019 |
| EFPN[63] | 92.3 | 85.7 | 88.9 | 96.7 | 95.7 | 96.2 | 97.1 | 94.3 | 95.7 | 2021 |
| SODNet[116] | 90.0 | 85.5 | 87.6 | 96.6 | 95.8 | 96.2 | - | - | - | 2022 |
| AFPN[94] | 92.7 | 85.1 | 88.7 | 97.7 | 95.3 | 96.5 | 97.7 | 94.3 | 96.0 | 2022 |

Table 8. Brief performance evaluation on the TT100K dataset
Note: In the original table, bold marks the best score for each metric, an underline the second best, and a wavy underline the third best.