Author Affiliations
1 School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
2 Key Laboratory of Photonic Technology for Integrated Sensing and Communication, Ministry of Education, Guangzhou 510006, China
3 Guangdong Provincial Key Laboratory of Information, Guangzhou 510006, China
4 School of Physical Science and Technology, Northwestern Polytechnical University, Xi'an 710129, China
Fig. 1. Examples of small and tiny objects in the AI-TOD dataset (green boxes represent small objects, while red boxes represent tiny objects)[12]
Fig. 2. Complex backgrounds lead to a low signal-to-noise ratio and low detectability[6]
Fig. 3. Low tolerance of small targets to bounding-box perturbations (the top-left, bottom-left, and right images show small, medium, and large targets, respectively; black indicates the ground-truth boxes, while blue and red indicate predicted boxes slightly offset along the diagonal direction)
Fig. 4. Four methods of multi-scale representation learning[76]. (a) Single feature map; (b) Image pyramid; (c) Pyramid feature levels; (d) Feature pyramid network
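Fig. 4(d) is the variant most small-object detectors build on. The snippet below is a minimal sketch of that top-down pathway (lateral 1×1 convolutions, upsample-and-add merging, 3×3 smoothing); the channel sizes are illustrative placeholders rather than the exact configuration of FPN[76].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Minimal top-down FPN sketch: lateral 1x1 convs + upsample-and-add + 3x3 smoothing."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: backbone maps ordered from high resolution (C2) to low resolution (C5)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pass: upsample the coarser map and add it to the finer lateral map
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # P2..P5

# Toy usage with random feature maps (shapes are illustrative)
c2 = torch.randn(1, 256, 80, 80)
c3 = torch.randn(1, 512, 40, 40)
c4 = torch.randn(1, 1024, 20, 20)
c5 = torch.randn(1, 2048, 10, 10)
for p in MiniFPN()([c2, c3, c4, c5]):
    print(p.shape)
```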
Fig. 5. PANet network structure[81]
Fig. 6. GCWNet network structure[114]
Fig. 7. Module structure of LSKNet[127]
Fig. 8. Detection methods of four anchor-free mechanisms. (a) CornerNet; (b) CenterNet; (c) ExtremeNet; (d) FCOS
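Fig. 8(b) illustrates the center-point route. Below is a minimal sketch of the standard peak-extraction step used by CenterNet-style decoders, where a 3×3 max-pooling pass keeps only local maxima of the class heatmap; the offset branch is omitted and all tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def decode_center_heatmap(heatmap, wh, topk=100):
    """Decode CenterNet-style outputs: heatmap peaks become object centers.

    heatmap: (B, num_classes, H, W) after sigmoid
    wh:      (B, 2, H, W) predicted box width/height at each location
    Returns a (B, topk, 6) tensor of [x1, y1, x2, y2, score, class].
    """
    B, C, H, W = heatmap.shape
    # Keep only local maxima: a peak survives 3x3 max-pooling unchanged
    pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    heatmap = heatmap * (pooled == heatmap).float()

    scores, inds = heatmap.view(B, -1).topk(topk)   # top scores over classes and positions
    classes = inds // (H * W)
    spatial = inds % (H * W)
    ys, xs = (spatial // W).float(), (spatial % W).float()

    # Gather predicted width/height at the selected positions
    wh = wh.view(B, 2, -1).gather(2, spatial.unsqueeze(1).expand(-1, 2, -1))
    w, h = wh[:, 0], wh[:, 1]
    boxes = torch.stack([xs - w / 2, ys - h / 2, xs + w / 2, ys + h / 2], dim=-1)
    return torch.cat([boxes, scores.unsqueeze(-1), classes.float().unsqueeze(-1)], dim=-1)

# Toy usage
dets = decode_center_heatmap(torch.rand(1, 80, 128, 128), torch.rand(1, 2, 128, 128) * 10)
print(dets.shape)  # (1, 100, 6)
```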
Fig. 9. DETR network structure[150]
Fig. 10. Anchor DETR network structure[157]
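The set prediction behind Fig. 9 and Fig. 10 trains with one-to-one bipartite matching between object queries and ground-truth boxes. The sketch below builds a simplified matching cost (negative class probability plus weighted L1 box distance) and solves it with SciPy's Hungarian solver; the cost weights are illustrative and the GIoU term of the full DETR cost is omitted.

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    cost_class=1.0, cost_bbox=5.0):
    """One-to-one matching between N predictions and M ground-truth objects.

    pred_logits: (N, num_classes), pred_boxes: (N, 4) in normalized cxcywh
    gt_labels:   (M,) long,        gt_boxes:  (M, 4) in normalized cxcywh
    Returns (pred_indices, gt_indices) arrays of matched pairs.
    """
    prob = pred_logits.softmax(-1)                      # (N, num_classes)
    class_cost = -prob[:, gt_labels]                    # (N, M): higher prob -> lower cost
    bbox_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M) L1 distance between boxes
    cost = cost_class * class_cost + cost_bbox * bbox_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
    return pred_idx, gt_idx

# Toy usage: 100 object queries, 3 ground-truth objects
pred_logits, pred_boxes = torch.randn(100, 91), torch.rand(100, 4)
gt_labels, gt_boxes = torch.tensor([3, 17, 3]), torch.rand(3, 4)
print(hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes))
```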
Fig. 11. Four image fusion strategies. (a) Early fusion; (b) Mid-level fusion; (c) Late fusion; (d) Confidence fusion[169]
Fig. 12. YOLOFusion network structure[182]
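To make the taxonomy of Fig. 11 concrete, the sketch below contrasts early fusion (stacking RGB and infrared channels before one shared stream) with mid-level fusion (separate streams whose feature maps are concatenated and reduced by a 1×1 convolution). The tiny convolutional stubs and channel sizes are placeholders and do not reproduce the YOLOFusion architecture.

```python
import torch
import torch.nn as nn

def tiny_stream(in_ch):
    # Placeholder backbone stub standing in for a real feature extractor
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

class EarlyFusion(nn.Module):
    """Early fusion: concatenate RGB and IR along the channel axis, then one shared stream."""
    def __init__(self):
        super().__init__()
        self.stream = tiny_stream(in_ch=4)  # 3 RGB channels + 1 IR channel

    def forward(self, rgb, ir):
        return self.stream(torch.cat([rgb, ir], dim=1))

class MidFusion(nn.Module):
    """Mid-level fusion: two streams, feature maps concatenated and fused by a 1x1 conv."""
    def __init__(self):
        super().__init__()
        self.rgb_stream, self.ir_stream = tiny_stream(3), tiny_stream(1)
        self.fuse = nn.Conv2d(128, 64, 1)

    def forward(self, rgb, ir):
        f = torch.cat([self.rgb_stream(rgb), self.ir_stream(ir)], dim=1)
        return self.fuse(f)

rgb, ir = torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256)
print(EarlyFusion()(rgb, ir).shape, MidFusion()(rgb, ir).shape)
```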
Fig. 13. Examples of various datasets. (a) DOTA[13]; (b) AI-TOD[12]; (c) DIOR[8]; (d) VisDrone2019[22]; (e) TT100K[218]; (f) BSTID[219]; (g) TinyPerson[14]; (h) CityPerson[25]; (i) WiderPerson[220]; (j) BIRDSAI[221]; (k) VEDAI[222]; (l) MS COCO[1]
Number | Method | Main content | Year | Publication
1 | CutOut[41] | | 2017 | arXiv
2 | Adaptive Resampling[47] | | 2019 | ICCV
3 | Mosaic[45] | | 2019 | arXiv
Table 1. Data augmentation methods
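As a concrete companion to Table 1, the snippet below sketches CutOut (zeroing a random square patch) and a much-simplified four-image Mosaic stitch; patch size, canvas layout, and the crop-instead-of-resize shortcut are illustrative, and bounding-box remapping for Mosaic is omitted.

```python
import numpy as np

def cutout(image, size=32, rng=np.random.default_rng()):
    """CutOut: zero a random square patch so the model cannot rely on any single region."""
    h, w = image.shape[:2]
    y, x = rng.integers(0, h), rng.integers(0, w)
    y1, y2 = max(0, y - size // 2), min(h, y + size // 2)
    x1, x2 = max(0, x - size // 2), min(w, x + size // 2)
    out = image.copy()
    out[y1:y2, x1:x2] = 0
    return out

def mosaic(images, out_size=640):
    """Simplified Mosaic: tile four images into one canvas (label remapping omitted).

    Stitching several images increases the number of small objects seen per training image."""
    assert len(images) == 4
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        patch = img[:half, :half]  # naive crop as a stand-in for proper resizing
        canvas[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return canvas

img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
print(cutout(img).shape, mosaic([img, img, img, img]).shape)
```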
Number | Method | Main content | Year | Publication
1 | CARAFE[58] | | 2019 | CVPR
2 | Perceptual GAN[68] | | 2017 | CVPR
3 | MTGAN[71] | | 2020 | IJCV
Table 2. Super-resolution methods
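To illustrate the content-aware reassembly idea behind CARAFE[58] from Table 2, the module below predicts a normalized reassembly kernel for every upsampled position and uses it to weight the k×k neighbourhood of the corresponding source pixel. It is a simplified sketch under assumed channel and kernel sizes, not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CarafeLikeUpsample(nn.Module):
    """Content-aware upsampling in the spirit of CARAFE: kernels are predicted from content."""
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, mid, 1)                       # channel compressor
        self.encoder = nn.Conv2d(mid, scale ** 2 * k_up ** 2,             # kernel prediction
                                 k_enc, padding=k_enc // 2)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        B, C, H, W = x.shape
        sH, sW = H * self.scale, W * self.scale
        # 1) Predict a k_up*k_up reassembly kernel per output position, softmax-normalized
        kernels = F.softmax(self.shuffle(self.encoder(self.compress(x))), dim=1)  # (B, k^2, sH, sW)
        # 2) Gather each source pixel's k_up x k_up neighbourhood, repeat it for its s x s outputs
        neigh = F.unfold(x, self.k_up, padding=self.k_up // 2)                    # (B, C*k^2, H*W)
        neigh = F.interpolate(neigh.view(B, C * self.k_up ** 2, H, W),
                              scale_factor=self.scale, mode="nearest")
        neigh = neigh.view(B, C, self.k_up ** 2, sH, sW)
        # 3) Reassemble: weighted sum of the neighbourhood under the predicted kernel
        return (neigh * kernels.unsqueeze(1)).sum(dim=2)

x = torch.randn(1, 128, 32, 32)
print(CarafeLikeUpsample(128)(x).shape)  # expected: (1, 128, 64, 64)
```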
Method | Model | Advantage | Disadvantage
Data Augmentation | MixUp[42], CutMix[43], Mosaic[45] | Increases small-object samples to compensate for the limited visual information of small targets | Relies heavily on specific datasets; may introduce new noise that impairs feature extraction
Super Resolution | CARAFE[58], Perceptual GAN[68], MTGAN[71] | Recovers some small-object detail by learning the relationship between small and large targets | Faces a trade-off between computational load and performance; GANs may generate false artifacts
Multi-scale Feature Perception and Fusion | FPN[76], PANet[78], AFF[88] | Enriches features with deep semantics while retaining the spatial detail of shallow features | Prone to noise interference and added computational burden
Contextual Information Learning | CoupleNet[103], PyramidBox[104], GCWNet[114] | Exploits relations between the target, surrounding objects, and the environment to provide the network with more information | Redundant contextual information can introduce noise
Large Kernel Convolution | ConvNeXt[124], LSKNet[127], YOLO-MS[129] | A larger receptive field effectively captures long-range dependencies and contextual information | Introduces large computational overhead, which hinders real-time detection
Anchor-free | CenterNet[138], FCOS[141], YOLOX[143] | Avoids complex anchor-box computation | Often yields less accurate bounding boxes
DETR | DETR[151], CF-DETR[154], RT-DETR[19] | Avoids complex convolution-based hand-crafted designs and post-processing | Training converges slowly
Dual-mode | Wagner et al.[170], Liu et al.[174], YOLOFusion[182] | Improves detection performance and robustness, especially in complex environments | Increases computational cost and system complexity
Table 3. Summary of advantages and disadvantages of small object detection methods
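For the large-kernel row of Table 3, the block below shows the pattern such designs share: a depthwise convolution with a large spatial kernel (7×7 here, purely illustrative) enlarges the receptive field at low cost, followed by pointwise convolutions and a residual connection. It is a generic sketch, not the selective-kernel mechanism of LSKNet[127].

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Depthwise large-kernel conv + pointwise expansion, a common large-kernel building block."""
    def __init__(self, dim, kernel_size=7, expansion=4):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw1 = nn.Conv2d(dim, dim * expansion, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(dim * expansion, dim, 1)

    def forward(self, x):
        # Residual connection keeps optimization stable as the receptive field grows
        return x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))

x = torch.randn(1, 64, 56, 56)
print(LargeKernelBlock(64)(x).shape)
```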
Model | Backbone | AP | AP0.50 | AP0.75 | APS | APM | APL | Year
FPN[76] | ResNet101 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 | 2017
PANet[84] | ResNeXt101 | 40.0 | 62.8 | 43.1 | 18.8 | 42.3 | 57.2 | 2018
FCOS[140] | ResNet101 | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 | 2019
YOLOX-L[143] | Modified CSP v5 | 50.0 | 68.5 | 54.5 | 29.8 | 54.5 | 64.4 | 2021
QueryDet[209] | ResNeXt101 | 44.7 | 65.6 | 47.4 | 29.1 | 47.5 | 53.1 | 2022
RTMDet-m[128] | CSPDarkNet | 49.3 | 66.9 | 53.9 | 30.5 | 53.6 | 66.1 | 2022
DN-DETR[162] | ResNet101+DC5 | 47.3 | 67.5 | 50.8 | 28.6 | 51.5 | 65.0 | 2022
YOLO-MS[129] | CSPDarkNet | 51.0 | 68.6 | 55.7 | 33.1 | 56.1 | 66.5 | 2023
RT-DETR[19] | ResNet101 | 54.3 | 72.7 | 58.6 | 36.0 | 58.8 | 72.1 | 2023
Note: bold indicates the best result on each metric, underline the second best, and wavy underline the third best.
Table 4. Brief performance evaluation on the MS COCO dataset
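The APS/APM/APL columns in Table 4 follow the standard COCO protocol, which buckets objects by area (small < 32², medium 32²–96², large > 96² pixels). A typical evaluation with pycocotools is sketched below; the annotation and detection file paths are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground-truth annotations and detections in COCO result format
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections_val2017.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, and AP for small/medium/large objects
```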
Model | Backbone | AP0.50 | Year
YOLOv2[40] | DarkNet19 | 25.4 | 2017
CenterNet[138] | ResNet101 | 59.1 | 2019
CADNet[106] | ResNet101 | 69.9 | 2019
SLA[201] | ResNet50 | 76.3 | 2021
PP-YOLOE-R[149] | CSPRepResNet | 80.7 | 2022
RTMDet-L[128] | CSPDarkNet53 | 81.3 | 2022
Info-FPN[98] | ResNet50 | 80.9 | 2023
PCI[115] | ReResNet50 | 80.2 | 2023
Note: bold indicates the best result on each metric, underline the second best, and wavy underline the third best.
Table 5. Brief performance evaluation on the DOTA dataset
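DOTA results such as those in Table 5 are computed on oriented bounding boxes. A common utility when handling such annotations is converting a (cx, cy, w, h, θ) box to its four corner points with a rotation matrix, as sketched below; the angle convention (radians, counter-clockwise) is an assumption that differs between toolkits.

```python
import numpy as np

def obb_to_corners(cx, cy, w, h, theta):
    """Convert an oriented box (center, size, angle in radians) to 4 corner points (4, 2)."""
    # Corners of an axis-aligned box centered at the origin
    corners = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                        [w / 2,  h / 2], [-w / 2,  h / 2]])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return corners @ rot.T + np.array([cx, cy])

print(obb_to_corners(100.0, 50.0, 40.0, 20.0, np.pi / 6))
```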
Model | Backbone | AP | AP0.50 | AP0.75 | APvt | APt | APs | APm | Year
Faster R-CNN[17] | ResNet50 | 12.4 | 28.3 | 8.1 | 0.0 | 8.4 | 26.3 | 36.2 | 2015
Cascade R-CNN[207] | ResNet50 | 14.4 | 32.7 | 10.6 | 0.0 | 9.9 | 28.3 | 39.9 | 2018
FSAF[140] | ResNet50 | 14.4 | 35.3 | 8.4 | 3.4 | 14.4 | 19.9 | 24.2 | 2019
TOOD[145] | ResNet50 | 18.6 | 43.0 | 12.7 | 3.2 | 16.5 | 26.9 | 39.2 | 2021
M-CenterNet[13] | DLA-34 | 14.5 | 40.7 | 6.4 | 6.1 | 15.0 | 19.4 | 20.4 | 2021
Faster R-CNN/NWD[199] | ResNet50 | 20.5 | 51.5 | 12.4 | 5.8 | 20.3 | 25.4 | 35.7 | 2021
Faster R-CNN/RFLA[202] | ResNet50 | 21.1 | 51.6 | 13.1 | 9.5 | 21.2 | 26.1 | 31.5 | 2022
FSANet[95] | ResNet50 | 16.3 | 41.4 | 9.8 | 4.4 | 14.6 | 23.4 | 33.3 | 2022
Faster R-CNN/ADAS-GPM[203] | ResNet50 | 22.3 | 53.7 | 13.5 | 7.1 | 21.9 | 27.5 | 35.1 | 2023
Note: bold indicates the best result on each metric, underline the second best, and wavy underline the third best.
Table 6. Brief performance evaluation on the AI-TOD dataset
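Several AI-TOD entries in Table 6 (NWD[199], RFLA[202]) replace IoU-based assignment with Gaussian-based similarity, because IoU collapses under the small positional perturbations shown in Fig. 3. The sketch below implements the normalized Wasserstein distance, which models each box as a 2D Gaussian; the constant C is dataset-dependent and the value used here is only illustrative.

```python
import numpy as np

def nwd(box_a, box_b, C=12.8):
    """Normalized Wasserstein distance between two boxes given as (cx, cy, w, h).

    Each box is modeled as a 2D Gaussian N((cx, cy), diag(w^2/4, h^2/4)); the squared
    2-Wasserstein distance between such Gaussians has the closed form below, and the
    exponential maps it to a (0, 1] similarity. C is a dataset-dependent constant.
    """
    a = np.array([box_a[0], box_a[1], box_a[2] / 2.0, box_a[3] / 2.0])
    b = np.array([box_b[0], box_b[1], box_b[2] / 2.0, box_b[3] / 2.0])
    w2_squared = np.sum((a - b) ** 2)
    return np.exp(-np.sqrt(w2_squared) / C)

# A 2-pixel shift barely changes NWD for a tiny box, whereas IoU would drop sharply
print(nwd((10, 10, 8, 8), (12, 10, 8, 8)))
```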
Model | $\mathrm{AP}_{50}^{\mathrm{tiny}1}$ | $\mathrm{AP}_{50}^{\mathrm{tiny}2}$ | $\mathrm{AP}_{50}^{\mathrm{tiny}3}$ | $\mathrm{AP}_{50}^{\mathrm{tiny}}$ | $\mathrm{AP}_{50}^{\mathrm{small}}$ | $\mathrm{AP}_{25}^{\mathrm{tiny}}$ | $\mathrm{AP}_{75}^{\mathrm{tiny}}$ | Year
Cascade R-CNN[207] | 45.21 | 60.06 | 65.06 | 57.19 | 70.71 | 76.99 | 8.56 | 2018
FCOS[141] | 3.39 | 12.39 | 29.25 | 16.90 | 35.75 | 40.49 | 1.45 | 2019
Faster RCNN-SPPNet[90] | 47.56 | 62.36 | 66.15 | 59.13 | 71.17 | 79.47 | 8.62 | 2021
FPN-SM[14] | 33.91 | 55.16 | 62.58 | 51.33 | 66.96 | 71.55 | 6.46 | 2021
Faster R-CNN-RFLA[202] | 32.80 | 55.60 | 60.60 | 50.10 | 65.30 | 69.90 | 5.90 | 2022
SODNet[116] | 40.53 | 59.52 | 64.62 | 55.55 | 66.22 | 75.98 | 7.61 | 2022
FENet[97] | 37.02 | 55.03 | 62.44 | 51.33 | 66.92 | 72.81 | 6.20 | 2023
Note: bold indicates the best result on each metric, underline the second best, and wavy underline the third best.
Table 7. Brief performance evaluation on the TinyPerson dataset
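FPN-SM[14] in Table 7 relies on Scale Match, which rescales external training images so that their object-size distribution resembles TinyPerson's. The helper below shows the core rescaling step in a heavily simplified form, matching only the mean object size rather than the full size histogram; the target size is a placeholder.

```python
import numpy as np

def scale_match_resize(image, boxes, target_mean_size=20.0):
    """Resize an image (and its boxes) so the mean object size matches a target value.

    image: (H, W, 3) array, boxes: (N, 4) array of [x1, y1, x2, y2].
    Real Scale Match aligns the full size histogram; matching the mean is a simplification.
    """
    sizes = np.sqrt((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]))
    scale = target_mean_size / max(sizes.mean(), 1e-6)
    new_h, new_w = int(round(image.shape[0] * scale)), int(round(image.shape[1] * scale))
    # Nearest-neighbour resize via index sampling to keep the sketch dependency-free
    ys = np.clip((np.arange(new_h) / scale).astype(int), 0, image.shape[0] - 1)
    xs = np.clip((np.arange(new_w) / scale).astype(int), 0, image.shape[1] - 1)
    resized = image[ys][:, xs]
    return resized, boxes * scale

img = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = np.array([[10, 10, 90, 90], [100, 100, 140, 160]], dtype=float)
out_img, out_boxes = scale_match_resize(img, boxes)
print(out_img.shape, out_boxes)
```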
Model | Small Rec | Small Acc | Small F1 | Medium Rec | Medium Acc | Medium F1 | Large Rec | Large Acc | Large F1 | Year
Perceptual GAN[68] | 89.0 | 84.0 | 86.4 | 96.0 | 91.0 | 93.4 | 89.0 | 91.0 | 89.9 | 2017
FPN[76] | 86.4 | 80.1 | 83.1 | 93.9 | 94.0 | 93.3 | 92.2 | 92.2 | 92.2 | 2017
Noh, et al.[70] | 92.6 | 84.9 | 88.6 | 97.5 | 94.5 | 96.0 | 97.5 | 93.3 | 95.4 | 2019
EFPN[63] | 92.3 | 85.7 | 88.9 | 96.7 | 95.7 | 96.2 | 97.1 | 94.3 | 95.7 | 2021
SODNet[116] | 90.0 | 85.5 | 87.6 | 96.6 | 95.8 | 96.2 | - | - | - | 2022
AFPN[94] | 92.7 | 85.1 | 88.7 | 97.7 | 95.3 | 96.5 | 97.7 | 94.3 | 96.0 | 2022
Note: bold indicates the best result on each metric, underline the second best, and wavy underline the third best.
Table 8. Brief performance evaluation on the TT100K dataset
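The Rec/Acc/F1 columns of Table 8 come from IoU-based matching between detections and ground truth. The helper below computes the three values with a simple greedy match at an IoU threshold of 0.5; it is a generic sketch rather than the exact TT100K evaluation protocol.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall_f1(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedy one-to-one matching: each prediction may claim at most one ground-truth box."""
    matched, tp = set(), 0
    for p in pred_boxes:
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) > best_iou:
                best_iou, best_j = iou(p, g), j
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    precision = tp / max(len(pred_boxes), 1)
    recall = tp / max(len(gt_boxes), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

preds = [[10, 10, 50, 50], [60, 60, 90, 90]]
gts = [[12, 12, 52, 52], [200, 200, 240, 240]]
print(precision_recall_f1(np.array(preds, float), np.array(gts, float)))
```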