
Chinese Optics Letters, Vol. 20, Issue 8, 081101 (2022)
1. Introduction
Object detection and tracking have been crucial concerns in both 2D and 3D[3,4] computer vision. For multiple object tracking (MOT) in unmanned aerial vehicle (UAV) videos, drastic scale changes and small objects make the task especially challenging.
Online MOT algorithms can be divided into two-step and one-shot approaches. The two-step approach[1,2] first detects objects and then extracts Re-ID embeddings with a separate model for association, which is accurate but computationally heavy; the one-shot approach[6,7] performs detection and embedding extraction in a single network, which favors real-time speed.
Previous efforts have been made to solve these problems. IPGAT[8] fuses individual and global motion with a conditional GAN to stabilize association in UAV videos. However, such tracking-by-detection methods remain far from real time.
In this Letter, we are committed to coping with both aforementioned problems simultaneously. To address the problem of scale changes, we design a feature-aligned attention network (FAANet), which is mainly composed of two modules: the channel and spatial attention (CSA) module and the feature-aligned aggregation (FAA) module. The CSA module adaptively enhances multi-scale features, and the FAA module successively generates alignment biases between features of two different resolutions. FAANet integrates multi-scale features to improve robustness to object scale changes. To meet the real-time requirement, we adopt the joint-detection-embedding (JDE) paradigm[6,7], which predicts detection boxes and Re-ID embeddings in a single network, and we further accelerate inference with a structural re-parameterization technique.
The major contributions of this paper are summarized as follows. (1) We propose FAANet to enhance and aggregate multi-scale features so as to cope with drastic scale changes in UAV videos. (2) We introduce the JDE paradigm to MOT in UAV videos and use the structural re-parameterization technique to increase the network inference speed. (3) Extensive experiments are conducted on the UAV detection and tracking (UAVDT)[9] dataset, demonstrating the accuracy and real-time speed of the proposed method.
2. Methods
In this section, we first present the architecture of the proposed FAANet and then explain the details of the CSA module, FAA module, and online inference.
2.1. Overview
As shown in Fig. 1, the framework of the proposed FAANet contains four components: a feature extractor backbone, a feature fusion neck, detection and Re-ID prediction heads, and online tracking association. We adopt RepVGG[10] as the backbone to extract multi-scale features; the CSA and FAA modules in the neck enhance and aggregate these features; the heads predict detection boxes and Re-ID embeddings; and the association stage links detections into tracklets online.
Figure 1. Architecture of the proposed FAANet tracking framework. The framework contains four components: backbone (RepVGG), neck (CSA + FAA), head (Re-ID + detection), and association.
2.2. Channel and spatial attention module
In general, the input and output dimensions of channel attention (CA) or spatial attention (SA) are the same[11,12]. In contrast, our CSA module enhances features while simultaneously reducing the channel dimension, as illustrated in Fig. 2.
Figure 2. Architecture of the CSA module.
Specifically, let one output of the backbone be $X \in \mathbb{R}^{C \times H \times W}$. Following ECA-Net[11], the channel attention squeezes $X$ by global average pooling and then applies a one-dimensional convolution across channels, followed by a sigmoid function, to obtain the channel weights.
In contrast to channel attention, which focuses on the channel dimension, spatial attention mainly focuses on the height and width dimensions. Inspired by polarized self-attention[12], our spatial attention collapses the channel dimension to estimate a full-resolution spatial weight map, which re-weights the features at every position.
Because the channel attention adopts a one-dimensional convolution instead of the usual fully connected layers, the number of parameters is greatly reduced and the inference speed is improved. The CSA module also reduces the number of channels to 64, which further cuts the parameter count for real-time performance and provides channel-consistent inputs to the FAA module.
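To make the structure concrete, the following PyTorch sketch shows one plausible realization of a CSA-style block. The ECA-style one-dimensional channel convolution and the 64-channel output follow the description above; the single-map spatial branch and the kernel sizes are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CSA(nn.Module):
    """Channel and spatial attention sketch (layer choices are assumptions).

    Channel branch: ECA-style global average pooling + 1D convolution,
    replacing the usual fully connected layers, as described in Sec. 2.2.
    Spatial branch: a 1x1 convolution collapses channels into a single
    H x W weight map (our simplification of polarized self-attention).
    A final 1x1 convolution reduces the output to 64 channels for the
    FAA module, as stated in the text.
    """

    def __init__(self, in_channels: int, out_channels: int = 64, k: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.spatial = nn.Conv2d(in_channels, 1, kernel_size=1, bias=False)
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: pool to (B, C, 1, 1), run a 1D conv across channels.
        w_c = self.avg_pool(x).view(b, 1, c)
        w_c = self.sigmoid(self.conv1d(w_c)).view(b, c, 1, 1)
        # Spatial attention: one H x W weight map shared by all channels.
        w_s = self.sigmoid(self.spatial(x))
        # Enhance the features, then reduce to a channel-consistent 64-d output.
        return self.reduce(x * w_c * w_s)
```

For instance, `CSA(256)` maps a `(1, 256, 64, 64)` tensor to `(1, 64, 64, 64)`, matching the channel-consistent input the FAA module expects.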
2.3. Feature-aligned aggregation module
Traditional multi-scale feature aggregation usually adopts bi-linear interpolation up-sampling followed by element-wise summation. However, because the features are spatially misaligned, the aggregated features degrade, which weakens the ability to locate objects at different scales. To solve this, we adopt the feature-aligned mechanism of AlignSeg[13] and tailor it into our FAA module.
The FAA module is shown in Fig. 3. Specifically, let the two corresponding outputs of the CSA module be $X_1 \in \mathbb{R}^{64 \times H \times W}$ and $X_2 \in \mathbb{R}^{64 \times \frac{H}{2} \times \frac{W}{2}}$. The module up-samples $X_2$ to the resolution of $X_1$, predicts a two-dimensional alignment offset field from the concatenation of the two features, warps the up-sampled feature according to the offsets, and finally aggregates the aligned features by element-wise summation.
Figure 3. Architecture of the FAA module.
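A minimal sketch of this offset-based alignment, in the spirit of AlignSeg[13], is given below; the single 3 × 3 convolution offset head and the use of `grid_sample` for warping are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAA(nn.Module):
    """Feature-aligned aggregation sketch (layer choices are assumptions).

    Takes a high-resolution and a low-resolution feature (both 64 channels,
    per Sec. 2.2), predicts a 2D offset field from their concatenation,
    warps the up-sampled low-resolution feature with the offsets, and
    aggregates by element-wise summation.
    """

    def __init__(self, channels: int = 64):
        super().__init__()
        self.offset = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        b, _, h, w = high.shape
        up = F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)
        delta = self.offset(torch.cat([high, up], dim=1))  # (B, 2, H, W)
        # Build a normalized sampling grid in [-1, 1] and add the offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=high.device),
            torch.linspace(-1, 1, w, device=high.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
        grid = base + delta.permute(0, 2, 3, 1)  # offsets in normalized grid units
        aligned = F.grid_sample(up, grid, mode="bilinear", align_corners=False)
        return high + aligned  # element-wise aggregation of aligned features
```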
2.4. Online inference
Herein, the two important components of online inference are data association and structural re-parameterization, which we illustrate in turn.
We follow the standard online tracking algorithm to associate boxes. As shown in Fig. 4, we first initialize a few tracklets based on the estimated boxes in the first frame and use a Kalman filter to predict the locations of the tracklets in the next frame. We perform two Hungarian matchings between detections and tracklets sequentially. The first matching considers the appearance information (Re-ID embedding) measured by cosine distance and the motion information measured by Mahalanobis distance. The second matching only considers intersection over union (IOU) distance, which is simple but useful. Finally, we initialize new tracklets that meet confidence thresholds and mark the lost tracklets.
Figure 4. Procedure of association between detections and tracklets.
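The sketch below condenses this two-stage association, assuming precomputed cosine, Mahalanobis, and IOU distance matrices; the fusion weight and the matching thresholds are illustrative values, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def two_stage_match(cos_dist, maha_dist, iou_dist, w=0.98, thr1=0.4, thr2=0.5):
    """Two-stage Hungarian association sketch.

    cos_dist, maha_dist, iou_dist: (num_tracks, num_dets) distance matrices.
    Returns matches plus unmatched track/detection indices.
    """
    n_t, n_d = cos_dist.shape
    matches, used_t, used_d = [], set(), set()

    # Stage 1: fuse appearance (cosine) and motion (Mahalanobis) distances.
    fused = w * cos_dist + (1.0 - w) * maha_dist
    for t, d in zip(*linear_sum_assignment(fused)):
        if fused[t, d] < thr1:
            matches.append((t, d))
            used_t.add(t)
            used_d.add(d)

    # Stage 2: match the leftovers by IOU distance (1 - IOU) only.
    rest_t = [t for t in range(n_t) if t not in used_t]
    rest_d = [d for d in range(n_d) if d not in used_d]
    if rest_t and rest_d:
        sub = iou_dist[np.ix_(rest_t, rest_d)]
        for i, j in zip(*linear_sum_assignment(sub)):
            if sub[i, j] < thr2:
                matches.append((rest_t[i], rest_d[j]))
                used_t.add(rest_t[i])
                used_d.add(rest_d[j])

    unmatched_t = [t for t in range(n_t) if t not in used_t]
    unmatched_d = [d for d in range(n_d) if d not in used_d]
    return matches, unmatched_t, unmatched_d
```

Unmatched detections above a confidence threshold then start new tracklets, and unmatched tracklets are marked as lost, as described above.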
Structural re-parameterization is used in RepVGG[10]. During training, each block contains three parallel branches: a 3 × 3 convolution, a 1 × 1 convolution, and an identity mapping, each followed by batch normalization. At inference time, as shown in Fig. 5, the three branches are equivalently merged into a single 3 × 3 convolution, which keeps accuracy unchanged while improving speed.
Figure 5. Structural re-parameterization of a RepVGG block.
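The merging relies on two facts: a convolution followed by batch normalization folds into a single convolution, and a sum of parallel branches with compatible kernels is again one convolution. A sketch is given below; the branch attribute names (`conv3`, `bn3`, `conv1`, `bn1`, `bn_id`) are hypothetical and assume equal input and output channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv_w, bn):
    """Fold a BatchNorm into the preceding convolution's weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    w = conv_w * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

@torch.no_grad()
def reparameterize(block):
    """Merge the 3x3, 1x1, and identity branches of a RepVGG-style block
    into one 3x3 convolution (attribute names are our assumptions)."""
    w3, b3 = fuse_conv_bn(block.conv3.weight, block.bn3)
    w1, b1 = fuse_conv_bn(block.conv1.weight, block.bn1)
    w1 = F.pad(w1, [1, 1, 1, 1])  # zero-pad the 1x1 kernel to 3x3
    # Identity branch == a 3x3 conv with a one-hot kernel per channel.
    c = w3.shape[0]
    wid = torch.zeros_like(w3)
    for i in range(c):
        wid[i, i, 1, 1] = 1.0
    wid, bid = fuse_conv_bn(wid, block.bn_id)
    # A sum of parallel linear branches is itself one linear (3x3) layer.
    fused = nn.Conv2d(c, c, kernel_size=3, padding=1)
    fused.weight.data = w3 + w1 + wid
    fused.bias.data = b3 + b1 + bid
    return fused
```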
3. Experiment
3.1. Datasets and metrics
We evaluate our method on the UAVDT[9] dataset, which consists of UAV video sequences captured at various altitudes, view angles, and weather conditions. We report the standard CLEAR MOT metrics together with identity-based metrics: MOTA, IDF1, MOTP, mostly tracked (MT), mostly lost (ML), false positives (FP), false negatives (FN), identity switches (IDS), fragmentations (FM), and frames per second (FPS).
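For reference, the two headline metrics follow their standard definitions (not specific to this paper):

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FP}_t + \mathrm{FN}_t + \mathrm{IDS}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}},
```

where $\mathrm{GT}_t$ is the number of ground-truth boxes in frame $t$, and IDTP, IDFP, and IDFN count identity-level true positives, false positives, and false negatives.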
3.2. Implementation details
We choose RepVGG[10]-B0 as the backbone of FAANet.
3.3. Experiment analysis
We evaluate our FAANet together with various classic and recent algorithms, including CEM[14], CMOT[15], GOG[16], IOUT[17], MDP[18], SMOT[19], SORT[1], DeepSORT[2], DeepAlign[20], SBMA[21], IPGAT[8], M-CMSN-M, and Quadruplet[22]. The quantitative results on the UAVDT test dataset are reported in Table 1.
Since most of the current UAV-based MOT algorithms follow the tracking-by-detection (TBD) paradigm, many do not publish their speed, or publish only the speed of the association phase, which makes speed comparisons difficult and ambiguous. We collect the publicly available speed and accuracy figures, which are shown in Fig. 6. Note that the three compared algorithms with reported FPS in Table 1 (DeepSORT, DeepAlign, and M-CMSN-M) count only the time consumed in the association phase. Even so, the speed of our FAANet is 60 times higher than that of the current state-of-the-art M-CMSN-M (38.24 FPS versus 0.64 FPS), while FAANet also achieves the best MOTA and IDF1.
| MOT Methods | Year | Framework | MOTA | IDF1 | MOTP | MT | ML | FP | FN | IDS | FM | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SORT[1] | 2016 | Faster RCNN | 39.0 | 43.7 | 74.3 | 33.9 | 28.0 | 33,037 | 172,628 | 2350 | 5787 | N/A |
| DeepSORT[2] | 2017 | Faster RCNN | 40.7 | 58.2 | 73.2 | 41.7 | 23.7 | 44,868 | 155,290 | 2061 | 6432 | 15.01 |
| DeepAlign[20] | 2018 | Faster RCNN | 41.6 | 49.0 | 73.3 | 43.7 | 24.3 | 45,420 | 152,224 | 1546 | N/A | 0.23 |
| SBMA[21] | 2019 | LSTM | 38.6 | 48.5 | 72.1 | 38.9 | 24.4 | 44,724 | 160,950 | 3489 | 11,796 | N/A |
| IPGAT[8] | 2020 | LSTM + CGAN | 39.0 | 49.4 | 72.2 | 37.4 | 25.2 | 42,135 | 163,837 | 2091 | 10,057 | N/A |
| M-CMSN-M | 2020 | Faster RCNN | 43.1 | 62.6 | 73.5 | 45.3 | 22.7 | 45,900 | 147,638 | 4259 | N/A | 0.64 |
| Quadruplet[22] | 2021 | Faster RCNN | 40.3 | 55.0 | 74.0 | N/A | N/A | N/A | 150,837 | 1091 | 3057 | N/A |
| FAANet (ours) | 2022 | RepVGG + JDE | 44.0 | 64.6 | N/A | N/A | N/A | 57,146 | N/A | 403 | 7202 | 38.24 |
Table 1. Results of a Quantitative Comparison among Classic MOT Methods and Recent UAV-Based Methods on the UAVDT Test Dataset
Figure 6. MOTA-IDF1-FPS comparison with other UAV-based MOT trackers on the UAVDT test dataset. The horizontal axis is FPS, the vertical axis is MOTA, and the radius of the circle is IDF1.
The comparison based on scene attributes is shown in Fig. 7. Our algorithm outperforms the other algorithms in the high-alt, bird-view, and fog scenes, which demonstrates the effectiveness of the proposed modules.
Figure 7. IDF1 comparison with other UAV-based MOT trackers on the UAVDT test dataset based on scene attributes. The IDF1 of FAANet is marked outside the circle.
3.4. Ablation experiments
To validate the effectiveness of the CA, SA, and FAA modules, we introduce a baseline: RepVGG-B0 with the re-parameterization technique. The baseline reduces the feature dimension with plain convolutions in place of the CSA module and aggregates multi-scale features by bi-linear up-sampling and element-wise summation in place of the FAA module. The results are reported in Table 2.
| RepVGG-B0 | CA | SA | FAA | MOTA | IDF1 | FPS |
|---|---|---|---|---|---|---|
| ✓ | | | | 38.2 | 56.8 | N/A |
| ✓ | ✓ | | | 39.7 | 59.2 | 43.52 |
| ✓ | | ✓ | | 39.3 | 59.4 | 43.41 |
| ✓ | ✓ | ✓ | | 40.4 | 60.2 | 41.35 |
| ✓ | | | ✓ | 42.1 | 63.7 | 40.54 |
| ✓ | ✓ | ✓ | ✓ | 44.0 | 64.6 | 38.24 |
Table 2. Evaluation of the Critical Factors in FAANet
As shown in Table 3, we illustrate the speed improvement of the re-parameterization technique. It decreases the number of model parameters from 15.9 × 10⁶ to 14.4 × 10⁶ and the number of floating-point operations (FLOPs) from 62.3 × 10⁹ to 58.3 × 10⁹. Overall, it increases the speed from 30.32 to 38.24 frames per second (FPS), a 26% improvement without any loss of accuracy.
| Rep | Params (×10⁶) | FLOPs (×10⁹) | MOTA | IDF1 | FPS |
|---|---|---|---|---|---|
| w/o | 15.9 | 62.3 | 44.0 | 64.6 | 30.32 |
| w/ | 14.4 | 58.3 | 44.0 | 64.6 | 38.24 |
Table 3. The Improvement from the Re-parameterization Technique
3.5. Visualization results
Figure 8 visualizes tracking results of DeepSORT[2] and our FAANet on several typical scenes from the UAVDT test dataset.
Figure 8. Examples and comparison of tracking results between DeepSORT and FAANet on the UAVDT test dataset.
The vertical numbers denote the number of objects tracked in the three frames. On average, FAANet tracks 29% more objects than the classical DeepSORT[2], which demonstrates its robustness to scale changes and small objects.
4. Conclusions
In this Letter, we propose FAANet for MOT in UAV videos. Experimental results demonstrate that our method copes better with scale changes and small objects while running in real time. We hope that its high accuracy and fast speed make it attractive for industrial applications.
References
[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft. Simple online and realtime tracking. IEEE International Conference on Image Processing, 3464(2016).
[2] N. Wojke, A. Bewley, D. Paulus. Simple online and realtime tracking with a deep association metric. IEEE International Conference on Image Processing, 3645(2017).
[3] Q. Qian, Y. Hu, N. Zhao, M. Li, F. Shao, X. Zhang. Object tracking method based on joint global and local feature descriptor of 3D LIDAR point cloud. Chin. Opt. Lett., 18, 061001(2020).
[4] J. Dai, L. Huang, K. Guo, L. Ling, H. Huang. Reflectance transformation imaging of 3D detection for subtle traces. Chin. Opt. Lett., 19, 031101(2021).
[5] D. Ramanan, D. A. Forsyth. Finding and tracking people from the bottom up. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1(2003).
[6] Z. Wang, L. Zheng, Y. Liu, Y. Li, S. Wang. Towards real-time multi-object tracking. European Conference on Computer Vision, 107(2020).
[7] Y. Zhang, C. Wang, X. Wang, W. Zeng, W. Liu. FairMOT: on the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis., 129, 3069(2021).
[8] H. Yu, G. Li, L. Su, B. Zhong, Q. Huang. Conditional GAN based individual and global motion fusion for multiple object tracking in UAV videos. Pattern Recognit. Lett., 131, 219(2020).
[9] H. Yu, G. Li, W. Zhang, Q. Huang, D. Du, Q. Tian, N. Sebe. The unmanned aerial vehicle benchmark: object detection, tracking and baseline. Int. J. Comput. Vis., 128, 1141(2020).
[10] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun. RepVGG: making VGG-style ConvNets great again. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 13728(2021).
[11] Q. Wang, B. Wu, P. Zhu, P. Li, Q. Hu. ECA-Net: efficient channel attention for deep convolutional neural networks. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 11531(2020).
[12] H. Liu, F. Liu, X. Fan, D. Huang. Polarized self-attention: towards high-quality pixel-wise regression. arXiv:2107.00782 (2021).
[13] Z. Huang, Y. Wei, X. Wang, H. Shi, T. S. Huang. AlignSeg: feature-aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell., 44, 550(2021).
[14] A. Milan, S. Roth, K. Schindler. Continuous energy minimization for multitarget tracking. IEEE Trans. Pattern Anal. Mach. Intell., 36, 58(2013).
[15] S. H. Bae, K. J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1218(2014).
[16] H. Pirsiavash, D. Ramanan, C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1201(2011).
[17] E. Bochinski, V. Eiselein, T. Sikora. High-speed tracking-by-detection without using image information. IEEE International Conference on Advanced Video and Signal Based Surveillance, 1(2017).
[18] Y. Xiang, A. Alahi, S. Savarese. Learning to track: online multi-object tracking by decision making. IEEE International Conference on Computer Vision, 4705(2015).
[19] C. Dicle, O. I. Camps, M. Sznaier. The way they move: tracking multiple targets with similar appearance. IEEE International Conference on Computer Vision, 2304(2013).
[20] Q. Zhou, B. Zhong, Y. Zhang, J. Li, Y. Fu. Deep alignment network based multi-person tracking with occlusion and motion reasoning. IEEE Trans. Multimed., 21, 1183(2018).
[21] H. Yu, G. Li, W. Zhang, H. Yao, Q. Huang. Self-balance motion and appearance model for multi-object tracking in UAV. Proceedings of the ACM Multimedia Asia, 1(2019).
[22] H. U. Dike, Y. Zhou. A robust quadruplet and faster region-based CNN for UAV video-based multiple object tracking in crowded environment. Electronics, 10, 795(2021).
