
- Infrared and Laser Engineering
- Vol. 50, Issue 5, 20200364 (2021)
0 Introduction
Infrared-visible image patch matching is a fundamental task in infrared-visible image processing. It compares objects or regions by analyzing the similarity of content, features, structures, relationships, textures, and grayscales between infrared and visible images. Infrared-visible image matching is often used as a subroutine that plays an important role in a wide variety of applications, such as visual navigation[1], image registration and fusion[2-3], and target recognition[4].
Infrared-visible image patch matching is more challenging than matching within traditional visible images. Since infrared and visible sensors use different imaging principles, images taken by the two sensors differ more than images taken by a single sensor: the edges of objects are blurred in infrared images, objects carry less texture and no color information, and infrared-visible image pairs show significant grayscale distortion and illumination change.
Hand-crafted descriptors such as SIFT[5], SURF[6], and ORB[7] are widely used to extract features, and several variants of them have been adapted to infrared-visible matching[8-12].
Hand-crafted descriptors must be continuously redesigned for new applications to extract efficient features. Moreover, feature extraction and similarity measurement are two independent, unrelated stages, so they cannot be optimized end-to-end. With the widespread application of deep learning in computer vision, image patch matching based on deep learning has become a trend. MatchNet[13] unifies feature and metric learning in a single patch-based matching network, and later work[14-15] explores two-channel and Siamese architectures for comparing image patches.
This paper proposes an infrared-visible image deep matching network (InViNet) to tackle the challenges above. Two CNN branches extract the infrared and visible image features independently, and fully connected layers compare their similarity.
In infrared-visible image patch matching, we argue that the differences between unrelated patches remain more significant than those within similar patches, even though the infrared and visible images are taken by different sensors. The feature extraction subnetwork therefore uses the contrastive loss and the triplet loss to maximize the feature distance between unrelated patches and minimize it within similar patches. This makes the distribution of high-level features more compact within classes and more separated between classes.
For infrared images, regions and shapes still provide essential cues for infrared-visible image matching, so integrating spatial features with semantic features is necessary. We combine multi-scale spatial features with high-level features to enhance performance. Compared with previous CNNs, our method increases the accuracy from 78.95% to 88.75%.
1 Infrared-visible image patches matching network
1.1 Overview of our network architecture
Our network mainly consists of two parts: the feature extraction network and the metric network, as shown in Fig.1. The feature extraction network is responsible for extracting features from infrared and visible images. The metric network measures the similarity of the extracted features.
Figure 1.Infrared-visible image deep matching network. The black line with the arrow indicates the data-flow. The blue lines represent shortcut connections through the reshape layers. This figure describes the process of the infrared-visible image patches matching
The feature extraction network extracts the distinguishing features of visible and infrared images. In the feature extraction network, the infrared and visible images are input into two VGG16[16] branches to extract features.
Although the imaging principles of infrared and visible images differ, the same target is very similar in semantic features. Therefore, the branches share network weights in our design. We believe that deep convolutional networks have strong feature representation capacity and can extract the features common to infrared and visible images. Network branches trained with the contrastive loss or the triplet loss traditionally share weights; the shared weights map high-level features into the same feature space for distance comparison.
The metric network is composed of two FC layers with the softmax loss as the objective function. It estimates the probability that the visible image and the infrared image match. Ideally, the prediction is 1 if they match and 0 if they do not.
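The decision stage described above can be sketched in plain NumPy. The layer sizes and random weights below are purely hypothetical stand-ins for the trained FC layers; only the structure (concatenate, FC, FC, softmax) follows the description:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def metric_network(feat_ir, feat_vis, W1, b1, W2, b2):
    """Two fully connected layers with a softmax output.

    Returns [p_no_match, p_match] for one infrared/visible feature pair.
    """
    x = np.concatenate([feat_ir, feat_vis])   # joint feature vector
    h = np.maximum(W1 @ x + b1, 0.0)          # FC1 + ReLU
    return softmax(W2 @ h + b2)               # FC2 + softmax

# Tiny demo with hypothetical dimensions and random (untrained) weights
rng = np.random.default_rng(0)
f_ir, f_vis = rng.normal(size=4), rng.normal(size=4)
W1, b1 = rng.normal(size=(6, 8)), np.zeros(6)
W2, b2 = rng.normal(size=(2, 6)), np.zeros(2)
p = metric_network(f_ir, f_vis, W1, b1, W2, b2)
```

In the real network the input is the concatenated multi-scale features of the two branches and the weights come from the two-stage training in Sec 2.2.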
1.2 Multi-scale spatial feature integration
Compared with visible images, infrared images have no color and less texture information. The edges are usually blurred. However, the objects still have rough outlines and region information in infrared images. These outlines and shapes are common features in visible and infrared images. Therefore, we believe that their spatial information is essential in infrared images for image matching. It is necessary to integrate the spatial features with the semantic features to enhance feature representation.
On the other hand, the hierarchical structure of a deep neural network makes it natural to extract features at multiple scales. The features extracted by the low-level layers are similar to those extracted by hand-crafted descriptors such as SIFT and SURF. As the CNN layers deepen, the feature maps focus less on the imaging differences, and semantic features gradually emerge in the high-level layers. In our network, the multi-scale features are input into the metric network, so the metric network can use more comprehensive information to make similarity decisions. Each block in our network connects directly to the input of the metric network, which preserves more multi-scale spatial information for similarity comparison, as shown in Fig.2.
Figure 2.Multi-scale spatial feature integration in a single branch. The output feature map in each block shorts to the concatenation layer. The output of the concatenation layer is one input of the metric network
In multi-scale spatial feature extraction, two problems need to be solved. Firstly, the shortcut feature should maintain the original feature map size in each block to preserve spatial information. Secondly, the shortcut feature dimension should not be too high after it is reshaped into a vector; an excessive dimension results in a huge number of parameters and high computation in the metric network.
The 1×1 convolution is adopted in our network to solve both problems. Widely used in GoogLeNet[17], the 1×1 convolution reduces the number of channels while keeping the spatial resolution of the feature map unchanged.
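A 1×1 convolution is just a per-pixel linear map over the channel axis, which is why it preserves spatial layout while shrinking the shortcut dimension. The following NumPy sketch illustrates this; the channel counts and feature-map size are hypothetical, not the paper's actual configuration:

```python
import numpy as np

def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution: a linear map over channels at every pixel.

    feature_map: (C_in, H, W); weights: (C_out, C_in).
    The spatial size H x W is preserved; only the channel count changes.
    """
    c_in, h, w = feature_map.shape
    flat = feature_map.reshape(c_in, h * w)   # (C_in, H*W)
    out = weights @ flat                      # (C_out, H*W)
    return out.reshape(-1, h, w)              # (C_out, H, W)

rng = np.random.default_rng(1)
fmap = rng.normal(size=(256, 28, 28))         # hypothetical block output
w = rng.normal(size=(16, 256))                # reduce 256 -> 16 channels
reduced = conv1x1(fmap, w)                    # spatial info intact
shortcut = reduced.reshape(-1)                # flattened vector for the metric network
```

Reducing 256 channels to 16 here shrinks the flattened shortcut vector by a factor of 16, which is exactly the parameter saving the 1×1 convolution buys in the metric network.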
1.3 Two shared branches in feature extraction network
As shown in Fig.3, the feature extraction network consists of two branches that are identical in structure and share weights. A visible image and an infrared image make up an image pair. The contrastive loss was first proposed for dimensionality reduction[18].
Figure 3.(a) Feature extraction network architecture with the contrastive loss; (b) Input data for feature extraction network with the contrastive loss. The visual patches are in the first row. The infrared patches are in the second row. The positive samples are in odd columns. The negative ones are in even columns
The contrastive loss is shown in Eq.(1):

$$L = \frac{1}{2N}\sum_{i=1}^{N}\left[ y_i d_i^2 + (1-y_i)\max(m - d_i,\ 0)^2 \right] \tag{1}$$

where $d_i$ is the Euclidean distance between the features output by the two branches for the $i$-th pair, $y_i$ is the ground truth ($y_i = 1$ for a similar pair, $y_i = 0$ otherwise), $m$ is the margin, and $N$ is the number of pairs.
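A minimal NumPy sketch of the contrastive loss for a single pair, following the standard Hadsell et al. formulation cited above (the margin value is illustrative):

```python
import numpy as np

def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss for one feature pair.

    y = 1 for a similar (infrared/visible) pair, y = 0 for a dissimilar one.
    Similar pairs are pulled together; dissimilar pairs are pushed apart
    until their distance exceeds the margin.
    """
    d = np.linalg.norm(f1 - f2)
    return 0.5 * (y * d**2 + (1 - y) * max(margin - d, 0.0) ** 2)

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])   # identical to a
c = np.array([0.0, 3.0])   # far from a

print(contrastive_loss(a, b, y=1))   # similar pair, zero distance -> 0.0
print(contrastive_loss(a, c, y=0))   # dissimilar pair beyond the margin -> 0.0
```

Note the asymmetry: a similar pair is penalized by its squared distance with no cap, while a dissimilar pair contributes nothing once its distance exceeds the margin.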
1.4 Triplet shared branches in the feature extraction network
As shown in Fig.4(a), the network consists of three branches that are identical in structure and share weights. A visible patch (anchor sample), an infrared patch (positive sample), and another infrared patch (negative sample) form a triple. We input one triple at a time to train the feature extraction network. The triplet loss was first used for face recognition[19].
Figure 4.(a) Feature extraction network architecture with the triplet loss; (b) Input data for feature extraction network with the triplet loss. The anchor patches are in the first row. The positive patches are in the second row. The negative patches are in the third row. Each column is triple patches input
The triplet loss is shown in Eq.(2):

$$L = \sum_{i=1}^{N}\max\left( \left\| f(x_i^a)-f(x_i^p) \right\|_2^2 - \left\| f(x_i^a)-f(x_i^n) \right\|_2^2 + \alpha,\ 0 \right) \tag{2}$$

The input data include the anchor sample $x^a$ (a visible patch), the positive sample $x^p$ (an infrared patch similar to the anchor), and the negative sample $x^n$ (an infrared patch dissimilar to the anchor), where $f(\cdot)$ denotes the feature embedding and $\alpha$ is the margin. Eq.(3) illustrates that there is a margin $\alpha$ between the anchor-positive distance and the anchor-negative distance:

$$\left\| f(x^a)-f(x^p) \right\|_2^2 + \alpha < \left\| f(x^a)-f(x^n) \right\|_2^2 \tag{3}$$
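The triplet loss for a single triple can be sketched in a few lines of NumPy, following the FaceNet formulation cited above (the margin and the toy feature vectors are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss for one visible/infrared triple.

    Encourages the anchor-positive distance to be smaller than the
    anchor-negative distance by at least the margin alpha.
    """
    d_ap = np.sum((anchor - positive) ** 2)   # squared anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # squared anchor-negative distance
    return max(d_ap - d_an + alpha, 0.0)

anc = np.array([0.0, 0.0])   # visible patch feature (anchor)
pos = np.array([0.1, 0.0])   # infrared patch of the same target
neg = np.array([1.0, 1.0])   # unrelated infrared patch

print(triplet_loss(anc, pos, neg))   # margin already satisfied -> 0.0
```

When the margin in Eq.(3) is satisfied the loss is zero, so training effort concentrates on the triples that still violate it.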
2 Experiment
2.1 Data set
There are no infrared-visible image patch matching datasets available on the Internet, so we collected image pairs ourselves. In data acquisition, the visible camera is the default equipment of the DJI UAV, and the infrared camera is manufactured by FLIR; its wavelength ranges from 7.5 to 13.5 µm. In terms of image resolution, the UAV acquires infrared and visible images at different altitudes, so the proportion of the same target to the image size in the original images is 0.8×, 0.5×, and 0.25×, respectively. In preprocessing, we crop the target area from the original images and resize the network input images to 224×224. Therefore, images of different resolutions are used during training and testing.
Our data set contains 2 000 images falling into 25 classes. For scene selection, the targets taken by the UAV should differ in shape and outline. The classes cover bridges, buildings, roads, parking lots, factories, houses, towers, gas storage tanks, etc., as shown in Fig.5. In the data set, the ratio of visible to infrared images is 1∶1. 80% of the images are used as training data, and the rest as test data. A sample includes an infrared patch and a visible patch: if the pair is similar, it is a positive sample with ground truth 1; otherwise it is a negative sample with ground truth 0. In both the training and test sets, the ratio of positive to negative samples is 1∶1.
Figure 5.Infrared-visible image samples. Ten image pairs were randomly selected. The ground truth of the first five columns is 0; the ground truth of the last five columns is 1
2.2 Experiment method
InViNet with two-stage training performs better than a traditional classification network trained in one stage. In two-stage training, the feature network first improves the feature representation, which significantly increases the accuracy of the metric network in the later stage. By comparing InViNet with and without shortcut connections, we confirm that low-level spatial features are a useful complement to high-level semantic information. We use the following settings to train our network in two stages.
The feature extraction network is trained in the first stage. The branches of the feature extraction network are initialized with VGG16 weights pre-trained on the ImageNet data set, and the newly added layers are initialized with the Xavier method[20].
The metric network and the shortcut connections are trained in the second stage. In metric network training, the well-trained weights of the feature network are used as initial values, and the branch weights change only slightly: their learning-rate multipliers are less than 10⁻². The base learning rate is 10⁻³. The new layers are initialized with the Xavier method, and their learning-rate multipliers are 1 in the metric network and the shortcut layers. The number of epochs is set to 2 500. The rest of the training parameters are the same as in the first stage.
All experiments run on a computer equipped with Nvidia TITAN XP GPU. Our experiment is implemented with Caffe.
2.3 Experimental result
To validate our approach, we implemented the following experiments on different network architectures.
(1) Traditional methods[5-7]. Hand-crafted descriptors (SIFT, SURF, ORB) with descriptor-distance matching.
(2) Baseline Network. MatchNet[13] trained from scratch on our data set.
(3) MatchNet[13] with fine-tuning. The branches are initialized with pre-trained weights.
(4) Pseudo-SiamNet[15] with fine-tuning. The two branches do not share the weights of the low-level convolution layers.
(5) InViNet (F+C). InViNet with fine-tuning and contrastive loss. We trained this network in two phases, which are described in Sec 2.2.
(6) InViNet (F+C+S). InViNet with fine-tuning, contrastive loss, and shortcut connection. The network adds shortcut connections.
(7) InViNet (F+T+S). InViNet with fine-tuning, triplet loss, and shortcut connection. This network mainly compares the triplet loss with the contrastive loss.
The ROC curve is used to measure binary classification performance because it is insensitive to the imbalance between positive and negative samples. The commonly used evaluation metric is the false positive rate at 95% recall (FPR95, also written Error@95%); the lower, the better. ROC curves for the different methods, drawn from the experimental results, are shown in Fig.6.
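The FPR95 metric can be computed directly from the match scores without drawing the full curve. Below is a minimal NumPy sketch; the score/label arrays are toy data, not the paper's results:

```python
import numpy as np

def fpr_at_recall(scores, labels, recall=0.95):
    """False positive rate at the score threshold achieving the given recall.

    scores: higher means 'more likely a match'; labels: 1 = match, 0 = not.
    This is the FPR95 / Error@95% metric used to compare the methods.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos = np.sort(scores[labels == 1])[::-1]   # positive scores, descending
    k = int(np.ceil(recall * len(pos)))        # positives we must accept
    thresh = pos[k - 1]                        # lowest accepted positive score
    neg = scores[labels == 0]
    return float(np.mean(neg >= thresh))       # negatives wrongly accepted

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]
fpr = fpr_at_recall(scores, labels)   # -> 0.25
```

Because the threshold is pinned by the positives, the metric measures only how many negatives leak through once nearly all true matches are accepted, which is why it is robust to class imbalance.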
Figure 6.ROC curves for various methods. The numbers in the legends are FPR95 values. In the legend, the symbol “F” means the network uses fine-tuning with VGG16. The symbol “C” means that the contrastive loss is used in the extraction feature network. The symbol “T” means that the triplet loss is used in the extraction feature network. The symbol “S” means that shortcut connection is used
From our experiments, the following conclusions can be summarized.
(1) In infrared-visible image patch matching, it is hard for traditional methods to extract features common to infrared and visible images because of the different imaging principles, so their results are not satisfactory.
(2) With few samples, training the network from scratch easily leads to over-fitting. With fine-tuning, all deep learning networks outperform the traditional algorithms; fine-tuning avoids over-fitting effectively.
(3) The pseudo-Siamese network performs better than the Siamese network. A likely explanation is that the low-level convolution layers do not share weights in the pseudo-Siamese network, so the two separate branches can extract the shallow features unique to infrared and visible imaging.
To be concrete, we visualize the learned deep features using t-SNE[21], as shown in Fig.7.
We show some top-ranking correct and incorrect results of InViNet in Fig.8. We find that the incorrect results could easily be mistaken by a human as well.
Figure 8.Top-ranking false and true results in overpass and factory image patches. (a) True positive samples; (b) True negative samples; (c) False positive samples; (d) False negative samples
To further analyze our results, we list the mean average precision (MAP) on the test set, which contains five classes that were never used during training. As shown in Fig.9, our InViNet outperforms the other approaches.
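The exact ranking protocol behind the MAP numbers is not spelled out here, but MAP itself is standard: average precision per query, averaged over queries. A minimal NumPy sketch with toy relevance lists:

```python
import numpy as np

def average_precision(labels_ranked):
    """AP for one query: relevance labels of retrieved patches, sorted by score.

    labels_ranked: 1 = relevant (matching patch), 0 = not relevant.
    """
    labels = np.asarray(labels_ranked, int)
    hits = np.cumsum(labels)                  # relevant items seen so far
    ranks = np.arange(1, len(labels) + 1)
    precisions = hits / ranks                 # precision at each rank
    return float(np.sum(precisions * labels) / max(labels.sum(), 1))

def mean_average_precision(per_query_labels):
    """MAP: mean of AP over all queries."""
    return float(np.mean([average_precision(q) for q in per_query_labels]))

# Toy example: two queries with hypothetical retrieval results
map_score = mean_average_precision([[1, 1, 0], [0, 1]])
```

AP rewards placing the matching patches near the top of the ranking, so MAP complements FPR95 by measuring ranking quality rather than a single operating point.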
Figure 7.Visualization of the five class features in the test data set by the feature extraction network. (a) Features from the original network; (b) Features from the network with the contrastive loss; (c) Features from the network with the triplet loss
Figure 9.Performance matching in the test data set. In the legend, the symbols “F”, “C”, “T” and “S” have the same meaning as in Fig.6
3 Conclusions
Given the difficulty of infrared-visible image patch matching, this paper proposes an improved network based on deep learning. Compared with the previous method, our method increases the accuracy from 78.95% to 88.75%. At present, it is difficult to obtain paired visible and infrared samples. Many multi-sensor data sets are available on the Internet, but they are not fully utilized because they lack corresponding similar visible images. We believe that unsupervised learning can make full use of these multi-sensor images to further improve matching performance in the future.
References
[1] Weiping Yang, Zhenkang Shen. Matching technique and its application in aided inertial navigation. Infrared and Laser Engineering, 36, 15-17(2007).
[2] Hongguang Li, Wenrui Ding, Xianbin Cao, et al. Image registration and fusion of visible and infrared integrated camera for medium-altitude unmanned aerial vehicle remote sensing. Remote Sensing, 9, 441(2017).
[3] Ning Wang, Ming Zhou, Qinglei Du. A method for infrared visible image fusion and target recognition. Journal of Air Force Early Warning Academy, 33, 328-332(2019).
[4] Yuanhong Mao, Zhanzhuang He, Zhong Ma. Infrared target classification with reconstruction transfer learning. Journal of University of Electronic Science and Technology of China, 49, 609-614(2020).
[5] D G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91-110(2004).
[6] Bay H, Tuytelaars T, Gool L V. SURF: Speeded up robust features[C]European Conference on Computer Vision, 2006, 3951: 404-417.
[7] Rublee E, Rabaud V, Konolige K, et al. ORB: An efficient alternative to SIFT or SURF[C]International Conference on Computer Vision, 2011: 2564-2571.
[8] A A Sima, S J Buckley. Optimizing SIFT for matching of short wave infrared and visible wavelength images. Remote Sensing, 5, 2037-2056(2013).
[9] D M Li, J L Zhang. An improved infrared and visible images matching based on SURF. Applied Mechanics and Materials, 2418, 1637-1640(2013).
[10] Zhiguo Chao, Bo Wu. Approach on scene matching based on histograms of oriented gradients. Infrared and Laser Engineering, 41, 513-516(2012).
[11] Zhiguo Cao, Ruicheng Yan, Jie Song. Approach on fuzzy shape context matching between infrared images and visible images. Infrared and Laser Engineering, 37, 1095-1100(2008).
[12] Anbo Jiao, Liyun Shao, Chenxi Li, et al. Automatic target recognition algorithm based on affine invariant feature of line grouping. Infrared and Laser Engineering, 48, S226003(2019).
[13] Han X, Leung T, Jia Y, et al. MatchNet: Unifying feature and metric learning for patch-based matching[C]IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 3279-3286.
[14] Zagoruyko S, Komodakis N. Learning to compare image patches via convolutional neural networks[C]IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 4353-4361.
[15] M S Hanif. Patch match networks: Improved two-channel and Siamese networks for image patch matching. Pattern Recognition Letters, 120, 54-61(2019).
[16] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]ICLR 2015: International Conference on Learning Representations, 2015.
[17] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1-9.
[18] Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping[C]IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), 2006, 2: 1735-1742.
[19] Schroff F, Kalenichenko D, Philbin J. FaceNet: A unified embedding for face recognition and clustering[C]IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 815-823.
[20] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010: 249-256.
[21] Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605(2008).
