Overview: Existing work on person re-identification mostly extracts invariant feature representations from cross-view visible-light cameras and ignores the imaging characteristics of the infrared domain, so few studies address the visible-infrared cross-modality setting. In addition, most methods compare two views by computing similarity on the feature maps of a single convolutional layer, which weakens feature learning. To address these problems, we design a feature pyramid random fusion network (FPRnet). First, we preprocess the images with SRCNN, a super-resolution reconstruction method, to alleviate the blur of IR images and make feature learning more robust. Second, we take ResNet-50 pre-trained on the ImageNet dataset as the baseline to learn feature representations of images in the RGB domain and the IR domain. Re-identification based on the residual network alone learns features at only one resolution scale. However, tracking a specific person requires learning from several aspects, including the pedestrian's overall properties, local attributes, and salient characteristics, to reduce misjudgments. For this reason, following the idea of the feature pyramid network, we build the features of different convolutional layers in ResNet-50 into a pyramid structure. This structure computes the similarity between multiple features at the same time. It abandons the original pyramid network's use of different scales to adapt to pedestrian bounding-box images; instead, it embeds the pyramid idea into the deep residual network as a feature extraction module that extracts the IR-RGB block. This learning method combines the advantages of local and global feature learning and yields features with both strong semantics and strong geometric detail.
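To make the pyramid idea concrete, the following is a minimal NumPy sketch of a standard top-down feature pyramid built over four backbone stages. It is not taken from the FPRnet source code: the toy channel counts, the random lateral weights, the nearest-neighbour upsampling, and the function names `lateral_1x1`, `upsample_2x`, `build_pyramid`, and `pooled_descriptors` are all assumptions chosen for illustration.

```python
import numpy as np

def lateral_1x1(feature, weight):
    """1x1 convolution written as a channel-mixing product.
    feature: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, feature)

def upsample_2x(feature):
    """Nearest-neighbour upsampling of both spatial axes."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)

def build_pyramid(stage_maps, lateral_weights):
    """stage_maps: [C2, C3, C4, C5], finest first. Returns [P2, P3, P4, P5]."""
    top = lateral_1x1(stage_maps[-1], lateral_weights[-1])
    levels = [top]
    # Walk from the second-coarsest stage down to the finest one.
    for fmap, weight in zip(stage_maps[-2::-1], lateral_weights[-2::-1]):
        top = lateral_1x1(fmap, weight) + upsample_2x(top)
        levels.append(top)
    return levels[::-1]

def pooled_descriptors(levels):
    """Global average pooling: one vector per pyramid level."""
    return [level.mean(axis=(1, 2)) for level in levels]

rng = np.random.default_rng(0)
# Toy stand-ins for backbone stage outputs (assumed shapes, not real ResNet-50 sizes).
channels = [8, 16, 32, 64]
sizes = [(32, 16), (16, 8), (8, 4), (4, 2)]
stage_maps = [rng.standard_normal((c, h, w)) for c, (h, w) in zip(channels, sizes)]
lateral_weights = [rng.standard_normal((4, c)) * 0.1 for c in channels]

pyramid = build_pyramid(stage_maps, lateral_weights)
descriptors = pooled_descriptors(pyramid)
```

Each pyramid level keeps the spatial resolution of its backbone stage while sharing a common channel width, so similarities can be computed on several levels at once. In a two-branch setting, the same construction would be applied to the RGB and IR inputs separately.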
Then, the random fusion mechanism serves as the basis of the feature fusion module, which completes the end-to-end design of the two-branch network and produces the fusion block; this avoids the problem of excessive parameters in the pyramid model. Third, after feature extraction and feature fusion, cross-modality prediction is carried out. The prediction consists of three branches: a cross-domain branch (blue in the framework figure), an RGB-domain branch (pink), and an IR-domain branch (purple). It generates three types of classification loss, and a hybrid loss function is then used to reduce both the intra-modality gap in visual appearance and the inter-modality heterogeneity. The IR-RGB block and the fusion block compete with each other in a minimax game to learn the joint-modal classification loss. Finally, the original dataset is used to test FPRnet. Extensive experiments on the public SYSU-MM01 dataset, evaluated in terms of mAP and convergence speed, demonstrate the advantages of our approach over state-of-the-art methods: FPRnet achieves a competitive 32.12% mAP and converges much faster. The source code of FPRnet is available at https://github.com/KyreneLaura/FPRnet.
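The three classification losses and their combination can be sketched as follows. This is only an illustration under stated assumptions: the overview does not give the exact form of the hybrid loss, so the weighted sum, the default weights of 1.0, and the names `cross_entropy` and `hybrid_loss` are hypothetical. The minimax interaction between the IR-RGB block and the fusion block is not modeled here.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[label]

def hybrid_loss(cross_logits, rgb_logits, ir_logits, label, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the cross-domain, RGB-domain, and IR-domain losses.
    The weighting scheme is an assumption, not the paper's stated formula."""
    branch_losses = (
        cross_entropy(cross_logits, label),
        cross_entropy(rgb_logits, label),
        cross_entropy(ir_logits, label),
    )
    total = sum(w * loss for w, loss in zip(weights, branch_losses))
    return total, branch_losses

# Toy logits over three identities; the true identity is index 0.
total, parts = hybrid_loss([2.0, 0.5, -1.0], [1.5, 0.2, 0.1], [0.3, 0.3, 0.3], label=0)
```

Returning the per-branch losses alongside the total makes it easy to check whether one modality dominates training.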
An illustration of the framework of the feature pyramid random fusion network. (a) The model generates a top-level joint feature (termed the IR-RGB block) and a random fusion feature (termed the fusion block). Specifically, the IR-RGB block is the concatenation of the RGB block and the IR block; the fusion block is generated by randomly blending features from different levels and distinct modalities. (b) The prediction consists of a blue cross-domain branch, a pink RGB-domain branch, and a purple IR-domain branch, which generate three types of classification loss. The IR-RGB block and the fusion block compete with each other in a minimax game to learn the joint-modal classification loss.
Feature selection. r and i denote the RGB and IR domains, respectively. Features P(5), P(4), P(3), and P(2) are shown in green, orange, blue, and pink, respectively.
The methods of feature fusion. (a) Horizontal concatenation; (b) vertical concatenation; (c) hybrid concatenation. Here r denotes the RGB modality and i the IR modality.
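One way to picture a fusion block drawn at random from different levels and modalities is the sketch below. It is a guess at the general mechanism, not the authors' implementation: the sampling rule (a uniform draw without replacement), the descriptor length, and the name `random_fusion` are assumptions, and the three concatenation layouts in the caption above are not reproduced.

```python
import numpy as np

def random_fusion(rgb_levels, ir_levels, num_selected, rng):
    """Concatenate a random subset of per-level descriptors drawn from both modalities.
    rgb_levels / ir_levels: dict mapping pyramid level (2..5) to a 1-D descriptor."""
    candidates = [("r", level, vec) for level, vec in rgb_levels.items()]
    candidates += [("i", level, vec) for level, vec in ir_levels.items()]
    picked = rng.choice(len(candidates), size=num_selected, replace=False)
    chosen = [candidates[j] for j in sorted(picked)]
    fused = np.concatenate([vec for _, _, vec in chosen])
    tags = [f"{modality}P{level}" for modality, level, _ in chosen]
    return fused, tags

rng = np.random.default_rng(0)
# Hypothetical 4-dimensional descriptors for P(2)..P(5) in each modality.
rgb_levels = {level: rng.standard_normal(4) for level in (2, 3, 4, 5)}
ir_levels = {level: rng.standard_normal(4) for level in (2, 3, 4, 5)}

fused, tags = random_fusion(rgb_levels, ir_levels, num_selected=3, rng=rng)
```

Because only `num_selected` descriptors enter the fused vector, its length stays fixed regardless of how many pyramid levels exist, which is consistent with the stated goal of limiting parameter growth.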
Comparison with state of the art on SYSU-MM01
The trend of mAP during training
The trend of loss during training
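For readers unfamiliar with the mAP metric used in the comparison and the training curves, a minimal sketch of how it is computed from ranked retrieval lists follows. The function names and the toy identities are illustrative assumptions, and dataset-specific evaluation protocol details for SYSU-MM01 are not modeled.

```python
def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each rank where a correct match appears."""
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """rankings: list of (query_id, gallery ids sorted by decreasing similarity)."""
    scores = [average_precision(ranked, query_id) for query_id, ranked in rankings]
    return sum(scores) / len(scores)

# Two toy queries with hand-made gallery rankings.
rankings = [
    ("a", ["a", "b", "a", "c"]),
    ("b", ["c", "b", "a", "b"]),
]
map_score = mean_average_precision(rankings)
```

In this toy case the first query scores higher because its correct matches appear earlier in the ranking, which is exactly the behaviour mAP is meant to reward.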