Citation: Wang R G, Wang J, Yang J, et al. Feature pyramid random fusion network for visible-infrared modality person re-identification[J]. Opto-Electron Eng, 2020, 47(12): 190669. doi: 10.12086/oee.2020.190669

Feature pyramid random fusion network for visible-infrared modality person re-identification

More Information
  • Corresponding author: Yang Juan, E-mail: yangjuan6985@163.com 
  • Existing work on person re-identification considers only extracting invariant feature representations from cross-view visible cameras and ignores the imaging characteristics of the infrared domain, so studies on the visible-infrared modality remain scarce. Moreover, most methods distinguish two views by computing similarity on the feature maps of a single convolutional layer, which weakens feature learning. To handle these problems, we design a feature pyramid random fusion network (FPRnet) that learns discriminative multi-semantic features by computing similarities between multi-level convolutional features when matching persons. FPRnet not only reduces the negative effect of intra-modality bias, but also narrows the inter-modality heterogeneity gap caused by the very different visual properties of infrared images. Meanwhile, our work integrates the advantages of local and global feature learning, which effectively addresses visible-infrared person re-identification. Extensive experiments on the public SYSU-MM01 dataset, evaluated in terms of mAP and convergence speed, demonstrate the superiority of our approach over state-of-the-art methods. FPRnet achieves a competitive 32.12% mAP with much faster convergence.

  • Overview: Existing work on person re-identification considers only extracting invariant feature representations from cross-view visible cameras and ignores the imaging characteristics of the infrared domain, so studies on the visible-infrared modality remain scarce. Moreover, most methods distinguish two views by computing similarity on the feature maps of a single convolutional layer, which weakens feature learning. To handle these problems, we design a feature pyramid random fusion network (FPRnet); illustrative sketches of its main steps follow this overview.

    Firstly, we introduce SRCNN, a super-resolution reconstruction method, as a preprocessing step; its purpose is to alleviate the interference of IR-image blur and make feature learning more robust.

    Secondly, we take ResNet-50 pre-trained on the ImageNet dataset as the baseline to learn feature representations of images in the RGB and IR domains. Re-identification based on the residual network alone learns features at only one resolution scale, whereas tracking a specific person requires multi-directional learning of the pedestrian's overall properties, local attributes, and salient characteristics in order to reduce misjudgment. For this reason, following the idea of the feature pyramid network, the features of different convolutional layers in ResNet-50 are organized into a pyramid structure. This allows similarities between multiple features to be computed simultaneously, and it abandons the original pyramid network's practice of using different image scales to fit pedestrian bounding boxes; instead, the pyramid structure is embedded into the deep residual network as a feature-extraction module that produces the IR-RGB block. This learning scheme integrates the advantages of local and global feature learning and yields representations with both strong semantics and strong geometric detail. A random fusion mechanism then serves as the basis of the feature-fusion module, completing the end-to-end design of the two branches and producing the fusion block, which avoids the problem of excessive parameters in the pyramid model.

    Thirdly, after feature extraction and feature fusion are completed, cross-modality prediction is carried out. It consists of a cross-domain part, an RGB-domain part, and an IR-domain part (shown in blue, pink, and purple, respectively, in the paper's figures). These generate three types of classification loss, and a hybrid loss function is then used to reduce both the intra-modality appearance gap and the inter-modality heterogeneity. The IR-RGB block and the fusion block play a minimax game against each other to learn the joint-modal classification loss.

    Finally, the original dataset is used to test FPRnet. Extensive experiments on the public SYSU-MM01 dataset, evaluated in terms of mAP and convergence speed, demonstrate the superiority of our approach over state-of-the-art methods; FPRnet achieves a competitive 32.12% mAP with much faster convergence. The source code of FPRnet is available at https://github.com/KyreneLaura/FPRnet.
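    As a concrete reference for the preprocessing step, below is a minimal PyTorch sketch of SRCNN following the widely cited 9-1-5 layer configuration of the original network. The exact filter sizes and training settings used inside FPRnet are not stated in this overview, so treat these as assumptions.

```python
# Minimal SRCNN sketch (9-1-5 configuration) for sharpening blurred IR
# crops before feature learning. Layer sizes are assumptions taken from
# the original SRCNN paper, not FPRnet's published settings.
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# Usage: restore detail in an (already upsampled) IR pedestrian crop.
ir = torch.rand(1, 3, 256, 128)  # hypothetical bounding-box crop
sr = SRCNN()(ir)                 # same spatial size, reconstructed detail
```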
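    The pyramid feature-extraction module can be pictured as collecting one descriptor per ResNet-50 stage, so that similarities can later be computed across all levels at once. The sketch below assumes a torchvision ResNet-50 backbone and a hypothetical 1x1 projection per level; FPRnet's actual heads, strides, and embedding sizes are not reproduced here.

```python
# Sketch of multi-level (pyramid) feature extraction from a shared
# ResNet-50 backbone. Each stage's output is projected to a common
# dimension and pooled to one vector, so low levels keep geometric
# detail while high levels carry semantics.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PyramidExtractor(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")  # ImageNet-pretrained baseline
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])
        # One 1x1 projection per pyramid level (channel counts of ResNet-50 stages).
        self.project = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in (256, 512, 1024, 2048)
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage, proj in zip(self.stages, self.project):
            x = stage(x)
            feats.append(self.pool(proj(x)).flatten(1))  # one vector per level
        return feats

imgs = torch.rand(2, 3, 256, 128)
levels = PyramidExtractor()(imgs)
print([f.shape for f in levels])  # four tensors of shape (2, 256)
```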
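    The random fusion mechanism and the hybrid loss can be sketched as follows. The level-sampling rule and the simple additive fusion are illustrative assumptions, not the paper's published formulation; only the overall idea (randomly fusing pyramid levels from the two branches, then summing classification losses from the RGB, IR, and shared domains) follows the overview above.

```python
# Hedged sketch of random fusion over pyramid levels plus a hybrid loss.
# The uniform level sampling below is an assumed stand-in for FPRnet's
# random fusion rule; it keeps fusion parameters small because a single
# shared layer serves every pyramid level.
import torch
import torch.nn as nn

class RandomFusion(nn.Module):
    def __init__(self, embed_dim: int = 256, num_levels: int = 4):
        super().__init__()
        self.fc = nn.Linear(embed_dim, embed_dim)
        self.num_levels = num_levels

    def forward(self, rgb_levels, ir_levels):
        if self.training:
            # Pick one pyramid level at random per forward pass.
            k = torch.randint(self.num_levels, (1,)).item()
            fused = rgb_levels[k] + ir_levels[k]
        else:
            # Average all levels deterministically at evaluation time.
            fused = sum(r + i for r, i in zip(rgb_levels, ir_levels)) / self.num_levels
        return self.fc(fused)

# Hybrid loss: identity classification in the RGB, IR, and shared domains.
ce = nn.CrossEntropyLoss()

def hybrid_loss(rgb_logits, ir_logits, joint_logits, labels):
    return ce(rgb_logits, labels) + ce(ir_logits, labels) + ce(joint_logits, labels)
```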
