Citation: Jiang W T, Dong R, Zhang S C. Global pooling residual classification network guided by local attention[J]. Opto-Electron Eng, 2024, 51(7): 240126. doi: 10.12086/oee.2024.240126
In image classification tasks, attention mechanisms have been shown in numerous experiments to significantly improve a model's generalization ability. Most attention mechanisms, however, only reweight the importance of local or global features in isolation, without considering that the interrelationships among local features also shape the overall image features. To address this issue and improve generalization, this paper proposes MSLENet, a global pooling residual classification network guided by local attention, with ResNet34 as the baseline. First, MSLENet modifies the initial convolution structure by replacing the convolution method and removing the pooling layer, so that the network retains the basic information of the image and makes better use of its details. Second, a multiple segmentation local enhancement (MSLE) attention module is introduced to strengthen the information relationship between local and global features and to amplify local key information. The MSLE module consists of three sequential components: a multiple segmentation (MS) module, a local enhancement (LE) module, and a guide module. The MS module uniformly segments the image to fully exploit local information. The LE module enhances the local features of each segment and amplifies its locally important information, improving the interaction among local features and the utilization of local key information. The guide module then directs important local features into the global features through interaction between feature layers and different feature groups, enhancing the globally important features and the network's expressiveness. Finally, to address information loss in the residual structure of ResNet, a pooling residual (PR) module is proposed: it replaces the convolution in the ResNet34 residual shortcut with pooling operations, improving information utilization between layers and reducing overfitting. Experimental results show that MSLENet achieves accuracies of 96.93%, 82.51%, 97.22%, 72.82%, 97.39%, 89.70%, and 95.44% on the CIFAR-10, CIFAR-100, SVHN, STL-10, GTSRB, Imagenette, and NWPU-RESISC45 datasets, respectively. Compared with other networks and modules, MSLENet performs better, confirming that the interaction between local and global features, the joint use of local and global information, and the guidance of important local features into global features effectively improve classification accuracy.
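To make the two architectural ideas above concrete, the following is a minimal PyTorch sketch of an MSLE-style block (segment the feature map, enhance each segment, then guide the amplified per-segment weights back into the global features via nearest-neighbor upsampling) and of a residual block whose downsampling shortcut uses pooling instead of a strided 1×1 convolution. The class names, the 2×2 segmentation grid, the squeeze-and-excitation-style gate, and the zero-padded pooled shortcut are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the MSLE and PR ideas; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSLEBlock(nn.Module):
    """MS -> LE -> guide: segment the map, enhance each segment locally,
    then gate the global features with the upsampled segment weights."""
    def __init__(self, channels, grid=2, reduction=16):
        super().__init__()
        self.grid = grid
        # LE: a small channel gate shared by all segments (assumed form).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.grid
        ph, pw = h // g, w // g                    # segment size (assumes divisibility)
        # MS: (b, c, h, w) -> (b*g*g, c, ph, pw), one row per segment.
        segs = x.unfold(2, ph, ph).unfold(3, pw, pw)        # b, c, g, g, ph, pw
        segs = segs.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, ph, pw)
        # LE: per-segment channel weights from global average pooling.
        weights = self.fc(segs.mean(dim=(2, 3)))            # b*g*g, c
        # Guide: broadcast each segment's weights over its spatial region
        # with nearest-neighbor upsampling, then gate the global features.
        wmap = weights.reshape(b, g, g, c).permute(0, 3, 1, 2)  # b, c, g, g
        wmap = F.interpolate(wmap, size=(h, w), mode="nearest")
        return x * wmap

class PRBlock(nn.Module):
    """Residual block whose downsampling shortcut pools instead of
    convolving; zero-padding the pooled channels is one guess at how
    the shortcut is widened."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.stride, self.pad = stride, out_ch - in_ch

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        shortcut = x if self.stride == 1 else F.avg_pool2d(x, self.stride)
        if self.pad > 0:                           # match channel count
            shortcut = F.pad(shortcut, (0, 0, 0, 0, 0, self.pad))
        return F.relu(out + shortcut)

if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(MSLEBlock(64)(x).shape)                  # torch.Size([2, 64, 32, 32])
    print(PRBlock(64, 128, stride=2)(x).shape)     # torch.Size([2, 128, 16, 16])
```

Gating the full map with upsampled per-segment weights is one simple way to realize the guide step; the paper's guide module also interacts across feature layers and feature groups, which this sketch does not attempt to reproduce.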
Pooling residual structure
MSLE structure diagram
Schematic diagram before and after segmentation
Feature extraction structure diagram
Illustration of the nearest-neighbor interpolation upsampling operation
Guided feature information diagram
Visualization of the MSLE process
Three module structures. (a) Block; (b) M-block; (c) MP-block
Overall structure of MSLENet
Structure diagrams of three types of networks. (a) ResNet34-c; (b) M-MSLENet; (c) MP-MSLENet
Iteration accuracies of the three types of networks on three datasets. (a) CIFAR-10; (b) CIFAR-100; (c) SVHN
Iteration loss of the three types of networks on three datasets. (a) CIFAR-10; (b) CIFAR-100; (c) SVHN
Accuracy of five modules at different iterations. (a) CIFAR-100; (b) STL-10; (c) Imagenette; (d) NWPU-RESISC45
Channel visualizations under different modules. (a) CA; (b) ECA; (c) GCT; (d) SE; (e) M-APC