Citation: Hang H, Huang Y P, Zhang X R, et al. Design of Swin Transformer for semantic segmentation of road scenes[J]. Opto-Electron Eng, 2024, 51(1): 230304. doi: 10.12086/oee.2024.230304
Semantic segmentation of road scenes is a crucial task for autonomous driving perception. In recent years, deep learning has advanced semantic segmentation research and produced numerous new algorithms. Deep learning methods train models on large amounts of data, automatically extract features, and have become the mainstream approach to semantic segmentation. Current deep learning algorithms for image semantic segmentation fall primarily into two categories: CNN-based and Transformer-based. CNN-based algorithms such as FCN, PSPNet, U-Net, and DeepLab have made significant contributions to the field. The Transformer is a novel architecture based on self-attention, initially applied in the NLP domain. With powerful feature extraction capabilities, the Transformer captures long-range dependencies between feature vectors and thus acquires richer contextual information. Researchers have gradually adapted Transformers to computer vision, producing a variety of vision Transformers. Among these, the Swin Transformer stands out: it employs a hierarchical structure to output multi-scale features, computes local self-attention within windows, achieves information interaction between windows through shifted-window operations, and demonstrates excellent performance across visual tasks. Despite extensive research on semantic segmentation of road scenes, existing methods still face challenges in practical applications. To address low segmentation accuracy in complex scenes and poor recognition of small targets, this paper proposes a road scene semantic segmentation algorithm based on the Swin Transformer with multi-scale feature fusion.
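The window partitioning and shifted-window mechanism described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the helper names `window_partition` and `cyclic_shift` are ours, and the attention computation itself is omitted; the sketch only shows how tokens are grouped into windows and how a cyclic shift makes the next block's windows straddle the previous window borders.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns (num_windows, ws*ws, C): self-attention is computed
    independently within each row group of ws*ws tokens.
    """
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def cyclic_shift(x, ws):
    """Roll the map by ws//2 in both spatial axes, so that re-partitioned
    windows mix tokens from adjacent windows of the previous block."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))

# Toy 8x8 map with 2 channels and 4x4 windows (sizes chosen for illustration).
H = W = 8; ws = 4; C = 2
x = np.arange(H * W * C, dtype=float).reshape(H, W, C)

wins = window_partition(x, ws)                    # regular windows
shifted_wins = window_partition(cyclic_shift(x, ws), ws)  # shifted windows
print(wins.shape)  # (4, 16, 2): 4 windows of 16 tokens each
```

In the real Swin Transformer, regular and shifted window partitions alternate between consecutive blocks, which is what gives windowed (local) attention a path for cross-window information flow.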
The network adopts an encoder-decoder structure. The encoder employs an improved Swin Transformer feature extractor for road scene images, reducing information loss during downsampling and retaining as many edge features as possible. The decoder consists of an attention fusion module and a feature pyramid network, which effectively integrate multi-scale semantic features and efficiently restore fine-grained details in urban road images. We conduct quantitative and qualitative experiments on the Cityscapes urban road scene dataset. The results show that, compared with various existing semantic segmentation algorithms, our method achieves significant improvements in segmentation accuracy. However, the network structure is relatively complex, with a large number of computations and parameters; practical deployment still requires further refinement, optimization of the network structure, and lightweight processing to reduce parameters and computation.
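The feature-pyramid fusion in the decoder can be sketched in a few lines of NumPy. This is a generic FPN-style top-down pathway, not the paper's attention fusion module: we assume all pyramid levels have already been projected to the same channel count, and nearest-neighbour upsampling (`upsample2x`) stands in for learned interpolation.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(features):
    """Top-down pathway: start from the coarsest (most semantic) level,
    upsample it, and add the finer lateral feature at each step."""
    feats = list(features)        # ordered fine -> coarse, equal channel count
    out = [feats[-1]]
    for lateral in reversed(feats[:-1]):
        out.append(upsample2x(out[-1]) + lateral)
    return out[::-1]              # back to fine -> coarse ordering

# Toy three-level pyramid, as a hierarchical encoder might produce.
C = 4
pyramid = [np.ones((C, 32, 32)), np.ones((C, 16, 16)), np.ones((C, 8, 8))]
fused = fpn_fuse(pyramid)
print([f.shape for f in fused])   # [(4, 32, 32), (4, 16, 16), (4, 8, 8)]
```

The fused fine-resolution map thus accumulates semantic context from every coarser level, which is the mechanism behind recovering fine-grained detail while keeping scene-level semantics.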
Figures: network architecture; Swin Transformer architecture; Swin Transformer block; Patch Merging module; FCM module; AFM module; comparison of segmentation effects of multiple methods in Cityscapes scenes; comparison of ablation experiment effects.