Citation: Hang H, Huang Y P, Zhang X R, et al. Design of Swin Transformer for semantic segmentation of road scenes[J]. Opto-Electron Eng, 2024, 51(1): 230304. doi: 10.12086/oee.2024.230304
Semantic segmentation of road scenes is a crucial task for autonomous driving perception. In recent years, deep learning has advanced semantic segmentation research and produced numerous new algorithms. Deep learning methods train models on large amounts of data, automatically extract features, and have become the mainstream approach to semantic segmentation. Current deep learning algorithms for image semantic segmentation fall primarily into two categories: CNN-based and Transformer-based. CNN-based algorithms such as FCN, PSPNet, U-Net, and DeepLab have made significant contributions to the field. The Transformer is a novel architecture based on self-attention, initially applied in the NLP domain. With powerful feature extraction capabilities, the Transformer captures long-range dependencies between feature vectors and thus acquires richer contextual information. Researchers have gradually adapted Transformers to computer vision, producing a variety of vision Transformers. Among these, the Swin Transformer stands out: it employs a hierarchical structure to output multi-scale features, computes local self-attention within windows, achieves information interaction between windows through shifted-window operations, and demonstrates excellent performance across visual tasks. Despite extensive research on semantic segmentation of road scenes, existing methods still face challenges in practical applications. To address low segmentation accuracy in complex scenes and poor recognition of small targets, this paper proposes a road scene semantic segmentation algorithm based on the Swin Transformer with multi-scale feature fusion.
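The window partitioning and shifted-window mechanism described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the helper names `window_partition` and `cyclic_shift` are ours, and the attention computation itself is omitted; the sketch only shows how tokens are grouped into windows and how a cyclic shift makes the next block's windows straddle the previous window borders.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns (num_windows, ws*ws, C): self-attention is computed
    independently within each row group of ws*ws tokens.
    """
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def cyclic_shift(x, ws):
    """Roll the map by ws//2 in both spatial axes, so that re-partitioned
    windows mix tokens from adjacent windows of the previous block."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))

# Toy 8x8 map with 2 channels and 4x4 windows (sizes chosen for illustration).
H = W = 8; ws = 4; C = 2
x = np.arange(H * W * C, dtype=float).reshape(H, W, C)

wins = window_partition(x, ws)                    # regular windows
shifted_wins = window_partition(cyclic_shift(x, ws), ws)  # shifted windows
print(wins.shape)  # (4, 16, 2): 4 windows of 16 tokens each
```

In the real Swin Transformer, regular and shifted window partitions alternate between consecutive blocks, which is what gives windowed (local) attention a path for cross-window information flow.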
The network adopts an encoder-decoder structure. The encoder employs an improved Swin Transformer feature extractor for road scene images, reducing information loss during downsampling and retaining as many edge features as possible. The decoder consists of an attention fusion module and a feature pyramid network, which effectively integrate multi-scale semantic features and efficiently restore fine-grained details in urban road images. We conduct quantitative and qualitative experiments on the Cityscapes urban road scene dataset. The results show that, compared with various existing semantic segmentation algorithms, our method achieves significant improvements in segmentation accuracy. However, the network structure is relatively complex, with a large number of computations and parameters; practical deployment still requires further refinement, optimization of the network structure, and lightweight processing to reduce parameters and computation.
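The feature-pyramid fusion in the decoder can be sketched in a few lines of NumPy. This is a generic FPN-style top-down pathway, not the paper's attention fusion module: we assume all pyramid levels have already been projected to the same channel count, and nearest-neighbour upsampling (`upsample2x`) stands in for learned interpolation.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(features):
    """Top-down pathway: start from the coarsest (most semantic) level,
    upsample it, and add the finer lateral feature at each step."""
    feats = list(features)        # ordered fine -> coarse, equal channel count
    out = [feats[-1]]
    for lateral in reversed(feats[:-1]):
        out.append(upsample2x(out[-1]) + lateral)
    return out[::-1]              # back to fine -> coarse ordering

# Toy three-level pyramid, as a hierarchical encoder might produce.
C = 4
pyramid = [np.ones((C, 32, 32)), np.ones((C, 16, 16)), np.ones((C, 8, 8))]
fused = fpn_fuse(pyramid)
print([f.shape for f in fused])   # [(4, 32, 32), (4, 16, 16), (4, 8, 8)]
```

The fused fine-resolution map thus accumulates semantic context from every coarser level, which is the mechanism behind recovering fine-grained detail while keeping scene-level semantics.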
Figures: network architecture; Swin Transformer architecture; Swin Transformer block; Patch Merging module; FCM module; AFM module; comparison of segmentation effects of multiple methods in Cityscapes scenes; comparison of ablation experiment effects.