Citation: Wu M J, Zhang Y A, Lin S L, et al. Real-time semantic segmentation algorithm based on BiLevelNet[J]. Opto-Electron Eng, 2024, 51(5): 240030. doi: 10.12086/oee.2024.240030
To address the large parameter counts of semantic segmentation networks, which complicate deployment on memory-constrained edge devices, a lightweight real-time semantic segmentation algorithm based on BiLevelNet is proposed. First, dilated convolutions are used to enlarge the receptive field, and feature reuse strategies are integrated to strengthen the network's region awareness. Next, a two-stage PBRA (Partial Bi-Level Route Attention) mechanism is adopted to build connections between distant objects, enhancing the network's ability to perceive global context. Finally, the FADE operator is introduced to merge in shallow features, improving the quality of image upsampling.
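As a rough illustration of why dilated convolutions enlarge the receptive field, the sketch below applies the standard formula for the span of a dilated kernel, k + (k - 1)(d - 1), to a stack of stride-1 layers. The layer counts and dilation rates are hypothetical and are not taken from BiLevelNet.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


# Hypothetical stacks for comparison; not the rates used in the paper.
plain = [(3, 1), (3, 1), (3, 1)]
dilated = [(3, 1), (3, 2), (3, 4)]

print(receptive_field(plain))    # 7
print(receptive_field(dilated))  # 15
```

Both stacks use three 3×3 kernels and therefore the same number of weights, yet the dilated stack covers roughly twice the span per axis.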
Fig. 4 presents the hierarchical feature maps of the AFR module and describes their characteristics and roles. It clarifies the distinctions and connections between the input feature map, the local feature map obtained through 3×3 depthwise convolution, and the contextual feature map obtained through dilated convolution. It also shows how these features are combined in the final fused feature map, which exhibits strong activation in both local and global contexts. In addition, a gradually decreasing channel reduction factor is employed, as detailed in Table 3. With a reduction factor of r=1/4, the PBRA module improves mIoU by 1.5% and increases speed by 12 FPS compared with BRA.
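To give a sense of why a 3×3 local branch can stay lightweight, the sketch below compares weight counts for a standard 3×3 convolution and a depthwise one, reading the 3×3 convolution of the local branch as a depthwise convolution. The channel width of 64 is a hypothetical value chosen for illustration, biases are ignored, and the helper names are not from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_conv_params(channels: int, k: int = 3) -> int:
    """Weights in a depthwise k x k convolution: one k x k filter per channel."""
    return k * k * channels


channels = 64  # hypothetical width, not a value reported in the paper

standard = standard_conv_params(channels, channels)
depthwise = depthwise_conv_params(channels)

print(standard)               # 36864
print(depthwise)              # 576
print(standard // depthwise)  # 64
```

The saving factor equals the number of output channels, which is one reason depthwise layers are common in networks designed for a small parameter budget.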
Discontinuities and missing pixels appear in the segmentation results when bilinear interpolation is used for upsampling. Inspection of the deep feature maps before bilinear upsampling shows that the features for roads and sidewalks are similar, which can lead to misclassification. To counter this, shallow features that preserve edge information are fed into the FADE upsampling process, improving edge segmentation. This approach mitigates the loss of spatial information and yields smoother, better-defined segmentation edges.
Experimental results show that, at an input resolution of 512×1024, the network achieves a mean Intersection over Union (mIoU) of 75.1% on the Cityscapes dataset at 121 frames per second, with a model size of only 0.7 M parameters. At an input resolution of 360×480, it achieves an mIoU of 68.2% on the CamVid dataset. Compared with other real-time semantic segmentation methods, the network strikes a favorable balance between speed and accuracy, meeting the real-time requirements of applications such as autonomous driving.
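For readers unfamiliar with the metric, the sketch below shows one common way to compute mean IoU from a confusion matrix: per class, IoU = TP / (TP + FP + FN), averaged over the classes that occur. The labels are toy data invented for illustration, and the official Cityscapes and CamVid evaluation scripts may differ in details such as ignored labels.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground truth t predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average per-class IoU, skipping classes absent from both labels and predictions."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                          # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp     # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy flattened label maps with three classes (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
miou = mean_iou(cm)
print(f"{miou:.3f}")  # 0.500
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.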
Network framework of BiLevelNet
Comparison of different feature extraction modules
AFR-S module and AFR module
Schematic diagram of region perception and feature reuse
BRA module
Partial Bi-Level route attention module
Decoder module
Sample distribution of the Cityscapes and CamVid datasets
Comparison of results using the PBRA module
Comparison of segmentation results using bilinear interpolation
Feature map before bilinear interpolation and the shallow feature map
FADE upsampling segmentation results
Visualization results of networks on Cityscapes dataset
Visualization results of networks on the CamVid dataset