Real-time semantic segmentation algorithm based on BiLevelNet

Wu Majing; Zhang Yong'ai; Lin Shanling; Lin Zhixian; Lin Jianpu

doi:10.12086/oee.2024.240030

Wu M J, Zhang Y A, Lin S L, et al. Real-time semantic segmentation algorithm based on BiLevelNet[J]. Opto-Electron Eng, 2024, 51(5): 240030. doi: 10.12086/oee.2024.240030

1. School of Advanced Manufacturing, Fuzhou University, Quanzhou, Fujian 362200, China
2. Fujian Science & Technology Innovation Laboratory for Optoelectronic Information of China, Fuzhou, Fujian 350116, China

Fund Project: Project supported by the National Key R&D Program of China (2023YFB3609400), Fujian Province Natural Science Foundation of China (2020J01468), and Youth Science Foundation of the National Natural Science Foundation of China (62101132)

*Corresponding author: ljp@fzu.edu.cn

Received Date 30 January 2024

Revised Date 13 March 2024

Accepted Date 13 March 2024

Published Date 25 May 2024

Abstract

Semantic segmentation networks often have large parameter counts, which makes them difficult to deploy on memory-constrained edge devices. To address this, a lightweight real-time semantic segmentation algorithm based on BiLevelNet is proposed. First, dilated convolutions are employed to enlarge the receptive field, and a feature reuse strategy is integrated to strengthen the network's region awareness. Next, a two-stage PBRA (Partial Bi-Level Route Attention) mechanism is incorporated to establish dependencies between distant objects, improving the network's global perception. Finally, the FADE operator is introduced to fuse shallow features during upsampling and improve the quality of the upsampled result. Experimental results show that, at an input resolution of 512×1024, the proposed network achieves a mean Intersection over Union (mIoU) of 75.1% on the Cityscapes dataset at 121 frames per second (FPS), with a model size of only 0.7 M parameters. At an input resolution of 360×480, it achieves an mIoU of 68.2% on the CamVid dataset. Compared with other real-time semantic segmentation methods, the network strikes a balance between speed and accuracy and meets the real-time requirements of applications such as autonomous driving.
Keywords: real-time semantic segmentation; autonomous driving; deep learning; self-attention; upsampling

Overview

The large parameter counts of semantic segmentation networks complicate deployment on memory-constrained edge devices. To address this, a lightweight real-time semantic segmentation algorithm based on BiLevelNet is proposed. First, dilated convolutions are used to enlarge the receptive field, and a feature reuse strategy is integrated to strengthen the network's region awareness. Next, a two-stage PBRA (Partial Bi-Level Route Attention) mechanism is adopted to connect distant objects, improving the network's perception of global context. Finally, the FADE operator is introduced to fuse shallow features during upsampling, improving the quality of the upsampled result.
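The effect of dilation on the receptive field can be checked with a little arithmetic. The sketch below uses the standard formulas for stride-1 convolutions; the function names and the example dilation rates are illustrative assumptions and are not taken from the paper's network configuration.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a single dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied one after another.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer widens
    the receptive field by (kernel_size - 1) * dilation input positions.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


if __name__ == "__main__":
    # A 3x3 kernel keeps 9 weights but covers a wider span as dilation grows.
    for d in (1, 2, 4, 8):
        print(f"dilation={d}: effective size {effective_kernel_size(3, d)}")
    # Hypothetical stack of three 3x3 layers with dilations 1, 2 and 4.
    print(stacked_receptive_field([(3, 1), (3, 2), (3, 4)]))
```

The parameter count of each layer stays fixed at that of a 3×3 kernel, which is why dilation is attractive when the model size budget is tight.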

Fig. 4 depicts the hierarchical feature maps within the AFR module and describes the characteristics and role of each: the input feature map, the local feature map produced by a 3×3 depthwise convolution, and the contextual feature map produced by dilated convolution. These features are combined in the final fused feature map, which shows strong activation in both local and global contexts. In addition, a gradually decreasing channel reduction factor is employed, as detailed in Table 3. With a reduction factor of r = 1/4, the PBRA module improves mIoU by 1.5% and increases speed by 12 FPS compared with BRA.
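For readers unfamiliar with BRA-style attention, the general two-stage idea can be sketched in plain Python: a coarse stage picks the most related regions for each query region, and a fine stage runs ordinary attention only over tokens in the selected regions. This is a minimal illustration of bi-level routing in general, not the paper's PBRA. It assumes mean-pooled region descriptors and no learned projections, and all names (`route_regions`, `bi_level_attention`, `top_k`) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_regions(region_descriptors, top_k):
    """Stage 1: for each region, indices of the top_k most similar regions."""
    routes = []
    for q in region_descriptors:
        scores = [dot(q, k) for k in region_descriptors]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routes.append(ranked[:top_k])
    return routes


def bi_level_attention(tokens_by_region, top_k):
    """Stage 2: each token attends only to tokens of its routed regions."""
    descriptors = [mean_vector(tokens) for tokens in tokens_by_region]
    routes = route_regions(descriptors, top_k)
    outputs = []
    for region_idx, tokens in enumerate(tokens_by_region):
        # Gather keys/values from the selected regions (keys == values here).
        gathered = [t for j in routes[region_idx] for t in tokens_by_region[j]]
        region_out = []
        for query in tokens:
            weights = softmax([dot(query, k) for k in gathered])
            dim = len(query)
            region_out.append(
                [sum(w * v[i] for w, v in zip(weights, gathered)) for i in range(dim)]
            )
        outputs.append(region_out)
    return routes, outputs


if __name__ == "__main__":
    # Four toy regions with two 2-D tokens each (made-up values).
    toy = [
        [[1.0, 0.0], [0.9, 0.1]],
        [[0.0, 1.0], [0.1, 0.9]],
        [[1.0, 0.1], [0.8, 0.0]],
        [[-1.0, 0.0], [-0.9, -0.1]],
    ]
    routes, _ = bi_level_attention(toy, top_k=2)
    print(routes)
```

With `top_k` regions kept out of N, each token attends to roughly `top_k / N` of all tokens, which is where the saving over dense attention comes from.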

When bilinear interpolation is used for upsampling, the segmentation results show discontinuities and missing pixels. Inspection of the deep feature maps before bilinear upsampling reveals that the features for roads and sidewalks are similar, which can lead to misclassification. To counter this, shallow features that preserve edge information are fed into the FADE upsampling process. This compensates for the loss of spatial information and yields smoother, better-defined segmentation edges.

Experimental results show that, at an input resolution of 512×1024, the network achieves a mean Intersection over Union (mIoU) of 75.1% on the Cityscapes dataset at 121 frames per second (FPS), with a model size of only 0.7 M parameters. At an input resolution of 360×480, the network achieves an mIoU of 68.2% on the CamVid dataset. Compared with other real-time semantic segmentation methods, the network strikes a balance between speed and accuracy and meets the real-time requirements of applications such as autonomous driving.
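For reference, the mIoU figures reported above follow the usual definition: per-class IoU = TP / (TP + FP + FN), averaged over classes. The sketch below computes it from flat label lists. It is a minimal illustration that omits details such as ignore labels used by benchmark evaluation scripts, and the function names and toy labels are assumptions rather than anything from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_iou(cm):
    """IoU per class; None for classes absent from both truth and prediction."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(cm):
    """Average IoU over the classes that actually occur."""
    valid = [v for v in per_class_iou(cm) if v is not None]
    return sum(valid) / len(valid)


if __name__ == "__main__":
    # Six toy pixels, three classes (made-up labels).
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(y_true, y_pred, num_classes=3)
    print(per_class_iou(cm))
    print(mean_iou(cm))
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.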