Citation: Hou G P, Dong W, Lu L K, et al. Smartphone image quality assessment method based on Swin-AK Transformer[J]. Opto-Electron Eng, 2025, 52(1): 240264. doi: 10.12086/oee.2025.240264
With the widespread use of smartphones, users' expectations for smartphone image quality have risen significantly. However, limited camera hardware constrains light capture, especially in complex or low-light scenes, which can degrade image quality. Existing no-reference image quality assessment (IQA) algorithms often fall short on smartphone-captured images, motivating the development of a more accurate quality evaluation method. This study proposes an approach based on manual features and a Swin-AK Transformer with dual cross-attention fusion, designed to assess smartphone image quality with greater precision. First, manual features that affect image quality are extracted under the guidance of the human visual system, capturing subtle visual variations such as color, contrast, and texture and thereby increasing the model's sensitivity to image quality. To further improve discriminative power, ResNet50 is applied after manual feature extraction to establish a nonlinear mapping between the manual features and image quality, transforming the initial low-level features into more representative high-level features that express image content more comprehensively. Next, the Swin-AK Transformer is introduced; its self-attention mechanism captures local image features, strengthening the model's ability to recognize and process local information in smartphone images and adapting well to their characteristic fine details. In addition, a dual cross-attention fusion module is designed to integrate the manual and deep features efficiently. The module combines spatial and channel attention mechanisms: spatial attention helps the model focus on key regions of the image, while channel attention optimizes the feature representation by re-weighting each channel. The fused features therefore reflect both global image information and local detail variations, in line with how the human visual system naturally perceives image quality. Experiments on two public datasets, SPAQ and LIVE-C, demonstrate superior image quality prediction, with Pearson correlation coefficients of 0.932 and 0.885 and Spearman rank correlation coefficients of 0.929 and 0.858 on SPAQ and LIVE-C, respectively. These results validate the proposed method's effectiveness for smartphone image quality assessment, showing improved sensitivity to quality changes together with high accuracy and robustness.
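The abstract describes extracting manual features (color, contrast, texture) guided by the human visual system before they are passed to ResNet50, but does not list the exact descriptors. The following is a minimal sketch of plausible features of this kind, assuming HSV color statistics, global RMS contrast, Sobel gradient sharpness, and a uniform LBP texture histogram; every concrete choice here is an illustrative assumption, not the paper's implementation.

```python
# Hypothetical sketch of HVS-inspired manual features (color, contrast,
# texture); the paper's actual feature set is not specified in the abstract.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def manual_features(bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

    # Color: mean saturation and value from HSV as colorfulness/brightness cues.
    sat_mean = hsv[..., 1].mean() / 255.0
    val_mean = hsv[..., 2].mean() / 255.0

    # Contrast: standard deviation of luminance (global RMS contrast).
    contrast = gray.std() / 255.0

    # Sharpness: mean Sobel gradient magnitude as an edge-strength proxy.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    sharpness = np.sqrt(gx ** 2 + gy ** 2).mean() / 255.0

    # Texture: uniform LBP histogram (10 bins for P=8, R=1).
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

    return np.concatenate([[sat_mean, val_mean, contrast, sharpness], hist])
```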
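The dual cross-attention fusion module merges manual and deep features with combined spatial and channel attention, but the abstract does not specify its internal wiring. Below is a minimal PyTorch sketch of one plausible arrangement, assuming an SE-style channel gate and a CBAM-style spatial gate applied crosswise so that each branch re-weights the other; the class names and the cross-application pattern are assumptions for illustration only.

```python
# Hypothetical sketch of dual cross-attention fusion of two feature maps;
# the paper's exact module design is not given in the abstract.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):  # SE-style channel gating
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        return self.fc(x).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)

class SpatialAttention(nn.Module):  # CBAM-style spatial gating
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))  # (B, 1, H, W)

class DualCrossAttentionFusion(nn.Module):
    """Each branch is re-weighted by attention computed from the other."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, manual_feat, deep_feat):
        # Channel weights from the deep branch gate the manual branch ...
        m = manual_feat * self.ca(deep_feat)
        # ... spatial weights from the manual branch gate the deep branch.
        d = deep_feat * self.sa(manual_feat)
        return self.proj(torch.cat([m, d], dim=1))
```

In this sketch both branches are assumed to be projected to the same (B, C, H, W) shape before fusion, so the channel gate carries global context while the spatial gate preserves local detail.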
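The reported figures (0.932/0.885 Pearson, 0.929/0.858 Spearman) are standard IQA metrics: the Pearson linear correlation coefficient (PLCC) measures prediction accuracy and the Spearman rank-order correlation coefficient (SROCC) measures prediction monotonicity between predicted scores and subjective MOS values. They can be computed directly with SciPy, as in this sketch.

```python
# PLCC and SROCC between predicted quality scores and subjective MOS values.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def iqa_metrics(pred: np.ndarray, mos: np.ndarray) -> tuple[float, float]:
    plcc, _ = pearsonr(pred, mos)    # linear correlation (accuracy)
    srocc, _ = spearmanr(pred, mos)  # rank correlation (monotonicity)
    return plcc, srocc
```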
Overall structure diagram of the proposed method
Diagram of manual feature extraction
ResNet50 architecture diagram
Diagram of the shifted window operation in Swin Transformer
Swin-AK Transformer architecture diagram
Swin-AK block architecture diagram
AKConv architecture diagram
Structure diagram of the dual cross-attention fusion module
Channel attention network structure diagram
Structure diagram of the spatial attention module
Scatter plots of image attribute scores versus overall subjective quality scores in the SPAQ dataset. (a) Brightness; (b) Colorfulness; (c) Sharpness
Scatter plot on the LIVE-C dataset
Scatter plot on the SPAQ dataset
Comparison of attention heatmaps between Swin Transformer and Swin-AK Transformer
MOS values versus the proposed method's quality predictions on the SPAQ dataset
MOS values versus the proposed method's quality predictions on the LIVE-C dataset