Citation: | Peng Bo, Luo Shasha, Yang Feng, et al. Performance analysis of a sum-table-based method for computing cross-correlation in GPU-accelerated ultrasound strain elastography[J]. Opto-Electronic Engineering, 2019, 46(6): 180437. doi: 10.12086/oee.2019.180437 |
[1] | Jiang J, Hall T J. A parallelizable real-time motion tracking algorithm with applications to ultrasonic strain imaging[J]. Physics in Medicine & Biology, 2007, 52(13): 3773-3790. doi: 10.1088/0031-9155/52/13/008 |
[2] | Chen L J, Treece G M, Lindop J E, et al. A quality-guided displacement tracking algorithm for ultrasonic elasticity imaging[J]. Medical Image Analysis, 2009, 13(2): 286-296. doi: 10.1016/j.media.2008.10.007 |
[3] | Peng B, Wang Y Q, Hall T J, et al. A GPU-accelerated 3-D coupled subsample estimation algorithm for volumetric breast strain elastography[J]. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2017, 64(4): 694-705. doi: 10.1109/TUFFC.2017.2661821 |
[4] | Zhou Y J, Zheng Y P. A motion estimation refinement framework for real-time tissue axial strain estimation with freehand ultrasound[J]. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2010, 57(9): 1943-1951. doi: 10.1109/TUFFC.2010.1642 |
[5] | Luo J W, Konofagou E E. A fast normalized cross-correlation calculation method for motion estimation[J]. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2010, 57(6): 1347-1357. doi: 10.1109/TUFFC.2010.1554 |
[6] | Zhu Y N, Hall T J. A modified block matching method for real-time freehand strain imaging[J]. Ultrasonic Imaging, 2002, 24(3): 161-176. doi: 10.1177/016173460202400303 |
[7] | D'Hooge J, Bijnens B, Thoen J, et al. Echocardiographic strain and strain-rate imaging: a new tool to study regional myocardial function[J]. IEEE Transactions on Medical Imaging, 2002, 21(9): 1022-1030. doi: 10.1109/TMI.2002.804440 |
[8] | Konofagou E E, D'Hooge J, Ophir J. Myocardial elastography--a feasibility study in vivo[J]. Ultrasound in Medicine & Biology, 2002, 28(4): 475-482. doi: 10.1016/S0301-5629(02)00488-X |
[9] | Lewis J P. Fast template matching[J]. Proceeding of Vision Interface, 1995, 32(4): 351-361. |
[10] | Yang X, Deka S, Righetti R. A hybrid CPU-GPGPU approach for real-time elastography[J]. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2011, 58(12): 2631-2645. doi: 10.1109/TUFFC.2011.2126 |
[11] | Peng B, Huang L. GPU-accelerated sub-sample displacement estimation method for real-time ultrasound elastography[J]. Opto-Electronic Engineering, 2016, 43(6): 83-88. doi: 10.3969/j.issn.1003-501X.2016.06.014 |
[12] | Peng B, Chen Y, Liu D Q. Investigation of GPU-based ultrasound elastography[J]. Opto-Electronic Engineering, 2013, 40(5): 97-105. doi: 10.3969/j.issn.1003-501X.2013.05.014 |
[13] | Rosenzweig S, Palmeri M, Nightingale K. GPU-based real-time small displacement estimation with ultrasound[J]. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2011, 58(2): 399-405. doi: 10.1109/TUFFC.2011.1817 |
[14] | Chang L W, Hsu K H, Li P C. GPU-based color Doppler ultrasound processing[C]//2009 IEEE International Ultrasonics Symposium. Rome, Italy, 2009. |
[15] | Sun X, Wang S S, Song J J, et al. Toward parallel optimal computation of ultrasound computed tomography using GPU[J]. Proceedings of SPIE, 2018, 10580: 105800R. |
[16] | Sengupta S, Harris M, Garland M, et al. Efficient parallel scan algorithms for GPUs[M]//Kurzak J, Bader D A, Dongarra J. Scientific Computing with Multicore and Accelerators. Boca Raton: Taylor & Francis, 2008. |
[17] | Blelloch G E. Scans as primitive parallel operations[J]. IEEE Transactions on Computers, 1989, 38(11): 1526-1538. doi: 10.1109/12.42122 |
[18] | Jensen J A. Field: A program for simulating ultrasound systems[J]. Medical & Biological Engineering & Computing, 1996, 34(1): 351-352. |
[19] | Luo J W, Bai J, He P, et al. Axial strain calculation using a low-pass digital differentiator in ultrasound elastography[J]. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2004, 51(9): 1119-1127. doi: 10.1109/TUFFC.2004.1334844 |
[20] | Du H N, Liu J, Pellot-Barakat C, et al. Optimizing multicompression approaches to elasticity imaging[J]. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2006, 53(1): 90-99. doi: 10.1109/TUFFC.2006.1588394 |
Overview: In our ultrasound strain elastography system, a modified block-matching algorithm is used to estimate tissue motion. Local strains are then estimated and used as surrogates of tissue elasticity. Computing the correlation within the block-matching framework is a critical and computationally intensive step. Because the correlation calculations are largely independent of one another, graphics processing units (GPUs) have been used to improve computational efficiency through massively parallel programming. It is known from the literature that the sum-table-based method can greatly reduce the computational burden of calculating the normalized correlation coefficient (NCC) in a serial computing environment; this method is abbreviated as ST-NCC below. However, the performance of ST-NCC on a parallel computing platform, particularly a GPU, has yet to be investigated. The objective of this study is therefore to investigate the performance of the ST-NCC method for the above-mentioned GPU-accelerated ultrasound strain elastography. Specifically, a published ST-NCC method by Luo et al. and the conventional NCC method were both implemented in CUDA (Version 9.0, NVIDIA Inc., CA, USA) and tested on an NVIDIA GeForce GTX TITAN X card. Two basic CUDA programming strategies were applied to all CUDA implementations to improve computational efficiency. First, to increase the memory bandwidth of the GPU, the 2-D RF signals were stored in texture memory before the cross-correlation was calculated. Second, variables that require frequent access (e.g., axial and lateral search ranges) were stored in read-only memory for rapid access. As for advanced CUDA programming strategies, a classic parallel scan method was adopted to generate the sum-table data for the ST-NCC method.
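To make the idea of building a sum table with a parallel scan more concrete, the following is a minimal Python sketch. The overview does not state which scan variant was used, so this shows the work-efficient up-sweep/down-sweep exclusive scan described by Blelloch as one plausible instance, not the authors' CUDA kernel. The function names `exclusive_scan` and `sum_table` and the sample values are hypothetical. Each inner `for` loop is a sequential stand-in for work that GPU threads would perform concurrently, since its iterations touch disjoint array elements.

```python
def exclusive_scan(values):
    """Exclusive prefix sum using an up-sweep and a down-sweep pass.

    Assumption: this mirrors the data-access pattern of a work-efficient
    GPU scan; the inner loops run sequentially here, but their iterations
    are independent and could each be assigned to one thread.
    """
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    a = list(values) + [0] * (size - n)  # pad to a power of two

    # Up-sweep (reduce): build partial sums in a binary tree.
    d = 1
    while d < size:
        for i in range(2 * d - 1, size, 2 * d):
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: push prefix sums back down the tree.
    a[size - 1] = 0
    d = size // 2
    while d >= 1:
        for i in range(2 * d - 1, size, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a[:n]


def sum_table(values):
    """Return table with table[k] == sum(values[:k]) for k = 0..len(values)."""
    scan = exclusive_scan(values)
    total = scan[-1] + values[-1] if values else 0
    return scan + [total]


# Hypothetical RF samples; the table stores running sums of their squares.
squares = [x * x for x in [3, 1, 7, 0, 4, 1, 6, 3]]
table = sum_table(squares)
# Energy of the 4-sample window starting at index 2, from two table lookups.
window_energy = table[2 + 4] - table[2]
```

Once the table exists, the sum over any window costs two lookups and one subtraction regardless of the window length, which is the property the sum-table approach relies on.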
For the classic NCC method, several different on-chip memory optimization strategies were implemented and compared against one another. Only the most computationally efficient implementation was compared with the above-mentioned GPU-accelerated ST-NCC method. Finally, performance was assessed using simulated ultrasound data, generated through both finite element modeling and acoustic simulation. Both displacement tracking accuracy and computational efficiency were evaluated. Based on the data investigated, we found that, on the GPU platform, the implemented ST-NCC method did not further improve computational efficiency compared with the classic NCC method implemented on the same GPU platform. Both methods achieved comparable displacement tracking accuracy.
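The relationship between the two correlation methods compared above can be sketched in a few lines of Python. This is a simplified 1-D illustration under stated assumptions, not the formulation of Luo et al. or the authors' CUDA code: the signals are synthetic, the correlation is computed without mean subtraction, lags are non-negative, and the names `ncc_direct` and `ncc_sum_table` are hypothetical.

```python
from itertools import accumulate
from math import cos, sin, sqrt


def ncc_direct(f, g, u, lag, w):
    """Classic NCC for one window: every sum is recomputed over w samples."""
    idx = range(u, u + w)
    num = sum(f[n] * g[n + lag] for n in idx)
    ef = sum(f[n] ** 2 for n in idx)
    eg = sum(g[n + lag] ** 2 for n in idx)
    return num / sqrt(ef * eg) if ef > 0 and eg > 0 else 0.0


def ncc_sum_table(f, g, lag, w):
    """NCC for all window starts at one lag, using three sum tables."""
    m = min(len(f), len(g) - lag)  # samples where f[n] and g[n + lag] both exist

    def table(vals):
        # table[k] holds the sum of the first k values.
        return [0.0] + list(accumulate(vals))

    t_fg = table(f[n] * g[n + lag] for n in range(m))
    t_ff = table(f[n] ** 2 for n in range(m))
    t_gg = table(g[n + lag] ** 2 for n in range(m))

    out = []
    for u in range(m - w + 1):
        # Each window sum is a difference of two table entries.
        num = t_fg[u + w] - t_fg[u]
        ef = t_ff[u + w] - t_ff[u]
        eg = t_gg[u + w] - t_gg[u]
        out.append(num / sqrt(ef * eg) if ef > 0 and eg > 0 else 0.0)
    return out


# Synthetic test signal (an assumption): g is f delayed by two samples.
f = [sin(0.7 * n) + 0.3 * cos(1.9 * n) for n in range(40)]
g = [0.0, 0.0] + f
peak = ncc_sum_table(f, g, 2, 8)  # correlation at the true lag
```

Both functions return the same coefficients up to floating-point rounding, which is consistent with the comparable tracking accuracy reported above. The difference lies in cost: the direct form does work proportional to the window length for every window, while the sum-table form pays once per lag to build the tables and then a constant amount per window.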
Diagram of ultrasonic speckle motion tracking process
An illustration of the parallel scan method for calculating the sum-table
Comparison of computation time for method 1 under different optimization strategies
A comparative performance analysis of the CPU and GPU implementations. (a) and (d) are lateral and axial displacements obtained using the CPU implementation of method 1; (b) and (e) are the corresponding strain images; (c) and (f) are difference images of lateral and axial displacements between the two methods on the CPU; (g) and (i) are difference images of lateral and axial displacements between the CPU and GPU implementations of method 1; (h) and (j) are difference images of displacement between the GPU implementations of method 1 and method 2
Comparison of computation time of the two methods under different cross-correlation tracking windows. (a) Computation time of the CPU implementations of the two methods; (b) Computation time of the GPU implementations of the two methods
Comparison of computation time of the two methods under different search ranges. (a) Computation time of the CPU implementations of the two methods; (b) Computation time of the GPU implementations of the two methods