Performance Evaluation of the Intel Many Integrated Core Architecture for 3D Image Reconstruction in Computed Tomography

Language
en
Document Type
Master Thesis
Issue Date
2014-01-20
Issue Year
2013
Authors
Hofmann, Johannes
Editor
Abstract

The computational effort of 3D image reconstruction in Computed Tomography (CT) has long required special-purpose hardware. Custom-built FPGA systems and GPUs are still widely used today, particularly in interventional settings, where radiologists impose a hard time constraint on reconstruction. However, it has recently been shown that even commodity CPUs can now perform the reconstruction within the imposed time constraint.
In this thesis, we examine the Intel Many Integrated Core (MIC) architecture for its suitability to run the Feldkamp-Davis-Kress (FDK) algorithm, the most commonly used algorithm for 3D image reconstruction in cone-beam computed tomography. Compared to traditional CPUs, the MIC accelerator card, which targets numerical applications, is expected to deliver higher performance using the same programming models, such as C, C++, and Fortran.
A thorough analysis of the MIC architecture is performed to determine potential hardware bottlenecks and to distinguish its design from that of a current state-of-the-art two-socket Intel Sandy Bridge EP CPU system. We study the challenges of efficiently parallelizing the FDK kernel on the Intel MIC and find that careful OpenMP scheduling and thread placement are required due to the lack of a shared last-level cache. Efficient data sharing on the Intel MIC can only occur between the hardware threads of a core via its local L1 and L2 cache segments.
Apart from parallelization, SIMD vectorization is critical for good performance on the Intel MIC, whose vector registers are twice the size of those found in contemporary CPUs. To assess the difficulty of harnessing the full potential of vectorization on the MIC platform, we explore various approaches to vectorizing the kernel: auto-vectorization using the Intel C Compiler and the Intel SPMD Compiler, as well as manual vectorization using C with intrinsics and hand-written assembly.
We use the fastest available CPU implementation, by Treibig et al., developed for the RabbitCT benchmarking framework, as the starting point for our optimizations. By improving the original implementation, we speed up execution by 25% on the CPU. In line with the estimate of our performance model, measurements on the Intel MIC show a speedup of 1.5 over the reference CPU system. Our analysis reveals that the major bottleneck for our application is a hardware shortcoming: most of the data required for the reconstruction is scattered in memory, and gathering this data into vector registers for processing is still done sequentially on the Intel MIC. While computations in the kernel benefit from vectorization, the sequential loading limits the maximum achievable speedup in accordance with Amdahl's law.
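As a rough illustration of the Amdahl's law argument, and not a result taken from the thesis: let p be the fraction of kernel runtime that benefits from vectorization and N the number of SIMD lanes (16 for single-precision values in a 512-bit register). The value p = 0.5 below is purely hypothetical, chosen only to show how the sequential gather portion caps the gain.

```latex
% Amdahl's law: only the fraction p of the runtime is accelerated N-fold.
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}

% Hypothetical example (p is an assumption, not a measured value):
% p = 0.5, N = 16
S(16) = \frac{1}{0.5 + \dfrac{0.5}{16}} = \frac{1}{0.53125} \approx 1.88
\quad \text{(upper bound: } \tfrac{1}{1 - 0.5} = 2 \text{)}
```

Under this assumption, widening the vector registers further would add almost nothing, since the sequentially executed loads dominate the remaining runtime.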

DOI
Faculties & Collections
Associated ORCIDs