
A dynamic CTA scheduling scheme for massive parallel computing


Abstract

Modern computing devices process massively parallel data, which requires substantial computing hardware. To satisfy this growing demand, GPUs, which provide powerful computational capability, are employed to execute both graphics and general-purpose applications (GPGPU computing). On a GPGPU, executing multiple applications concurrently can increase data parallelism and thus resource utilization, which in turn improves performance. However, different applications have different execution times depending on their workload sizes. If one application completes earlier than the others, the hardware resources allocated to it become idle, causing resource underutilization. In this work, a CTA-aware dynamic streaming multiprocessor (SM) scheduling scheme is proposed for multi-application execution on the GPGPU to manage hardware resources efficiently. Simulation results show that the proposed CTA-aware dynamic SM scheduling scheme improves GPU performance by 25.6% on average.
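
The mechanism described in the abstract can be illustrated with a small Python sketch. This is not the authors' simulator or implementation; all names (Application, DynamicSMScheduler) and the workload parameters are illustrative assumptions. Two concurrent applications each hold a share of the SMs; when one runs out of CTAs, its idle SMs are handed to the application that still has pending CTAs instead of sitting unused.

# Minimal sketch of CTA-aware dynamic SM scheduling for two concurrent GPU
# applications. Illustrative model only, not the paper's simulator code;
# all names and parameters are hypothetical.

from collections import deque

class Application:
    def __init__(self, name, num_ctas, cycles_per_cta):
        self.name = name
        self.pending = deque(range(num_ctas))   # CTAs not yet issued to any SM
        self.running = {}                       # sm_id -> remaining cycles of its CTA
        self.cycles_per_cta = cycles_per_cta

    def finished(self):
        return not self.pending and not self.running

class DynamicSMScheduler:
    def __init__(self, apps, num_sms):
        self.apps = apps
        # Start from an even static partition of SMs between the applications.
        self.owner = {sm: apps[sm % len(apps)] for sm in range(num_sms)}

    def step(self):
        for sm, app in list(self.owner.items()):
            if sm in app.running:
                # Advance the CTA currently executing on this SM.
                app.running[sm] -= 1
                if app.running[sm] == 0:
                    del app.running[sm]
            elif app.pending:
                # Issue the next pending CTA of the owning application.
                app.pending.popleft()
                app.running[sm] = app.cycles_per_cta
            else:
                # CTA-aware reallocation: the owner has no work left for this SM,
                # so hand the SM to an application that still has pending CTAs.
                for other in self.apps:
                    if other is not app and other.pending:
                        self.owner[sm] = other
                        break

    def run(self):
        cycles = 0
        while not all(a.finished() for a in self.apps):
            self.step()
            cycles += 1
        return cycles

apps = [Application("A", num_ctas=64, cycles_per_cta=10),
        Application("B", num_ctas=256, cycles_per_cta=10)]
print("total cycles:", DynamicSMScheduler(apps, num_sms=16).run())

Under a purely static partition, the SMs assigned to the shorter application would simply idle after its last CTA retires; the reallocation branch above is the step the CTA-aware dynamic scheme adds.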




Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A3A01019454), and it was also supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-R2718-16-0011) supervised by the IITP (Institute for Information & Communications Technology Promotion).

Author information


Corresponding author

Correspondence to Cheol Hong Kim.

Rights and permissions

Reprints and permissions

About this article


Cite this article

Son, D.O., Do, C.T., Choi, H.J. et al. A dynamic CTA scheduling scheme for massive parallel computing. Cluster Comput 20, 781–787 (2017). https://doi.org/10.1007/s10586-017-0768-9


