Abstract
Historically, high-energy physics computing has been performed on large purpose-built computing systems. These began as single-site compute facilities but have evolved into the distributed computing grids used today. Recently, the capacity and capability of commercial clouds have grown exponentially. Cloud resources are highly virtualized and designed to be deployed flexibly for a wide variety of computing tasks, and cloud providers have shown growing interest in demonstrating their ability to support large-scale scientific computing. In this paper, we present results from the CMS experiment using the Fermilab HEPCloud facility, which combined local Fermilab resources with virtual machines in the Amazon Web Services Elastic Compute Cloud. We describe the planning, technical challenges, and lessons learned in running physics workflows on a large set of virtualized resources. We also discuss the economics and operational efficiencies of executing workflows in the cloud compared with dedicated resources.
Notes
Squid: Optimizing Web Delivery, http://www.squid-cache.org.
AWS provides an API to provision both individual machines and in bulk (“spot fleet”). At the time of our demonstration, our underlying provisioning tools did not support spot fleet.
Grafana: The open platform for analytics and monitoring, https://grafana.com.
Instance types that provided smaller contributions are not included.
Acknowledgements
This work was partially supported by Fermilab, operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy; by the National Science Foundation under Grant ACI-1450377 and Cooperative Agreement PHY-1120138; and by the AWS Cloud Credits for Research program. On behalf of all authors, the corresponding author states that there is no conflict of interest.
Cite this article
Holzman, B., Bauerdick, L.A.T., Bockelman, B. et al. HEPCloud, a New Paradigm for HEP Facilities: CMS Amazon Web Services Investigation. Comput Softw Big Sci 1, 1 (2017). https://doi.org/10.1007/s41781-017-0001-9