HEPCloud, a New Paradigm for HEP Facilities: CMS Amazon Web Services Investigation

Original Article, published in Computing and Software for Big Science

Abstract

Historically, high energy physics computing has been performed on large purpose-built computing systems. These began as single-site compute facilities but have evolved into the distributed computing grids used today. Recently, there has been an exponential increase in the capacity and capability of commercial clouds. Cloud resources are highly virtualized and designed to be flexibly deployed for a variety of computing tasks. There is growing interest among cloud providers in demonstrating the capability to perform large-scale scientific computing. In this paper, we discuss results from the CMS experiment using the Fermilab HEPCloud facility, which utilized both local Fermilab resources and virtual machines in the Amazon Web Services Elastic Compute Cloud. We discuss the planning, technical challenges, and lessons learned in performing physics workflows on a large-scale set of virtualized resources. In addition, we discuss the economics and operational efficiencies of executing workflows both in the cloud and on dedicated resources.


Notes

  1. Squid: Optimizing Web Delivery, http://www.squid-cache.org.

  2. https://aws.amazon.com/cloudformation/.

  3. https://aws.amazon.com/elasticloadbalancing/.

  4. https://aws.amazon.com/autoscaling/.

  5. https://aws.amazon.com/route53/.

  6. https://aws.amazon.com/cloudwatch/.

  7. https://aws.amazon.com/s3/.

  8. AWS provides APIs to provision machines both individually and in bulk (“spot fleet”). At the time of our demonstration, our underlying provisioning tools did not support spot fleet; see the provisioning sketch following these notes.

  9. Grafana: the open platform for analytics and monitoring, https://grafana.com.

  10. http://dashboard.cern.ch.

  11. https://www.elastic.co.

  12. https://aws.amazon.com/ec2/instance-types/.

  13. Instance types that provided smaller contributions are not included.

  14. https://github.com/holzman/glidein-scripts.

  15. The EOS storage system [30] implements the SRM protocol by deploying the BeStMan [31] software package, which is not well supported. We switched to the xrootd protocol, which EOS supports natively; see the transfer sketch following these notes.
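
As a concrete illustration of note 8, the sketch below contrasts the two provisioning styles using boto3, the AWS SDK for Python. This is not the tooling used in the demonstration; the AMI ID, key pair, fleet IAM role, bid price, and instance types are hypothetical placeholders.

# Sketch only: per-type spot requests vs. a bulk "spot fleet" request.
# All identifiers below (AMI, key pair, role ARN) are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Individual provisioning: one request per instance type, the mode our
# underlying provisioning tools supported at the time of the demonstration.
ec2.request_spot_instances(
    SpotPrice="0.25",                      # bid ceiling in USD/hour
    InstanceCount=10,
    LaunchSpecification={
        "ImageId": "ami-00000000",         # placeholder AMI
        "InstanceType": "m3.2xlarge",
        "KeyName": "hepcloud-demo",        # placeholder key pair
    },
)

# Bulk provisioning: a single spot-fleet request spanning several instance
# types; AWS fills the target capacity from the available capacity pools.
ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",  # placeholder
        "TargetCapacity": 100,
        "SpotPrice": "0.25",
        "LaunchSpecifications": [
            {"ImageId": "ami-00000000", "InstanceType": itype}
            for itype in ("m3.2xlarge", "c3.2xlarge", "r3.2xlarge")
        ],
    },
)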
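
Similarly, for note 15, the sketch below stages an output file out to EOS over the xrootd protocol rather than SRM. It simply shells out to the standard xrdcp client; the endpoint and paths are hypothetical.

# Sketch only: copy a job output file to EOS via xrootd, replacing an
# SRM/BeStMan transfer. The host and paths below are placeholders.
import subprocess

local_file = "step_output.root"
destination = "root://eos.example.cern.ch//eos/cms/store/temp/step_output.root"

# xrdcp exits non-zero on failure; check=True raises CalledProcessError
# so the job wrapper can retry or report the stage-out error.
subprocess.run(["xrdcp", "--force", local_file, destination], check=True)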

References

  1. Pordes R, Petravick D, Kramer B, Olson D, Livny M, Roy A, Avery P, Blackburn K, Wenaus T, Würthwein F, Foster I, Gardner R, Wilde M, Blatecky A, McGee J, Quick R (2007) The open science grid. J Phys Conf Ser 78:012057

  2. Evans L, Bryant P (2008) LHC machine. J Instrum 3:S08001

  3. Apollinari G, Alonso IB, Brüning O, Lamont M, Rossi L (eds) (2015) High-luminosity large hadron collider (HL-LHC): preliminary design report. http://cds.cern.ch/record/2116337. Accessed 11 Jul 2017

  4. Particle Physics Project Prioritization Panel (2014) Building for discovery: strategic plan for US particle physics in the global context. https://science.energy.gov/~/media/hep/hepap/pdf/May-2014/FINAL_P5_Report_053014.pdf. Accessed 11 Jul 2017

  5. Mahmood Z, Saeed S (eds) (2013) Software engineering frameworks for the cloud computing paradigm. Springer, London

  6. Augé E, Dumarchez J, Tran Thanh Van J (eds) (2016) Proceedings, 51st Rencontres de Moriond on QCD and High Energy Interactions: La Thuile, Italy, March 19-26, 2016. ARISF

  7. Augé E, Dumarchez J, Tran Thanh Van J (eds) (2016) Proceedings, 51st Rencontres de Moriond on Electroweak Interactions and Unified Theories: La Thuile, Italy, March 12-19, 2016. ARISF

  8. CMS Public Physics Results (2016). http://cms-results.web.cern.ch/cms-results/public-results/publications/. Accessed 07 Apr 2017

  9. Taylor R, Berghaus F, Brasolin F, Cordeiro C, Desmarais R, Field L, Gable I, Giordano D, Di Girolamo A, Hover J, LeBlanc M, Love P, Paterson M, Sobie R, Zaytsev A (2015) The evolution of cloud computing in ATLAS. J Phys Conf Ser 664:022038

  10. Benjamin D, Caballero J, Ernst M, Guan W, Hover J, Lesny D, Maeno T, Nilsson P, Tsulaia V, van Gemmeren P, Vaniachine A, Wang F, Wenaus T (2016) Scaling up ATLAS event service to production levels on opportunistic computing platforms. J Phys Conf Ser 762:012027

  11. Grzymkowski R, Hara T (2015) Belle II public and private cloud management in VMDIRAC system. J Phys Conf Ser 664:022021

  12. Andronis A, Bauer D, Chaze O, Colling D, Dobson M, Fayer S, Girone M, Grandi C, Huffman A, Hufnagel D, Khan F, Lahiff A, McCrea A, Rand D, Sgaravatto M, Tiradani A, Zhang X (2015) The diverse use of clouds by CMS. J Phys Conf Ser 664:022012

  13. Gartner Group (2015) Magic Quadrant for Cloud Infrastructure as a Service, Worldwide. https://www.gartner.com/doc/reprints?id=1-2G45TQU&ct=150519&st=sb. Accessed 07 Apr 2017

  14. Andrews W, Bockelman B, Bradley B, Dost J, Evans D, Fisk I, Frey J, Holzman B, Livny M, Martin T, McCrea A, Melo A, Metson S, Pi H, Sfiligoi I, Sheldon P, Tannenbaum T, Tiradani A, Würthwein F, Weitzel D (2011) Early experience on using glideinWMS in the cloud. J Phys Conf Ser 331:062014

  15. Evans D, Fisk I, Holzman B, Melo A, Metson S, Pordes R, Sheldon P, Tiradani A (2011) Using Amazon’s Elastic Compute Cloud to dynamically scale CMS computational resources. J Phys Conf Ser 331:062031

  16. Amazon Web Services (2016) AWS offers data egress discount to researchers. https://aws.amazon.com/blogs/publicsector/aws-offers-data-egress-discount-to-researchers/. Accessed 07 Apr 2017

  17. Fermilab (2015) Fermilab request for proposals for cloud resources and services. http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5735. Accessed 07 Apr 2017

  18. Blomer J, Buncic P, Charalampidis I, Harutyunyan A, Larsen D, Meusel R (2012) Status and future perspectives of CernVM-FS. J Phys Conf Ser 396:052013

  19. Blumenfeld B, Dykstra D, Lueking L, Wicklund E (2008) CMS conditions data access using FroNTier. J Phys Conf Ser 119:072007

  20. Timm S, Garzoglio G, Mhashilkar P, Boyd J, Bernabeu G, Sharma N, Peregonow N, Kim H, Noh S, Palur S, Raicu I (2015) Cloud services for Fermilab stakeholders. J Phys Conf Ser 664:022039

  21. Amazon EC2 Instance Types (2016). https://aws.amazon.com/ec2/instance-types/. Accessed 07 Apr 2017

  22. Levshina T, Sehgal C, Bockelman B, Weitzel D, Guru A (2014) Grid accounting service: state and future development. J Phys Conf Ser 513:032056

  23. Sfiligoi I, Bradley DC, Holzman B, Mhashilkar P, Padhi S, Würthwein F (2009) The pilot way to grid resources using glideinWMS. Proceedings of the 2009 WRI World Congress on computer science and information engineering—Volume 02 (CSIE ’09). IEEE Computer Society, Washington, DC, pp 428–432

  24. Mhashilkar P, Tiradani A, Holzman B, Larson K, Sfiligoi I, Rynge M (2014) Cloud bursting with glideinWMS: means to satisfy ever increasing computing needs for scientific workflows. J Phys Conf Ser 513:032069

  25. Thain D, Tannenbaum T, Livny M (2005) Distributed computing in practice: the Condor experience. Concurr Comput Pract Exper 17:323–356

  26. Balcas J, Bockelman B, Hufnagel D, Hurtado Anampa K, Aftab Khan F, Larson K, Letts J, Marra da Silva J, Mascheroni M, Mason D, Perez-Calero Yzquierdo A, Tiradani A (2017) Stability and scalability of the CMS Global Pool: pushing HTCondor and glideinWMS to new limits. Paper presented at the 22nd international conference on computing in high energy and nuclear physics (CHEP 2016), San Francisco, California, 11 October 2016

  27. Cinquilli M, Evans D, Foulkes S, Hufnagel D, Mascheroni M, Norman M, Maxa Z, Melo A, Metson S, Riahi H, Ryu S, Spiga D, Vaandering E, Wakefield S, Wilkinson R (2012) The CMS workload management system. J Phys Conf Ser 396:032113

  28. Wu H, Ren S, Timm S, Garzoglio G, Noh S (2015) Experimental study of bidding strategies for scientific workflows using AWS spot instances. Paper presented at the 8th IEEE workshop on many-task computing on grids and supercomputers (MTAGS), Austin, Texas, 15 November 2015. http://datasys.cs.iit.edu/events/MTAGS15/p02.pdf. Accessed 05 Jul 2017

  29. Jones CD, Paterno M, Kowalkowski J, Sexton-Kennedy L, Tanenbaum W (2006) The new CMS event data model and framework. Proc CHEP 2006 1:248–251

  30. Peters AJ, Sindrilaru EA, Adde G (2015) EOS as the present and future solution for data storage at CERN. J Phys Conf Ser 664:042042

  31. Berkeley Storage Manager (BeStMan) (2017). https://sdm.lbl.gov/bestman/. Accessed 22 Jun 2017

  32. Fuess S (2016) Fermilab facility service costing. Unpublished; private communication

  33. Timm S, Cooper RG, Fuess S, Garzoglio G, Grassano D, Holzman B, Kennedy R, Kim H, Krishnamurthy R, Ren S, Tiradani A, Vinayagam S, Wu H (2017) Virtual machine provisioning, code management and data movement design for the Fermilab HEPCloud facility. Paper presented at the 22nd international conference on computing in high energy and nuclear physics (CHEP 2016), San Francisco, California, 13 October 2016

  34. Buitrago P, Fuess S, Garzoglio G, Himmel A, Holzman B, Kennedy R, Kim H, Norman A, Spentzouris P, Timm S, Tiradani A (2016) The NOvA experience on HEPCloud: Amazon Web Services demonstration. http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=5774. Accessed 07 Apr 2017

  35. Worldwide LHC Computing Grid (2016). http://wlcg.web.cern.ch. Accessed 07 Apr 2017

Acknowledgements

This work was partially supported by Fermilab, operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy; by the National Science Foundation under Grant ACI-1450377 and Cooperative Agreement PHY-1120138; and by the AWS Cloud Credits for Research program. On behalf of all authors, the corresponding author states that there is no conflict of interest.

Author information

Corresponding author

Correspondence to Burt Holzman.

About this article

Cite this article

Holzman, B., Bauerdick, L.A.T., Bockelman, B. et al. HEPCloud, a New Paradigm for HEP Facilities: CMS Amazon Web Services Investigation. Comput Softw Big Sci 1, 1 (2017). https://doi.org/10.1007/s41781-017-0001-9
