Flux: A Next-Generation Resource Management Framework for Large HPC Centers

Dong H. Ahn, Jim Garlick, Mark Grondona, Don Lipari, Becky Springmeyer, Martin Schulz

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

41 Scopus citations

Abstract

Resource and job management software is crucial to High Performance Computing (HPC) for efficient application execution. However, current systems and approaches can no longer keep up with the challenges large HPC centers are facing due to ever-increasing system scales, resource and workload diversity, interplays between various resources (e.g., between compute clusters and a global file system), and complexity of resource constraints such as strict power budgeting. To address this gap, we propose Flux, an extensible job and resource management framework specifically designed to deal with the requirements of next-generation HPC centers. Flux targets an entire computing facility as one common pool of diverse sets of resources, enabling the facility to accommodate site-wide constraints (e.g., for power limits). Yet its distributed design still supports scalable and effective scheduling strategies. This paper details the design of Flux and describes and evaluates our initial prototyping effort of the key run-time components. Our results show that our run-time prototype provides strong and predictable scalability.

Original language: English
Title of host publication: Proceedings - 43rd International Conference on Parallel Processing Workshops, ICPPW 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 9-17
Number of pages: 9
ISBN (Electronic): 9781479956159
State: Published - 7 May 2015
Externally published: Yes
Event: 43rd International Conference on Parallel Processing Workshops, ICPPW 2014 - Minneapolis, United States
Duration: 9 Sep 2014 to 12 Sep 2014

Publication series

Name: Proceedings of the International Conference on Parallel Processing Workshops
Volume: 2015-May
ISSN (Print): 1530-2016

Conference

Conference: 43rd International Conference on Parallel Processing Workshops, ICPPW 2014
Country/Territory: United States
City: Minneapolis
Period: 9/09/14 to 12/09/14

Keywords

  • communication framework
  • key value store
  • resource management
  • run-time
  • scalable process management services
