TY - GEN
T1 - Production Hardware Overprovisioning
T2 - 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017
AU - Sakamoto, Ryuichi
AU - Cao, Thang
AU - Kondo, Masaaki
AU - Inoue, Koji
AU - Ueda, Masatsugu
AU - Patki, Tapasya
AU - Ellsworth, Daniel
AU - Rountree, Barry
AU - Schulz, Martin
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/6/30
Y1 - 2017/6/30
AB - Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installing more hardware resources than can be fully powered under a given power limit, coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and for comparison of different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment.
KW - Design of resource manager
KW - Overprovisioned
KW - Plugin interface
KW - Power characteristics of HPC system
KW - Power-constrained HPC system
UR - http://www.scopus.com/inward/record.url?scp=85027722567&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2017.107
DO - 10.1109/IPDPS.2017.107
M3 - Conference contribution
AN - SCOPUS:85027722567
T3 - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017
SP - 957
EP - 966
BT - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 29 May 2017 through 2 June 2017
ER -