Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-Threaded Programs

Xiang Fu, Shiman Meng, Weiping Zhang, Luanzheng Guo, Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

OpenMP remains the most popular shared-memory programming framework, even after many years and the arrival of many alternatives. However, its high degree of non-deterministic execution makes debugging and testing more challenging. The ability to record and deterministically replay program execution is key to addressing this challenge, yet scalably replaying OpenMP programs is still an unresolved problem. In this paper, we propose two novel techniques that use Distributed Clock (DC) and Distributed Epoch (DE) recording schemes to eliminate excessive thread synchronization for OpenMP record and replay. Our evaluation on representative HPC applications with ReOMP, which we used to realize DC and DE recording, shows that our approach is 2-5x more efficient than traditional approaches that synchronize on every shared-memory access. Furthermore, we demonstrate that our approach can be easily combined with MPI-level replay tools to replay non-trivial MPI+OpenMP applications. We achieve this by integrating ReOMP into ReMPI, an existing scalable MPI record-and-replay tool, with only a small MPI-scale-independent runtime overhead.

Original language: English
Title of host publication: Proceedings - 2024 IEEE International Conference on Cluster Computing, CLUSTER 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 27-38
Number of pages: 12
ISBN (Electronic): 9798350358711
DOIs
State: Published - 2024
Event: 2024 IEEE International Conference on Cluster Computing, CLUSTER 2024 - Kobe, Japan
Duration: 24 Sep 2024 → 27 Sep 2024

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
ISSN (Print): 1552-5244

Conference

Conference: 2024 IEEE International Conference on Cluster Computing, CLUSTER 2024
Country/Territory: Japan
City: Kobe
Period: 24/09/24 → 27/09/24

Keywords

  • Non-determinism
  • OpenMP
  • Record-and-Replay
  • Reproducibility
