TY - GEN
T1 - Runtime MPI collective checking with tree-based overlay networks
AU - Hilbrich, Tobias
AU - Hänsel, Fabian
AU - Schulz, Martin
AU - De Supinski, Bronis R.
AU - Müller, Matthias S.
AU - Nagel, Wolfgang E.
PY - 2013
Y1 - 2013
N2 - Runtime error detection tools detect many classes of MPI usage errors, including errors in collective communication calls. However, they often face scalability challenges. We present runtime checks for MPI collective operations that use a Tree-Based Overlay Network (TBON) for scalability and that provide full datatype matching. While we can use transitive correctness properties for most checks, some collective operations impose non-transitive correctness properties, e.g., MPI_Alltoallv, where we use intralayer communication within the TBON to distribute datatype matching information. An overhead study with stress tests and two benchmark suites demonstrates applicability and scalability at 4,096, 2,048, and 16,384 processes, respectively.
AB - Runtime error detection tools detect many classes of MPI usage errors, including errors in collective communication calls. However, they often face scalability challenges. We present runtime checks for MPI collective operations that use a Tree-Based Overlay Network (TBON) for scalability and that provide full datatype matching. While we can use transitive correctness properties for most checks, some collective operations impose non-transitive correctness properties, e.g., MPI_Alltoallv, where we use intralayer communication within the TBON to distribute datatype matching information. An overhead study with stress tests and two benchmark suites demonstrates applicability and scalability at 4,096, 2,048, and 16,384 processes, respectively.
KW - Correctness
KW - MPI collectives
KW - Tree-based overlay networks
UR - http://www.scopus.com/inward/record.url?scp=84886250838&partnerID=8YFLogxK
U2 - 10.1145/2488551.2488570
DO - 10.1145/2488551.2488570
M3 - Conference contribution
AN - SCOPUS:84886250838
SN - 9788461651337
T3 - ACM International Conference Proceeding Series
SP - 129
EP - 134
BT - Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI 2013
PB - Association for Computing Machinery
T2 - 20th European MPI Users' Group Meeting, EuroMPI 2013
Y2 - 15 September 2013 through 18 September 2013
ER -