TY - GEN
T1 - Exploiting the locality properties of Peano curves for parallel matrix multiplication
AU - Bader, Michael
PY - 2008
Y1 - 2008
N2 - The present work studies an approach to exploit the locality properties of an inherently cache-efficient algorithm for matrix multiplication in a parallel implementation. The algorithm is based on a blockwise element layout and an execution order that are derived from a Peano space-filling curve. The strong locality properties induced in the resulting algorithm motivate a parallel algorithm that replicates matrix blocks in local caches that will prefetch remote blocks before they are used. As a consequence, the block size for matrix multiplication and the cache sizes, and hence the granularity of communication, can be chosen independently. The influence of these parameters on parallel efficiency is studied on a compute cluster with 128 processors. Performance studies show that the largest influence on performance stems from the size of the local caches, which makes the algorithm an interesting option for all situations where memory is scarce, or where existing cache hierarchies can be exploited (as in future manycore environments, e.g.).
AB - The present work studies an approach to exploit the locality properties of an inherently cache-efficient algorithm for matrix multiplication in a parallel implementation. The algorithm is based on a blockwise element layout and an execution order that are derived from a Peano space-filling curve. The strong locality properties induced in the resulting algorithm motivate a parallel algorithm that replicates matrix blocks in local caches that will prefetch remote blocks before they are used. As a consequence, the block size for matrix multiplication and the cache sizes, and hence the granularity of communication, can be chosen independently. The influence of these parameters on parallel efficiency is studied on a compute cluster with 128 processors. Performance studies show that the largest influence on performance stems from the size of the local caches, which makes the algorithm an interesting option for all situations where memory is scarce, or where existing cache hierarchies can be exploited (as in future manycore environments, e.g.).
UR - http://www.scopus.com/inward/record.url?scp=51849136508&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-85451-7_85
DO - 10.1007/978-3-540-85451-7_85
M3 - Conference contribution
AN - SCOPUS:51849136508
SN - 3540854509
SN - 9783540854500
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 801
EP - 810
BT - Euro-Par 2008 Parallel Processing - 14th International Euro-Par Conference, Proceedings
T2 - 14th International Euro-Par Conference, Euro-Par 2008
Y2 - 26 August 2008 through 29 August 2008
ER -