TY - GEN
T1 - Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
AU - Heinecke, Alexander
AU - Bader, Michael
PY - 2008
Y1 - 2008
N2 - We present a parallel implementation of a cache oblivious algorithm for matrix multiplication on multicore platforms. The algorithm is based on a storage scheme and a block-recursive approach for multiplication, which are both based on a Peano space-filling curve. The recursion is stopped on matrix blocks with a size that needs to perfectly match the size of the Ll cache of the underlying CPU. The respective block multiplications are implemented by multiplication kernels that are hand-optimised for the SIMD units of current x86 CPUs. The Peano storage scheme is used to partition the block multiplications to different cores. Performance tests on various multicore platforms with up to 16 cores and different memory architecture show that the resulting implementation leads to better parallel scalability than achieved by Intel's MKL or GotoBLAS, and can outperform both libraries in terms of absolute performance on eight or more cores.
AB - We present a parallel implementation of a cache oblivious algorithm for matrix multiplication on multicore platforms. The algorithm is based on a storage scheme and a block-recursive approach for multiplication, which are both based on a Peano space-filling curve. The recursion is stopped on matrix blocks with a size that needs to perfectly match the size of the Ll cache of the underlying CPU. The respective block multiplications are implemented by multiplication kernels that are hand-optimised for the SIMD units of current x86 CPUs. The Peano storage scheme is used to partition the block multiplications to different cores. Performance tests on various multicore platforms with up to 16 cores and different memory architecture show that the resulting implementation leads to better parallel scalability than achieved by Intel's MKL or GotoBLAS, and can outperform both libraries in terms of absolute performance on eight or more cores.
KW - Cache oblivious algorithms
KW - Matrix multiplication
KW - Multicore
KW - Parallelisation
KW - Space-filling curves
UR - http://www.scopus.com/inward/record.url?scp=56849103298&partnerID=8YFLogxK
U2 - 10.1145/1366219.1366223
DO - 10.1145/1366219.1366223
M3 - Conference contribution
AN - SCOPUS:56849103298
SN - 9781605580913
T3 - Conference on Computing Frontiers - Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem MAW'08
SP - 385
EP - 392
BT - Conference on Computing Frontiers - Proceedings of the 2008 Workshop on Memory Access on Future Processors
T2 - 2008 Workshop on Memory Access on Future Processors: A Solved Problem MAW'08
Y2 - 5 May 2008 through 7 May 2008
ER -