TY - GEN
T1 - Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture
AU - Heinecke, Alexander
AU - Trinitis, Carsten
AU - Weidendorfer, Josef
PY - 2010
Y1 - 2010
N2 - Cache-obliviousness represents an important but relatively new concept for cache optimization. As cache-oblivious algorithms perform well on architectures with arbitrary cache configurations, the programming effort required for porting and optimizing for future architectures can be significantly reduced. In [8] and [9], fast parallel cache-oblivious linear algebra modules have been presented. The underlying matrix storage schemes are based on space-filling curves. For matrix multiplication, all cache misses can be avoided, whereas for the LU decomposition algorithm the number of cache misses is minimized. It has been shown that the resulting codes work very well on several kinds of systems ranging from laptops to supercomputers. In this paper, we will show that the runtime characteristics of our existing cache-oblivious codes can be preserved on newer Intel processors. Special emphasis is put on the first many-core processor architecture with complete hardware-based cache coherency: the Larrabee architecture. As the latter is expected to be available as a PCIe card connected to the host system, porting had to take into account transfer of data structures between different memory address spaces. Unfortunately, Larrabee was canceled as a graphics device for 2010, but Intel is expected to outline further steps about Larrabee during 2010.
AB - Cache-obliviousness represents an important but relatively new concept for cache optimization. As cache-oblivious algorithms perform well on architectures with arbitrary cache configurations, the programming effort required for porting and optimizing for future architectures can be significantly reduced. In [8] and [9], fast parallel cache-oblivious linear algebra modules have been presented. The underlying matrix storage schemes are based on space-filling curves. For matrix multiplication, all cache misses can be avoided, whereas for the LU decomposition algorithm the number of cache misses is minimized. It has been shown that the resulting codes work very well on several kinds of systems ranging from laptops to supercomputers. In this paper, we will show that the runtime characteristics of our existing cache-oblivious codes can be preserved on newer Intel processors. Special emphasis is put on the first many-core processor architecture with complete hardware-based cache coherency: the Larrabee architecture. As the latter is expected to be available as a PCIe card connected to the host system, porting had to take into account transfer of data structures between different memory address spaces. Unfortunately, Larrabee was canceled as a graphics device for 2010, but Intel is expected to outline further steps about Larrabee during 2010.
KW - accelerator
KW - space-filling curve
KW - cache-oblivious
KW - LU decomposition
KW - manycore
KW - matrix multiplication
KW - OpenMP
UR - http://www.scopus.com/inward/record.url?scp=77954471964&partnerID=8YFLogxK
U2 - 10.1145/1787275.1787298
DO - 10.1145/1787275.1787298
M3 - Conference contribution
AN - SCOPUS:77954471964
SN - 9781450300445
T3 - CF 2010 - Proceedings of the 2010 Computing Frontiers Conference
SP - 91
EP - 92
BT - CF 2010 - Proceedings of the 2010 Computing Frontiers Conference
T2 - 7th ACM International Conference on Computing Frontiers, CF'10
Y2 - 17 May 2010 through 19 May 2010
ER -