TY - GEN
T1 - NEMESYS: NEar-Memory Graph Copy Enhanced SYstem-Software
T2 - 2019 International Symposium on Memory Systems, MEMSYS 2019
AU - Rheindt, Sven
AU - Fried, Andreas
AU - Lenke, Oliver
AU - Nolte, Lars
AU - Wild, Thomas
AU - Herkersdorf, Andreas
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/9/30
Y1 - 2019/9/30
AB - Although the memory and power walls have been tackled over the last decades, new challenges for manycore architectures have arisen from the ever-increasing memory intensity of applications with big, irregular, and cache-unfriendly data sets. As data-to-task locality is of key importance for system performance, the MEMSYS 2017 keynote speaker Peter Kogge presented evidence for the so-called “locality wall”, which paved the way toward near- and in-memory computing. Reducing data movement is especially challenging on tile-based architectures with physically distributed memory, as they often omit inter-tile cache coherence and thus require a different programming model (e.g., PGAS). In the PGAS paradigm, inter-tile communication takes place via a remote procedure call (RPC)-like programming-language construct. More modern PGAS languages are object-oriented, so the RPC mechanism requires object graphs to be copied between tiles. It is the system-software’s job to provide an efficient implementation of this mechanism, since the transfer of such object graphs is crucial for the performance of object-oriented applications on PGAS architectures. We therefore propose NEMESYS: NEar-Memory Graph Copy Enhanced SYstem-Software, which offloads the memory-intensive and cache-unfriendly graph copy operation to near-memory hardware accelerators. As an efficient implementation of the PGAS RPC, NEMESYS integrates these near-memory accelerators into the system-software, transparently to the application programmer. We integrated NEMESYS into an FPGA prototype and a distributed operating system running on a 4x4-tile design with a total of 56 application cores and two memory tiles. An evaluation with the X10 IMSuite benchmarks, which feature distributed graph algorithm kernels, showed speedups in execution time between 1.35x and 3.85x over a state-of-the-art approach, with an overall reduction in communication time between 40% and 82%.
KW - Data-to-Task Locality
KW - Graph Copy Accelerator
KW - Near-Memory Computing
KW - PGAS
KW - System-Software
UR - http://www.scopus.com/inward/record.url?scp=85075895443&partnerID=8YFLogxK
DO - 10.1145/3357526.3357545
M3 - Conference contribution
AN - SCOPUS:85075895443
T3 - ACM International Conference Proceeding Series
SP - 3
EP - 18
BT - MEMSYS 2019 - Proceedings of the International Symposium on Memory Systems
PB - Association for Computing Machinery
Y2 - 30 September 2019 through 3 October 2019
ER -