TY - JOUR
T1 - DynaCo
T2 - Dynamic Coherence Management for Tiled Manycore Architectures
AU - Srivatsa, Akshay
AU - Mansour, Mostafa
AU - Rheindt, Sven
AU - Gabriel, Dirk
AU - Wild, Thomas
AU - Herkersdorf, Andreas
N1 - Publisher Copyright: © 2021, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2021/8
Y1 - 2021/8
AB - Embedded system applications, with their inherently limited parallelism, rarely exploit all available processing resources in large DSM-based manycore architectures. From a cache coherence perspective, this provides an opportunity to move away from global coherence spanning across all tiles, which does not scale well. Therefore, we favor a region-based cache coherence (RBCC) approach that enables coherence among a selectable cluster of tiles in accordance with application requirements. We present the design and hardware implementation of a flexibly configurable coherency region manager (CRM) that enables RBCC. We introduce two novel features that enhance RBCC, namely, runtime coherency region re-configuration and RBCC-malloc(), that dynamically tailor coherence to actually shared application working sets. Further, we propose, implement and evaluate additional CRM functions such as a non-intrusive barrier synchronization mechanism and a false sharing resolution strategy for our DSM-based manycore architecture. We have synthesized the CRM on an FPGA prototype for a 64-core system and observe a 38% reduction in BRAM-utilization compared to a global coherence directory for regions with up to 32 cores. Experiments using a video streaming application reveal a speed-up of up to 42% compared to an alternative message passing based implementation. We also evaluate the benefits of runtime coherency region re-configuration using two scenarios and present a formal analysis on when a re-configuration is beneficial.
KW - Coherence barrier
KW - DSM systems
KW - Dynamic on-demand coherence
KW - False sharing
KW - Runtime re-configuration
KW - Scalable coherence
UR - http://www.scopus.com/inward/record.url?scp=85098584510&partnerID=8YFLogxK
U2 - 10.1007/s10766-020-00688-6
DO - 10.1007/s10766-020-00688-6
M3 - Article
AN - SCOPUS:85098584510
SN - 0885-7458
VL - 49
SP - 570
EP - 599
JO - International Journal of Parallel Programming
JF - International Journal of Parallel Programming
IS - 4
ER -