TY - GEN
T1 - Workload-aware data partitioning in community-driven data grids
AU - Scholl, Tobias
AU - Bauer, Bernhard
AU - Müller, Jessica
AU - Gufler, Benjamin
AU - Reiser, Angelika
AU - Kemper, Alfons
PY - 2009
Y1 - 2009
N2 - Collaborative research in various scientific disciplines requires support for scalable data management enabling the efficient correlation of globally distributed data sources. Motivated by the expected data rates of upcoming projects and a growing number of users, communities explore new data management techniques for achieving high throughput. Community-driven data grids deliver such high-throughput data distribution for scientific federations by partitioning data according to application-specific data and query characteristics. Query hot spots are an important and challenging problem in this environment. Existing approaches to load-balancing from Peer-to-Peer (P2P) data management and sensor networks do not directly meet the requirements of a data-intensive e-science environment. In this paper, our contributions are partitioning schemes based on multi-dimensional index structures enabling communities to trade off data load balancing and handling query hot spots via splitting and replication. We evaluate the partitioning schemes with two typical kinds of data sets from the astrophysics domain and workloads extracted from Sloan Digital Sky Survey (SDSS) query traces and perform throughput measurements in real and simulated networks. The experiments demonstrate the improved workload distribution capabilities and give promising directions for the development of future community grids.
AB - Collaborative research in various scientific disciplines requires support for scalable data management enabling the efficient correlation of globally distributed data sources. Motivated by the expected data rates of upcoming projects and a growing number of users, communities explore new data management techniques for achieving high throughput. Community-driven data grids deliver such high-throughput data distribution for scientific federations by partitioning data according to application-specific data and query characteristics. Query hot spots are an important and challenging problem in this environment. Existing approaches to load-balancing from Peer-to-Peer (P2P) data management and sensor networks do not directly meet the requirements of a data-intensive e-science environment. In this paper, our contributions are partitioning schemes based on multi-dimensional index structures enabling communities to trade off data load balancing and handling query hot spots via splitting and replication. We evaluate the partitioning schemes with two typical kinds of data sets from the astrophysics domain and workloads extracted from Sloan Digital Sky Survey (SDSS) query traces and perform throughput measurements in real and simulated networks. The experiments demonstrate the improved workload distribution capabilities and give promising directions for the development of future community grids.
UR - http://www.scopus.com/inward/record.url?scp=70349139139&partnerID=8YFLogxK
U2 - 10.1145/1516360.1516366
DO - 10.1145/1516360.1516366
M3 - Conference contribution
AN - SCOPUS:70349139139
SN - 9781605584225
T3 - Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09
SP - 36
EP - 47
BT - Proceedings of the 12th International Conference on Extending Database Technology
T2 - 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09
Y2 - 24 March 2009 through 26 March 2009
ER -