Big data generation

Tilmann Rabl, Hans Arno Jacobsen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Big data challenges are end-to-end problems. When handling big data, it usually has to be preprocessed, moved, loaded, processed, and stored many times. This has led to the creation of big data pipelines. Current benchmarks related to big data focus only on isolated aspects of this pipeline, usually the processing, storage, and loading aspects. To date, no benchmark has been presented that covers the end-to-end aspect of big data systems. In this paper, we discuss the necessity of ETL-like tasks in big data benchmarking and propose the Parallel Data Generation Framework (PDGF) for the required data generation. PDGF is a generic data generator that was implemented at the University of Passau and is currently adopted in TPC benchmarks.
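The abstract does not describe PDGF's internals, but published accounts of PDGF center on deterministic, seed-based value generation: every field value is derived from a seed computed from its coordinates (table, row, column), so independent workers can generate disjoint partitions in parallel, with repeatable results and no coordination. The Java sketch below illustrates that general idea only; the class and method names are hypothetical and do not reflect PDGF's actual API.

import java.util.Random;

// Minimal sketch (hypothetical API, not PDGF's): each cell value is a pure
// function of a global seed and the cell's coordinates, so any worker can
// regenerate any part of the data independently and deterministically.
public class SeededGeneratorSketch {

    private static final long GLOBAL_SEED = 42L; // assumed fixed benchmark seed

    // Derive a per-cell seed deterministically from (table, row, column).
    static long cellSeed(int table, long row, int column) {
        long h = GLOBAL_SEED;
        h = h * 31 + table;
        h = h * 31 + row;
        h = h * 31 + column;
        return h;
    }

    // Generate one integer field; identical inputs always yield identical output.
    static int genIntField(int table, long row, int column, int bound) {
        return new Random(cellSeed(table, row, column)).nextInt(bound);
    }

    public static void main(String[] args) {
        // A worker generating rows 0..3 of table 0, column 0; another worker
        // could generate rows 4..7 concurrently with no shared state.
        for (long row = 0; row < 4; row++) {
            System.out.printf("row %d -> %d%n", row, genIntField(0, row, 0, 1000));
        }
    }
}

Because values depend only on coordinates, scaling out generation is trivial: partition the row range across workers, and re-running any partition reproduces exactly the same data.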

Original language: English
Title of host publication: Specifying Big Data Benchmarks - First Workshop, WBDB 2012, and Second Workshop, WBDB 2012, Revised Selected Papers
Publisher: Springer Verlag
Pages: 20-27
Number of pages: 8
ISBN (Print): 9783642539732
DOIs
State: Published - 2014
Externally published: Yes
Event: 2nd Workshop on Specifying Big Data Benchmarks, WBDB 2012 - Pune, India
Duration: 17 Dec 2012 - 18 Dec 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8163 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2nd Workshop on Specifying Big Data Benchmarks, WBDB 2012
Country/Territory: India
City: Pune
Period: 17/12/12 - 18/12/12
