TY - GEN
T1 - Analysis of TPC-DS - The first standard benchmark for SQL-based big data systems
AU - Poess, Meikel
AU - Rabl, Tilmann
AU - Jacobsen, Hans-Arno
PY - 2017/9/24
Y1 - 2017/9/24
N2 - The advent of Web 2.0 companies, such as Facebook, Google, and Amazon, with their insatiable appetite for vast amounts of structured, semi-structured, and unstructured data, triggered the development of Hadoop and related tools, e.g., YARN, MapReduce, and Pig, as well as NoSQL databases. These tools form an open source software stack to support the processing of large and diverse data sets on clustered systems to perform decision support tasks. Recently, SQL has been resurging in many of these solutions, e.g., Hive, Stinger, Impala, Shark, and Presto. At the same time, RDBMS vendors are adding Hadoop support to their SQL engines, e.g., IBM's Big SQL, Actian's Vortex, Oracle's Big Data SQL, and SAP's HANA. Because there was no industry standard benchmark that could measure the performance of SQL-based big data solutions, marketing claims were mostly based on "cherry-picked" subsets of the TPC-DS benchmark that suited individual companies' strengths while hiding their weaknesses. In this paper, we present and analyze our work on modifying TPC-DS to fill the void for an industry standard benchmark that is able to measure the performance of SQL-based big data solutions. The new benchmark was ratified by the TPC in early 2016. To show the significance of the new benchmark, we analyze performance data obtained on four different systems running big data, traditional RDBMS, and columnar in-memory architectures.
AB - The advent of Web 2.0 companies, such as Facebook, Google, and Amazon, with their insatiable appetite for vast amounts of structured, semi-structured, and unstructured data, triggered the development of Hadoop and related tools, e.g., YARN, MapReduce, and Pig, as well as NoSQL databases. These tools form an open source software stack to support the processing of large and diverse data sets on clustered systems to perform decision support tasks. Recently, SQL has been resurging in many of these solutions, e.g., Hive, Stinger, Impala, Shark, and Presto. At the same time, RDBMS vendors are adding Hadoop support to their SQL engines, e.g., IBM's Big SQL, Actian's Vortex, Oracle's Big Data SQL, and SAP's HANA. Because there was no industry standard benchmark that could measure the performance of SQL-based big data solutions, marketing claims were mostly based on "cherry-picked" subsets of the TPC-DS benchmark that suited individual companies' strengths while hiding their weaknesses. In this paper, we present and analyze our work on modifying TPC-DS to fill the void for an industry standard benchmark that is able to measure the performance of SQL-based big data solutions. The new benchmark was ratified by the TPC in early 2016. To show the significance of the new benchmark, we analyze performance data obtained on four different systems running big data, traditional RDBMS, and columnar in-memory architectures.
KW - Benchmark
KW - Big data
KW - TPC-DS
UR - http://www.scopus.com/inward/record.url?scp=85032437411&partnerID=8YFLogxK
U2 - 10.1145/3127479.3128603
DO - 10.1145/3127479.3128603
M3 - Conference contribution
AN - SCOPUS:85032437411
T3 - SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing
SP - 573
EP - 585
BT - SoCC 2017 - Proceedings of the 2017 Symposium on Cloud Computing
PB - Association for Computing Machinery, Inc
T2 - 2017 Symposium on Cloud Computing, SoCC 2017
Y2 - 24 September 2017 through 27 September 2017
ER -