TY - JOUR
T1 - Performance Evaluation and Optimization of Multi-Dimensional Indexes in Hive
AU - Liu, Yue
AU - Guo, Shuai
AU - Hu, Songlin
AU - Rabl, Tilmann
AU - Jacobsen, Hans Arno
AU - Li, Jintao
AU - Wang, Jiye
N1 - Publisher Copyright: © 2008-2012 IEEE.
PY - 2018/9/1
Y1 - 2018/9/1
N2 - Apache Hive has been widely used by many companies for big data processing over large-scale clusters. It provides a declarative query language called HiveQL. The efficiency of filtering out query-irrelevant data from HDFS closely affects the performance of query processing. This is especially true for multi-dimensional, highly selective queries that involve few columns, which provide sufficient information to reduce the number of bytes read. Indexing (Compact Index, Aggregate Index, Bitmap Index, DGFIndex, and the index in ORC file) and columnar storage (RCFile, ORC file, and Parquet) are powerful techniques to achieve this. However, it is not trivial to choose a suitable index and columnar storage format based on data and query features. In this paper, we compare the data filtering performance of the above indexes with different columnar storage formats by conducting comprehensive experiments using uniform and skewed TPC-H data sets and various multi-dimensional queries, and suggest best practices for improving multi-dimensional queries in Hive under different conditions.
AB - Apache Hive has been widely used by many companies for big data processing over large-scale clusters. It provides a declarative query language called HiveQL. The efficiency of filtering out query-irrelevant data from HDFS closely affects the performance of query processing. This is especially true for multi-dimensional, highly selective queries that involve few columns, which provide sufficient information to reduce the number of bytes read. Indexing (Compact Index, Aggregate Index, Bitmap Index, DGFIndex, and the index in ORC file) and columnar storage (RCFile, ORC file, and Parquet) are powerful techniques to achieve this. However, it is not trivial to choose a suitable index and columnar storage format based on data and query features. In this paper, we compare the data filtering performance of the above indexes with different columnar storage formats by conducting comprehensive experiments using uniform and skewed TPC-H data sets and various multi-dimensional queries, and suggest best practices for improving multi-dimensional queries in Hive under different conditions.
KW - Hadoop
KW - Hive
KW - multi-dimensional index
KW - performance evaluation
UR - http://www.scopus.com/inward/record.url?scp=85054850868&partnerID=8YFLogxK
U2 - 10.1109/TSC.2016.2594778
DO - 10.1109/TSC.2016.2594778
M3 - Article
AN - SCOPUS:85054850868
SN - 1939-1374
VL - 11
SP - 835
EP - 849
JO - IEEE Transactions on Services Computing
JF - IEEE Transactions on Services Computing
IS - 5
M1 - 7523210
ER -