Performance Evaluation and Optimization of Multi-Dimensional Indexes in Hive

Yue Liu, Shuai Guo, Songlin Hu, Tilmann Rabl, Hans Arno Jacobsen, Jintao Li, Jiye Wang

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Apache Hive is widely used by many companies for big data processing on large-scale clusters. It provides a declarative query language called HiveQL. The efficiency with which query-irrelevant data in HDFS is filtered out strongly affects query processing performance. This is especially true for multi-dimensional, highly selective queries involving few columns, which provide sufficient information to reduce the number of bytes read. Indexing (Compact Index, Aggregate Index, Bitmap Index, DGFIndex, and the index in ORC files) and columnar storage (RCFile, ORC file, and Parquet) are powerful techniques for achieving this. However, choosing a suitable index and columnar storage format based on data and query features is not trivial. In this paper, we compare the data filtering performance of the above indexes with different columnar storage formats by conducting comprehensive experiments using uniform and skewed TPC-H data sets and various multi-dimensional queries, and we suggest best practices for improving multi-dimensional queries in Hive under different conditions.

Original language: English
Article number: 7523210
Pages (from-to): 835-849
Number of pages: 15
Journal: IEEE Transactions on Services Computing
Volume: 11
Issue number: 5
State: Published - 1 Sep 2018
Externally published: Yes

Keywords

  • Hadoop
  • Hive
  • multi-dimensional index
  • performance evaluation
