Apache Hive is an SQL-like software used with Hadoop to give users the capability of performing SQL-like queries on its language, HiveQL, quickly and efficiently. It also offers users additional query and analytical abilities, which are not available on traditional SQL structures.
With Apache Hive, users can use HiveQL or traditional MapReduce systems, depending on individual needs and preferences. Hive is particularly ideal for analyzing large datasets (petabytes) and includes various storage options.
Hive is full of unique tools that allow users to quickly and efficiently perform data queries and analysis. In order to make full use of all these tools, users need to use best practices for Hive implementation. If you’re wondering how to scale Apache Hive, here are ten ways to make the most of Hive performance.
- Partitioning Tables:
Hive partitioning is an effective method to improve the query performance on larger tables. Partitioning allows you to store data in separate sub-directories under table location. It dramatically helps the queries which are queried upon the partition key(s). Although the selection of partition key is always a prudent decision, it should always be a low cardinal attribute. For example, if your data is associated with the time dimension, then the date could be a good partition key. Similarly, if data is associated with location, like a country or state, it’s a good idea to have hierarchical partitions like country/state.
- De-normalizing data:
Normalization is a standard process used to model your data tables with certain rules to deal with a redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at the run time to produce the results. Joins are expensive and complicated operations to perform and are common reasons for performance issues. Because of that, it’s a good idea to avoid highly normalized table structures because they require to join queries to derive the desired metrics.
- Compress map/reduce output:
Compression techniques significantly reduce the intermediate data volume, which internally minimizes the amount of data transfers between mappers and reducers. All this generally occurs over the network. Compression can be applied to the mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable. That means this should be applied with caution. Compressed file size should not be larger than a few hundred megabytes. Otherwise, it can potentially lead to an imbalanced job. Other options of compression codec could be snappy, lzo, bzip, etc.
- Map join:
Map joins are efficient if a table on the other side of a join is small enough to fit in the memory. Hive supports a parameter, hive.auto.convert.join, which suggests that Hive tries to map join automatically when it’s set to “true.” When using this parameter, be sure the auto-convert is enabled in the Hive environment.
Additionally, it’s essential to ensure the bucketing flag is set (SET hive.enforce.bucketing=true;) every time before writing data to the bucketed table. To leverage the bucketing in the join operation, we should SET hive.optimize.bucketmapjoin=true. This setting hints to Hive to do bucket level join during the map stage join. It also reduces the scan cycles to find a particular key because bucketing ensures that the key is present in a specific bucket.
- Input Format Selection:
Input formats play a critical role in Hive performance. For example, JSON, the text type of input format, is not the right choice for an extensive production system where data volume is high. These types of readable formats take a lot of space and have some parsing overhead (e.g. JSON parsing ). To address these problems, Hive comes with columnar input formats like RCFile, ORC, etc. Columnar formats allow you to reduce the read operations in analytics queries by allowing each column to be accessed individually. There are some other binary formats like Avro, sequence files, Thrift, and ProtoBuf, which can help in various use cases.
- Parallel execution:
Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically use this parallelism. However, single, complex Hive queries commonly are translated to several MapReduce jobs that are executed by default sequencing. Some of a query’s MapReduce stages are often not interdependent and could be executed in parallel. They then can take advantage of spare capacity on a cluster and improve cluster utilization while at the same time reducing the overall query execution time. Hive’s configuration to change this behavior is merely switching a single flag SET hive.exec.parallel=true.
Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. Each batch consists of a column vector which is usually an array of primitive types. Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage. To enable vectorization, set this configuration parameter SET hive.vectorized.execution.enabled=true.
- Unit Testing:
Merely speaking, unit testing determines whether the smallest testable piece of your code works exactly as you expect. Unit testing gives a couple of benefits, i.e., detecting problems early, making it easier to change and refactor code, being a form of documentation that explains how code works, to name a few.
In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries, and more. To a large extent, it is possible to verify your whole HiveQL query’s correctness by running quick local unit tests without even touching a Hadoop cluster. Because executing HiveQL query in the local mode takes literally seconds, compared to minutes, hours or days if it runs in the Hadoop mode, it certainly saves enormous amounts of development time.
Sampling allows users to take a subset of datasets and analyze it without analyzing the entire data set. If a representative sample is used, then a query can return meaningful results and finish quicker and consume fewer compute resources.Hive offers a built-in TABLESAMPLE clause that allows you to sample your tables. TABLESAMPLE can sample at various granularity levels – it can return only subsets of buckets (bucket sampling), or HDFS blocks (block sampling), or only first N records from each input split. Alternatively, you can implement your own UDF that filters out records according to your sampling algorithm.
For more tips on how to perform efficient Hive queries, see this blog post.