Qubole strives to provide its customers with the best of both worlds: new enhancements and features in open source software releases, and the latest value-added capabilities in the Qubole data platform. So we are pleased to announce that Qubole is the first and only vendor to deliver Hive 3.1.1 in the cloud.
The Big Deal: Performance Improvements
Hive 3.1.1 delivers significant performance improvements over Hive 2.3.4. We compared the two versions using the TPC-DS benchmark at scale factor 1000, with data stored on Amazon S3.
All tests ran on a 5-node r3.4xlarge cluster. Our labs ran every supported query in the benchmark and observed that the overall TPC-DS workload ran 1.8 times faster on Hive 3.1.1 than on Hive 2.3.4, as presented in the chart below.
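To put the 1.8x figure in concrete terms: a workload-level speedup is the ratio of total runtimes, so "1.8 times faster" means the same queries finish in roughly 56% of the original time. The short Python sketch below shows that arithmetic. The function names and the per-query runtimes are hypothetical placeholders for illustration, not measurements from this benchmark.

```python
def overall_speedup(baseline_secs, candidate_secs):
    """Workload speedup = total baseline runtime / total candidate runtime."""
    if len(baseline_secs) != len(candidate_secs):
        raise ValueError("both runs must cover the same set of queries")
    return sum(baseline_secs) / sum(candidate_secs)


def runtime_fraction(speedup):
    """Fraction of the baseline runtime that remains after a given speedup."""
    return 1.0 / speedup


# Hypothetical per-query runtimes in seconds (illustrative only).
hive2 = [120.0, 300.0, 480.0]
hive3 = [80.0, 150.0, 270.0]

print(f"overall speedup: {overall_speedup(hive2, hive3):.2f}x")
print(f"runtime at 1.8x: {runtime_fraction(1.8):.0%} of baseline")
```

Note that the ratio is taken over total runtime, so long-running queries weigh more heavily than short ones; individual queries may speed up by more or less than the overall figure.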
In a recently updated blog post, we demonstrated how autoscaling, a core capability of Qubole, delivers significant cost savings for Qubole customers. We have also recently introduced reducer-based autoscaling for Hadoop MRv2 clusters, along with other ways to autoscale (more details are available here).
Direct Writes in Cloud Storage
Hive in Qubole Data Service (QDS) improves query performance by writing directly to cloud storage, which eliminates the data copy steps that open-source Hive requires.
Amazon S3 Optimizations
Hive in QDS comes with specialized performance optimizations for Amazon S3. These include faster split computation in Hive, achieved by optimizing the use of Amazon S3 bulk listing APIs (more details can be found here), and a refined implementation of the open-source S3 read logic.
A top priority for Qubole is to continuously improve enterprise security and governance in the cloud. Hive in QDS integrates cloud storage authorization with Hive authorization through a new security model (more details are available here).
Apache Ranger Support in Hive
Qubole provides support for Apache Ranger with Hive to deliver fine-grained data access control, including row-level filtering and column-level masking (further details are available here).
Automatic Statistics Collection
Qubole offers a managed Automatic Statistics Collection service, which keeps table statistics fresh so that query planning and execution can use them for better performance (more details can be found here).
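For context, the statistics involved are the same kind that open-source Hive gathers with `ANALYZE TABLE` statements. The Python sketch below only builds those statements as strings, to show what a manual refresh looks like. The helper name and the table name are hypothetical, and this is not a description of how Qubole's managed service is implemented.

```python
def analyze_statements(table, partition_spec=None, columns=True):
    """Build HiveQL ANALYZE statements that refresh table and column statistics."""
    target = table
    if partition_spec:
        # e.g. {"ds": "2019-06-01"} -> PARTITION (ds='2019-06-01')
        parts = ", ".join(f"{k}='{v}'" for k, v in partition_spec.items())
        target = f"{table} PARTITION ({parts})"
    stmts = [f"ANALYZE TABLE {target} COMPUTE STATISTICS"]
    if columns:
        # Column-level statistics feed the cost-based optimizer.
        stmts.append(f"ANALYZE TABLE {target} COMPUTE STATISTICS FOR COLUMNS")
    return stmts


# "store_sales" is a placeholder table name.
for stmt in analyze_statements("store_sales"):
    print(stmt + ";")
```

Running statements like these by hand after every load is easy to forget, which is the gap a managed collection service is meant to close.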
Qubole provides a highly available, industry-leading solution for running HiveServer2 at scale. It eliminates the need for complex memory configurations on a single master node, resulting in much greater stability, scalability, and performance when running enterprise workloads (more details can be found here).
Getting Started with Hive 3.1.1 on Qubole
You can launch a cluster with Hive 3.1.1 just like any other supported Hive version in Qubole.
Please contact Qubole Support if you do not have a 3.1.1 cluster available yet in your Cluster Hive Version options. This UI change is being rolled out in phases over the next few weeks (see image below).
Refer to the Hive 3.1.1 prerequisites in the corresponding product documentation page.
Upgrading Your Existing Hive Metastore
To use Hive 3.1.1, you must upgrade your metastore.
- If you are using a QDS-managed metastore, Qubole can upgrade the metastore for you while ensuring backward compatibility (lower versions of Hive or HMS can still talk to the Hive 3.1.1 metastore). Please raise a ticket with Qubole Support to request an upgrade to a Hive 3.1.1 metastore.
- If you are using a self-managed Hive metastore, note that the open-source Apache Hive 3.1.1 metastore is incompatible with previous versions of Hive. Qubole has contributed fixes for this issue to open-source Hive (see HIVE-21739 and HIVE-21821); they will be included in Apache Hive 3.1.2.
Please contact Qubole Support for a script that you can run on the upgraded metastore to make it backward compatible.
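For a self-managed metastore, the schema upgrade itself is commonly performed with Apache Hive's `schematool` utility. The Python sketch below only assembles a command line for it and does not run anything; `-dbType`, `-upgradeSchema`, and `-dryRun` are `schematool` options, while the helper name and the `mysql` database type are assumptions for illustration. It does not replace the backward-compatibility script mentioned above.

```python
def schematool_command(db_type, dry_run=True):
    """Build an argv list for a Hive schematool metastore schema upgrade."""
    cmd = ["schematool", "-dbType", db_type, "-upgradeSchema"]
    if dry_run:
        # -dryRun lists the upgrade scripts without executing them.
        cmd.append("-dryRun")
    return cmd


# "mysql" is an assumed metastore database type; substitute your own.
print(" ".join(schematool_command("mysql")))
```

Defaulting to a dry run is a deliberate choice here: a metastore upgrade is hard to undo, so it is worth reviewing the planned scripts and taking a database backup before running the real upgrade.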
Unsupported and Deprecated Features
The following features provided by Qubole in earlier Hive versions are either unsupported or deprecated in Hive 3.1.1:
- Qubole Table copy: Unsupported
- Qubole TMP Tables: Unsupported
- AWS Glue: Unsupported (but planned in the near future)
- OpenX JSON SerDe: Unsupported
- MapReduce: Deprecated. Tez is the default execution engine
- Running Hive queries on the master node without HS2 (Hive on Master): Unsupported
- Running Hive queries on QDS hosted servers: Unsupported
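Because MapReduce is deprecated and Tez is the default, existing scripts that explicitly pin `hive.execution.engine` to `mr` are worth reviewing before you migrate. The Python sketch below is one possible way to find such lines in a HiveQL script. The helper name and the sample script are hypothetical, and the pattern only catches simple single-line `SET` statements.

```python
import re

# Matches e.g. "SET hive.execution.engine = mr;" (case-insensitive).
MR_ENGINE = re.compile(
    r"^\s*set\s+hive\.execution\.engine\s*=\s*mr\s*;?\s*$", re.IGNORECASE
)


def find_mr_engine_lines(script_text):
    """Return (line_number, line) pairs that pin the execution engine to MapReduce."""
    return [
        (n, line.strip())
        for n, line in enumerate(script_text.splitlines(), start=1)
        if MR_ENGINE.match(line)
    ]


# Hypothetical script contents for illustration.
sample = """\
SET hive.execution.engine=mr;
SELECT count(*) FROM web_sales;
set hive.execution.engine = tez;
"""

for n, line in find_mr_engine_lines(sample):
    print(f"line {n}: {line}")
```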
We encourage you to start using Hive 3.1.1 right away to benefit from the performance enhancements it delivers. If you don’t yet have a Qubole environment, you can try Hive 3.1.1 by signing up for a free 14-day Qubole trial.