APACHE HIVE

What is Apache Hive? 

Hive is an Apache open-source project built for querying, summarizing, and analyzing large data sets through a SQL-like interface. It is known for bringing the familiarity of relational technology to big data processing through its Hive Query Language (HiveQL), along with structures and operations comparable to those of relational databases, such as tables, JOINs, and partitions.

Apache Hive is particularly well suited to analyzing large data sets with complex JOIN conditions. Typical workloads include batch SQL processing, exploratory queries on large volumes of data, and queries that may be interrupted and need to be resumed.
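
As a rough illustration of the tables, JOINs, and partitions mentioned above, here is a minimal sketch in Python that only assembles HiveQL text; it does not connect to Hive or to Qubole. The tables page_views and users, their columns, and the helper views_by_country_query are hypothetical names chosen for this example.

```python
from datetime import date

# Illustrative HiveQL DDL (hypothetical table): the table is partitioned by
# day, so a query that filters on view_date can skip the other partitions.
CREATE_PAGE_VIEWS = """
CREATE TABLE IF NOT EXISTS page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
STORED AS ORC
""".strip()


def views_by_country_query(view_date: str) -> str:
    """Build a HiveQL JOIN query restricted to one daily partition."""
    # Parse the input as an ISO 8601 date and re-serialize it as YYYY-MM-DD,
    # so only a well-formed date is ever embedded in the query text.
    # date.fromisoformat raises ValueError for anything that is not a date.
    day = date.fromisoformat(view_date).isoformat()
    return (
        "SELECT u.country, COUNT(*) AS views\n"
        "FROM page_views v\n"
        "JOIN users u ON v.user_id = u.user_id\n"
        f"WHERE v.view_date = '{day}'\n"
        "GROUP BY u.country"
    )


if __name__ == "__main__":
    print(CREATE_PAGE_VIEWS)
    print(views_by_country_query("2024-01-15"))
```

The generated statements could then be submitted through whichever Hive interface is in use (UI, API, or driver). Filtering on the partition column is the design choice worth noting, since it is what lets Hive avoid scanning the whole table.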

HIVE IN BIG DATA

Qubole has provided a managed Hive service since 2013, with multiple Hive versions and a regular upgrade cadence. Hive on Qubole was designed with cloud optimizations from the beginning and is tailored to the needs of organizations that are either migrating to a cloud data lake or already have one deployed.

Qubole blends the latest features from the open-source community with Qubole’s proprietary solutions to boost performance, reduce costs, improve user experience, and simplify administration and management.

KEY BENEFITS OF APACHE HIVE ON QUBOLE

Fast Time to Value

  • Guided steps to create Hive clusters in minutes
  • Multiple interfaces to access data via UIs, APIs, and drivers

Cost Efficiency

  • Reduce overall data processing costs by up to 50% compared to self-managed infrastructure

Productivity with Improved Performance

  • Curated table metadata management
  • Performance optimization with cloud storage for faster query processing

Enterprise-Ready

  • Enterprise-grade security
  • JDBC/ODBC connectors integrated with mainstream BI tools

APACHE HIVE ON QUBOLE

Hive Autoscaling

Qubole | Open Source

  • Workload-aware autoscaling that adapts to variable and bursty workloads
  • Multiple HiveServer2 instances to accommodate burst traffic and increase service throughput

Hive Performance

Qubole | Open Source

  • Direct writes eliminate slower file copy operations in cloud storage
  • Faster cloud storage I/O
  • Metadata caching
  • Automatic statistics collection and management for better query planning and execution

Hive Cost Optimization

Qubole | Open Source

  • Automated cluster lifecycle management
  • Heterogeneous instances that take advantage of price differences across instance families while keeping clusters at peak efficiency
  • Container packing and aggressive downscaling when the cluster has only light usage
  • Specialized support for cost-optimal scaling

Hive Security and Compliance

Qubole | Open Source

  • SQL standards-based Hive authorization and Apache Ranger support
  • ACID transaction support
  • Compliance (HIPAA, SOC 2, ISO 27001)

Resources

  • Blog: Hive on Qubole runs 4x faster than Hive on AWS EMR
  • Doc: Hive Cheat Sheet