What is Apache Hive?

Apache Hive is an open-source project built on top of Hadoop for querying, summarizing, and analyzing large data sets through a SQL-like interface. It is noted for bringing the familiarity of relational technology to big data processing with its Hive Query Language, along with structures and operations comparable to those of relational databases, such as tables, joins, and partitions.
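As a brief illustration (the table and column names below are hypothetical, not taken from any particular deployment), Hive lets you define partitioned tables and query them with familiar SQL-style joins and aggregations:

```sql
-- Hypothetical fact table, partitioned by date so queries can prune data
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Hypothetical dimension table
CREATE TABLE users (
  user_id BIGINT,
  country STRING
)
STORED AS ORC;

-- A familiar SQL-style join and aggregation over a single partition
SELECT u.country, COUNT(*) AS views
FROM page_views pv
JOIN users u ON pv.user_id = u.user_id
WHERE pv.dt = '2024-01-01'
GROUP BY u.country;
```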

Apache Hive is used mostly for batch processing of large ETL jobs and batch SQL queries on very large data sets.

How does Hive fit into the QDS Landscape?

QDS gives you the freedom to work with Hive, Hadoop MapReduce, Spark, and Presto in one unified interface with unified metadata, so you can choose the right engine for each workload rather than being locked into a single technology. Hive and MapReduce are tried and proven for batch ETL and SQL workloads where reliability and stability are of the highest importance. Spark, by contrast, is well suited to machine learning and other use cases that benefit from in-memory data and fast response times, while Presto is a proven, scalable SQL engine for simple, interactive analysis at companies such as Facebook, Netflix, and Airbnb.

A self-managing and self-optimizing implementation of Hive with Qubole

Runs on your choice of popular public cloud infrastructure

Leverages the platform’s AIR (Alerts, Insights, Recommendations) capabilities to help data teams focus on outcomes instead of the platform

Agent technology augments the original Hive with a self-managing and self-optimizing platform:

Cloud-optimized for faster workload performance

  • Smarter object storage access, with optimized split computation, batching of writes, pre-fetching, and multiple caching layers including SSD caching
  • Use of YARN as the resource manager, with a Hive Metastore that can be shared across engines (Spark, Presto, and Hive)

Easier to integrate with existing data sources and tools

  • ODBC/JDBC drivers
  • Database connectors (MySQL, SQL Server, Oracle DB, RDS, Redshift, Kinesis, and many others)
  • Comprehensive set of REST APIs for application integration

Best-in-class security

  • HDFS and SSL encryption
  • SAML Authentication
  • VPC support
  • Dual IAM roles

QDS for AWS

QDS for Azure

QDS for Oracle Cloud

When should I use QDS for Hive?

Use QDS for Hive when you need batch processing of large ETL jobs or batch SQL queries on very large data sets.

Batch Processing for Extract, Transform, and Load (ETL)

One of the major benefits of Hive is the ability to extract, transform, and load (ETL) large data sets in Hadoop without writing complex MapReduce programs. Technical users can easily run batch ETL jobs that turn unstructured and semi-structured data into usable, schema-based tables. Hive is well suited to ETL thanks to its mapping tools and the Hive Metastore, which makes metadata for Hive tables and partitions easily accessible.
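A minimal sketch of such a job, assuming a hypothetical bucket path and table layout, might expose raw delimited files in cloud storage through an external table and rewrite them into a partitioned, columnar table:

```sql
-- Raw, semi-structured click logs landed in object storage (hypothetical path)
CREATE EXTERNAL TABLE raw_clicks (
  user_id BIGINT,
  url     STRING,
  ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://example-bucket/raw/clicks/';

-- Curated, schema-based target table, partitioned by day and stored as ORC
CREATE TABLE clicks (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- The transform-and-load step: Hive compiles this into distributed jobs,
-- so no hand-written MapReduce is required
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE clicks PARTITION (dt)
SELECT user_id,
       url,
       CAST(ts AS TIMESTAMP),
       to_date(ts) AS dt
FROM raw_clicks;
```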

Batch SQL Queries

Hive is designed for batch queries on very large data sets (petabytes of data and beyond). Data analysts run SQL-like queries against data stored in Hive tables to turn that data into business insight. The Hive Metastore contains schemas and statistics that are useful for data exploration, query optimization, and query compilation. Often, when traditional data stores can’t handle large SQL queries, users import the data into Hive and run their queries there.
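For example, reusing the hypothetical clicks table from the ETL sketch above, an analyst might gather statistics into the Metastore and then run a batch aggregation that prunes the scan to one month of partitions:

```sql
-- Collect table and column statistics into the Hive Metastore;
-- the optimizer consults them when compiling queries
ANALYZE TABLE clicks PARTITION (dt) COMPUTE STATISTICS;
ANALYZE TABLE clicks PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;

-- A typical batch SQL query: daily unique visitors per URL,
-- with the WHERE clause limiting the scan to one month of partitions
SELECT url,
       dt,
       COUNT(DISTINCT user_id) AS unique_visitors
FROM clicks
WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY url, dt
ORDER BY unique_visitors DESC
LIMIT 100;
```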

We collect events from our various systems via a Flume pipeline that writes data out to Amazon S3. From there, we use a data processing pipeline hosted by Qubole to process and aggregate statistics to Hive tables and to an AWS Redshift-based data warehouse. For easy access to the data for the entire company, we use Tableau to navigate through our tables and produce visualizations.

Prakash Janakiraman, Co-Founder and VP Engineering at NextDoor