as a Service
What is Apache Hive?
Hive is an Apache open-source project built on top of Hadoop for querying, summarizing and analyzing large data sets through a SQL-like interface. It is known for bringing the familiarity of relational technology to big data processing through its Hive Query Language (HiveQL), along with structures and operations comparable to those of relational databases, such as tables, joins and partitions.
Apache Hive is used mostly for batch processing of large ETL jobs and batch SQL queries on very large data sets.
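To make the relational comparison concrete, here is a minimal sketch in Python that composes a HiveQL query using a table join and a partition filter. The table names (orders, customers), the columns and the dt partition column are hypothetical and chosen only for illustration; the script builds the query text and does not submit it to any cluster.

```python
from datetime import date


def orders_by_country_hql(day: date) -> str:
    """Build a HiveQL query that joins two hypothetical tables.

    Filtering on the partition column (assumed here to be `dt`) lets Hive
    read only the matching partition instead of scanning the whole table.
    """
    dt = day.isoformat()  # formats the date as YYYY-MM-DD
    return (
        "SELECT c.country, COUNT(*) AS order_count\n"
        "FROM orders o\n"
        "JOIN customers c ON o.customer_id = c.customer_id\n"
        f"WHERE o.dt = '{dt}'\n"
        "GROUP BY c.country;"
    )


if __name__ == "__main__":
    print(orders_by_country_hql(date(2020, 1, 1)))
```

Taking a date object rather than a raw string keeps arbitrary text out of the generated query.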
How does Hive fit into the QDS Landscape?
QDS gives you the freedom to work with Hive, Hadoop MapReduce, Spark, and Presto through one unified interface with unified metadata. Choose the right solution for each workload rather than being locked into a single technology. Hive and MapReduce are tried and proven for batch ETL and SQL workloads where reliability and stability matter most. In contrast, Spark is well suited to machine learning and other use cases that benefit from in-memory data and fast response times, while Presto is a proven, scalable SQL engine for simple, interactive analysis at companies such as Facebook, Netflix, and Airbnb.
A self-managing and self-optimizing implementation of Hive with Qubole
- Runs on your choice of popular public cloud infrastructure
- Leverages the platform’s AIR (Alerts, Insights, Recommendations) capabilities to help data teams focus on outcomes instead of the platform
- Smarter object storage access for split computation, batching of writes, pre-fetching, and multiple caching layers, including SSD caching
- YARN as the resource manager, with a Hive metastore that can be used across engines (Spark, Presto, Hive)
- ODBC/JDBC drivers
- Database connectors (MySQL, SQL Server, Oracle DB, RDS, Redshift, Kinesis and many others)
- Comprehensive dictionary of REST APIs for application integration
- HDFS and SSL encryption
- SAML Authentication
- VPC support
- Dual IAM roles
When should I use QDS for Hive?
Hive is used mostly for batch processing of large ETL jobs and batch SQL queries on very large data sets.
Batch Processing for Extract, Transform and Load (ETL)
One of the major benefits of Hive is that you can extract, transform and load (ETL) large datasets in Hadoop without writing complex MapReduce programs. Technical users can run batch ETL jobs that transform unstructured and semi-structured data into usable, schema-based data. Hive suits ETL because of its mapping tools and because the Hive Metastore makes metadata for Hive tables and partitions easily accessible.
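As a rough illustration of such a batch ETL job, the Python sketch below generates HiveQL that exposes raw JSON lines as an external table and loads one day of them into a partitioned, schema-based table. Everything named here is an assumption made for the example: the tables raw_events and events, the JSON keys user_id, action and event_date, the dt partition column, and the storage location passed in by the caller. The script only produces the statements; how they are submitted depends on your setup.

```python
from datetime import date

RAW_TABLE = "raw_events"        # hypothetical table over raw JSON lines
CLEAN_TABLE = "events"          # hypothetical schema-based target table
FIELDS = ["user_id", "action"]  # hypothetical JSON keys to extract


def create_tables_hql(raw_location: str) -> str:
    """DDL for the raw external table and the partitioned target table."""
    columns = ", ".join(f"{name} STRING" for name in FIELDS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {RAW_TABLE} (payload STRING)\n"
        f"LOCATION '{raw_location}';\n"
        f"CREATE TABLE IF NOT EXISTS {CLEAN_TABLE} ({columns})\n"
        "PARTITIONED BY (dt STRING) STORED AS ORC;"
    )


def daily_load_hql(day: date) -> str:
    """Statement that rewrites one daily partition from the raw JSON."""
    dt = day.isoformat()
    # get_json_object is a built-in Hive function that extracts a value
    # from a JSON string by path.
    projections = ",\n       ".join(
        f"get_json_object(payload, '$.{name}') AS {name}" for name in FIELDS
    )
    return (
        f"INSERT OVERWRITE TABLE {CLEAN_TABLE} PARTITION (dt = '{dt}')\n"
        f"SELECT {projections}\n"
        f"FROM {RAW_TABLE}\n"
        f"WHERE get_json_object(payload, '$.event_date') = '{dt}';"
    )


if __name__ == "__main__":
    print(create_tables_hql("s3://example-bucket/raw/events/"))  # placeholder path
    print(daily_load_hql(date(2020, 1, 1)))
```

Using INSERT OVERWRITE on a single partition means a failed or repeated run replaces that day's data rather than appending duplicates, which is one common way to keep a batch job safe to rerun.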
Batch SQL Queries
Hive is designed for batch queries on very large data sets (petabytes of data and beyond). Data analysts run SQL-like queries against data stored in Hive tables to turn the data into business insight. The Hive Metastore contains schemas and statistics which are useful in data exploration, query optimization, and query compilation.
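The statistics mentioned above are typically gathered with Hive's ANALYZE TABLE statement. The following Python sketch, offered as one possible approach, builds the statements that refresh table-level and column-level statistics for a single partition and then display the stored metadata. The table name events and the dt partition column are hypothetical.

```python
from datetime import date
from typing import List


def refresh_stats_hql(table: str, day: date) -> List[str]:
    """Return HiveQL statements that update and inspect partition statistics.

    `table` is assumed to be a trusted identifier partitioned by a
    hypothetical `dt` column.
    """
    partition = f"PARTITION (dt = '{day.isoformat()}')"
    return [
        # Basic statistics such as row count and data size.
        f"ANALYZE TABLE {table} {partition} COMPUTE STATISTICS;",
        # Column-level statistics that the optimizer can use.
        f"ANALYZE TABLE {table} {partition} COMPUTE STATISTICS FOR COLUMNS;",
        # Show the schema and stored parameters for the partition.
        f"DESCRIBE FORMATTED {table} {partition};",
    ]


if __name__ == "__main__":
    for statement in refresh_stats_hql("events", date(2020, 1, 1)):
        print(statement)
```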
Often, when traditional data sources can’t handle the processing of large SQL queries, users can import data into Hive and then run their queries there.