Hive 101 explores the basics of the Hive solution, including the Hive Metadata, Hive SQL Syntax and relevant MapReduce settings.
Qubole Education Hive 101 includes “Try It” sections with instructions that you can follow in a personal account if you have access to the default_qubole_airline_origin_destination demonstration table. Please contact your administrator if you have issues accessing the sample datasets from your personal account.
Qubole Education Hive 101 contains quiz questions after several of the lessons; you must answer them to complete the lessons and the course.
- User 101
Estimated Time: 60 to 75 minutes
Hive Intro (1 of 3)
Hive Intro Hive is data warehouse software developed by Apache and designed to facilitate data access and manipulation in distributed storage systems. Hive was originally built on top of Apache Hadoop and executes queries with MapReduce to take advantage of the scalability and data storage features of a distributed file system. Hive features a [...]
Hive File Types (2 of 3)
Hive File Types Hive does not demand a particular file type; therefore, Qubole is able to read and write files of various structures, including text-based, columnar and binary. A list of the common types is provided for reference with a few comments from the Qubole documentation. Hive works very well with columnar formats since these [...]
Hive Optimizations (3 of 3)
Hadoop Vectorization Hadoop supports vectorization, which allows Hive to process batches of rows together, with each batch consisting of a column vector, usually an array of primitive types. Operations are performed on the entire column vector, improving instruction pipelining and cache usage. Qubole can take advantage of this optimization to improve cluster performance for [...]
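The difference between row-at-a-time and batched column processing can be sketched in Python. This is only an illustration of the data flow, not how Hive implements vectorization: the function names, the `delay` column and the batch size are hypothetical, and plain Python does not reproduce the pipelining or cache benefits.

```python
def filter_sum_row_at_a_time(rows, column, threshold):
    """Evaluate the filter and the sum one row at a time."""
    total = 0
    for row in rows:
        if row[column] > threshold:
            total += row[column]
    return total


def to_column_batches(rows, column, batch_size):
    """Yield one column vector (a list of primitives) per batch of rows."""
    for start in range(0, len(rows), batch_size):
        yield [row[column] for row in rows[start:start + batch_size]]


def filter_sum_batched(rows, column, threshold, batch_size=3):
    """Apply the same filter and sum to a whole column vector at once."""
    total = 0
    for vector in to_column_batches(rows, column, batch_size):
        total += sum(value for value in vector if value > threshold)
    return total


# Hypothetical sample data: one numeric column named "delay".
rows = [{"delay": d} for d in [5, 20, 0, 45, 12, 30, 3]]
batches = list(to_column_batches(rows, "delay", 3))
```

Both functions return the same result; the batched version simply operates on an array of values from a single column instead of on whole rows.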
Qubole Hive Metadata (1 of 4)
Qubole Hive Metadata Qubole lets customers create schema tables in the Qubole Hive Metadata for files held in cloud storage. The format of the associated file in cloud storage can be specified when the table is created in Qubole Hive. With a schema in place, users can write SQL queries inside of [...]
Hive File Partitioning (2 of 4)
Hive Partitioning Hive partitioning is an effective method to improve query performance on larger tables for queries designed around the partitioning key. Since partitioning stores data in separate subdirectories under the table location, the selection of the partition key should be driven by the cardinality of the attribute. If there is a relationship across [...]
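To make the subdirectory layout concrete, the following Python sketch builds the `key=value` directory names that Hive uses for partitions and shows how a filter on the partition key narrows the directories to read. The table location, the `year` and `quarter` keys and both helper functions are hypothetical and chosen only for illustration.

```python
def partition_path(table_location, partition_spec):
    """Build the key=value subdirectory path for one partition."""
    segments = [f"{key}={value}" for key, value in partition_spec.items()]
    return "/".join([table_location.rstrip("/")] + segments)


def prune(partition_dirs, key, value):
    """Keep only the directories that belong to the requested partition value."""
    wanted = f"{key}={value}"
    return [d for d in partition_dirs if wanted in d.split("/")]


# Hypothetical table partitioned by year and quarter.
location = "/warehouse/flights"
dirs = [
    partition_path(location, {"year": year, "quarter": quarter})
    for year in (2014, 2015)
    for quarter in (1, 2)
]
selected = prune(dirs, "year", 2015)
```

The number of directories is the product of the distinct values of each key (here 2 x 2 = 4), which is why a high-cardinality attribute produces a very large number of small subdirectories.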
Hive SQL Syntax (3 of 4)
Hive SQL Syntax It is important to keep in mind that all queries are converted into MapReduce jobs, and SQL commands that run quickly in a relational database may have surprising runtimes inside of Hive. Hive only supports equality-based joins; therefore, it may be necessary to move conditional statements which would normally reside [...]
Try It - Hive SQL Syntax (4 of 4)
Overview You will run two different SQL statements against the same table and evaluate the number of MapReduce jobs that are created in response to the SQL submitted. The SQL statements will be identical with the exception of the BY clause which will change across queries. You will then answer the associated quiz questions based on [...]
Hive MapReduce Settings (1 of 4)
Hive MapReduce Settings Qubole exposes a variety of configuration settings which, when modified, change the behavior of Hive running on Hadoop. This flexibility gives users the ability to fine-tune the number of mappers and reducers as well as the task and memory restrictions for specific queries and commands.
- hive.exec.reducers.bytes.per.reducer
- mapred.tasktracker.map.tasks.maximum
- hive.exec.reducers.max
- mapred.tasktracker.reduce.tasks.maximum
Modifying [...]
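As a rough illustration of how two of these settings interact, the Python sketch below models the reducer count as one reducer per `hive.exec.reducers.bytes.per.reducer` bytes of input, capped by `hive.exec.reducers.max`. This is a simplified model of the heuristic, not Qubole's or Hive's actual code; the function name and the sample values are assumptions, and real defaults vary by Hive version.

```python
def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Simplified estimate: ceil(input / bytes_per_reducer), clamped to [1, max_reducers]."""
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Integer ceiling division avoids floating-point rounding on large inputs.
    needed = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, needed))


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical 10 GiB input under three different settings.
coarse = estimate_reducers(10 * GIB, 1 * GIB, 999)    # few, large reducers
fine = estimate_reducers(10 * GIB, 256 * MIB, 999)    # more, smaller reducers
capped = estimate_reducers(10 * GIB, 1 * MIB, 100)    # limited by the maximum
```

Lowering the bytes-per-reducer value increases the number of reducers until the maximum is reached, after which only raising the maximum has any effect.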
Try It - Hive MapReduce Settings (2 of 4)
Overview You will run the same SQL statements against the same table and evaluate the number of Reducers that are created in response to the SQL submitted. The SQL statements will be identical with the exception of the MapReduce settings used. You will then answer the associated quiz questions based on the behavior observed. Code In the [...]
Qubole Hive Versions (3 of 4)
Qubole Hive Versions An account using Hadoop 2 can leverage Tez if Hive 1.2 is enabled. With Tez enabled, map output can be stored in memory instead of on disk, thereby improving the runtime of queries with multiple MapReduce jobs.
Hive Bootstrap & Connectors (4 of 4)
Hive Bootstrap When working with Hive there may be a need to add jar files, define temporary functions and set parameters or MapReduce settings. This can be accomplished through the addition of a Hive Bootstrap inside of Qubole which will be run before the query is submitted. If all queries need to leverage the same [...]