Blog

 
 
 

CUBE Keyword in Apache Hive

  • By Rajat Venkatesh
  • June 19, 2015
 

Introduction

As part of a recent project – I had to experiment with CUBE functionality in Hive. This functionality was added somewhat recently to Hive (version 0.10) and is an advanced use case in Hive. Perhaps for these reasons – it is difficult to find examples other than the one in the Hive Wiki. In this post – I am documenting some of my experiments in setting up a CUBE on TPCDS. I hope that this is useful for other users new to Hive and/or Cubes in Hive.

Data Model

For my experiments I used 500 GB scale TPCDS data set, a 10 node Hadoop cluster and Hive 0.13.1 running on Qubole Data Service (QDS).

My goal was to calculate various measures on store sales. More specifically, I wanted to calculate:

  • Total Extended Price
  • Total Sales Price
  • Total Net Profit
  • Total Wholesale Cost
  • Total Coupon Amt
  • Total List Price

These measures then need to be broken by many dimensions. For example we can drill down in the following dimensions (Levels in parentheses):

  • Date (Year, Quarter, Month, Day)
  • Store Information (Store Id)
  • Household Demographics (Number of Dependents, Buy Potential)
  • Customer Demographics (Gender, Marital Status, Education Status)
  • Ad Channel (TV, Event, Email)
  • Time (Hour, Minute)

The ER Diagram for the relevant tables is shown below. It’s a classic star schema.

Cube-Rollup-Example-crop-redux

Introduction to Cubes

This example is a typical dimensional data model found in OLAP. The data model describes the measures and the dimensions that make the data useful. Cubes are the physical implementations of dimensional data model. A cube captures the structure in the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.

Build the Cube

Preprocess the data
First, I filtered the store_sales table to contain data from 2002 onwards to keep execution times reasonable for my experiments.


# 2452276 is the id in date_dim for the row of Jan 1 2002
create table store_sales_2002_plus as select * from tpcds_orc_500.store_sales where ss_sold_date_sk >= 2452276

select count(*) from store_sales_2002_plus;
278035965

Create Cube

I created a cube to store dimensions and measures I am interested in.


create table store_sales_cube as select sum(ss_ext_sales_price) as sum_extended_price, 
             sum(ss_sales_price) as sum_sales_price, sum(ss_net_profit) as sum_net_profit,
             sum(ss_wholesale_cost) as sum_wholesale_cost, sum(ss_coupon_amt) as sum_coupon_amt,
             sum(ss_list_price) as sum_list_price, 
             d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status, grouping__id
      from store_sales_2002_plus join item on ss_item_sk = i_item_sk 
             join customer on ss_customer_sk = c_customer_sk 
             join date_dim on ss_sold_date_sk = d_date_sk 
             join customer_demographics on ss_cdemo_sk = cd_demo_sk 
             join promotion on ss_promo_sk = p_promo_sk 
             join household_demographics on ss_hdemo_sk = hd_demo_sk 
             join store on ss_store_sk = s_store_sk 
             join time_dim on ss_sold_time_sk = t_time_sk
        group by d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status
        with cube;

select count(*) from store_sales_cube;
1586304

The above query generates aggregates for all possible combinations of group by columns.
Schema of store_sales_cube is:

Column Data Type
sum_extended_price double
sum_sales_price double
sum_net_profit double
sum_wholesale_cost double
sum_coupon_amt double
sum_list_price double
d_year int
d_qoy int
d_moy int
d_date timestamp
s_store_id string
cd_gender string
cd_marital_status string
cd_education_status string
grouping__id string

A few example rows are shown below.

d_year d_qoy d_moy d_date s_store_id s_store_name cd_gender cd_marital_status cd_educational_status grouping__id Total Sales Price
2002 3 NULL NULL NULL NULL NULL NULL NULL 3 2.712…E9
NULL NULL NULL NULL NULL NULL NULL NULL NULL 0 1.00…E10
NULL NULL NULL NULL NULL NULL M NULL NULL 64 5.01..E9

The first row stores measures for d_year=2002, d_qoy=3 only. It is one of the rows in the result of


select d_year, d_qoy, sum(ss_sales_price), other aggregates
from store_sales_2002_plus
join date_dim on s_sold_date_sk = d_date_sk
group by d_year, d_qoy;

The second row stores the measures for the complete data set.

The third row stores the measures for cd_gender=M only. It is one of the rows in the result of


select cd_gender, sum(ss_sales_price), other aggregates
from store_sales_2002_plus
join customer_demographics on ss_cdemo_sk = cd_demo_sk 
group by cd_gender;

Grouping ID

Let’s say an analyst is interested in finding sum_sales_price by gender (cd_gender). How does the analyst find the rows that store the measures for cd_gender ?
grouping_id is useful to select rows based on the dimensions of interest. grouping_id is a column generated by Hive when CUBE keyword is used. I specified it in the project list to use it in subsequent queries. Grouping ID is a bit vector of the dimensions in a cube and is stored as a base10 integer. It is generated by listing the dimensions from right to left in the same order as the group by column in the cube create SQL. Bit 1 is assigned to the dimension that occurs in a row.
For rows that have measures for cd_gender, the bit vector is 001000000. The table below has a couple of more examples.

Group By Columns grouping_id cd_educational_status cd_marital_status cd_gender s_store_name s_store_id d_date d_moy d_qoy d_year
cd_gender 64 0 0 1 0 0 0 0 0 0
d_year, d_qoy 3 0 0 0 0 0 0 0 1 1
d_year,d_moy 5 0 0 0 0 0 0 1 0 1
d_year, d_qoy, cd_marital_status 131 0 1 0 0 0 0 0 1 1

Lets look at an example of rows for cd_gender dimension.


select cd_gender, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("001000000", 2, 10);
cd_gender total_sales_price
M 5.017321159230397E9
F 5.01904028465792E9

conv is a Hive function to convert a number (specified in a string) in a specified base (in this case 2) to an integer in another base (in this case 10). It takes the string, the base of the number in the string and the base of the result as arguments.

Total Sales Price for each quarter

Let us look at queries on the raw data and cube to calculate the measures. I will use the number of rows read as a measure of speed.


Query on raw data:
select d_year, d_qoy, sum(ss_sales_price)
from store_sales_2002_plus
join date_dim on s_sold_date_sk = d_date_sk
group by d_year, d_qoy;

Bytes Read: 776,170,833

Query on cube:
select d_year, d_qoy, sum(sum_sales_price)
   from store_sales_cube where 
   `grouping__id` = conv("000000011", 2, 10);
Bytes Read: 11,783,322

Query on the cube scanned 1.5% of the data compared to the query on raw data tables.

Total Sales for each quarter to married customers

Lets look at another example. The following query filters the results by another dimension – cd_marital_status


select d_year, d_qoy, cd_marital_status, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("010000011", 2, 10) and cd_marital_status = "M";

GROUPING Functions in other databases

Grouping functions is important to choose the right cells in a cube. Other databases have similar functions. For e.g. refer to GROUPING functions in Oracle or GROUPING_ID in SQL Server.

Summary

In summary, we looked at an example of multidimensional data generated by the CUBE keyword in Apache Hive. We also understood how to use GROUPING_ID to select the right cells in a cube.


Leave a Reply

Your email address will not be published. Required fields are marked *

Are you human? *

 
 
 
 
clear