# CUBE Keyword in Apache Hive

June 19, 2015 Updated September 11th, 2021

### Introduction

As part of a recent project – I had to experiment with CUBE functionality in Hive. This functionality was added somewhat recently to Hive (version 0.10) and is an advanced use case in Hive. Perhaps for these reasons – it is difficult to find examples other than the one in the Hive Wiki. In this post – I am documenting some of my experiments in setting up a CUBE on TPC-DS. I hope that this is useful for other users new to Hive and/or Cubes in Hive.

### Data Model

For my experiments, I used a 500 GB scale TPCDS data set, a 10 node Hadoop cluster, and Hive 0.13.1 running on Qubole Data Service (QDS).

My goal was to calculate various measures on store sales. More specifically, I wanted to calculate:

• Total Extended Price
• Total Sales Price
• Total Net Profit
• Total Wholesale Cost
• Total Coupon Amt
• Total List Price

These measures then need to be broken into many dimensions. For example, we can drill down in the following dimensions (Levels in parentheses):

• Date (Year, Quarter, Month, Day)
• Store Information (Store Id)
• Household Demographics (Number of Dependents, Buy Potential)
• Customer Demographics (Gender, Marital Status, Education Status)
• Ad Channel (TV, Event, Email)
• Time (Hour, Minute)

The ER Diagram for the relevant tables is shown below. It’s a classic star schema.

### Introduction to Cubes

This example is a typical dimensional data model found in OLAP. The data model describes the measures and the dimensions that make the data useful. Cubes are the physical implementations of dimensional data models. A cube captures the structure in the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.

### Build the Cube

Preprocess the data
First, I filtered the store_sales table to contain data from 2002 onwards to keep execution times reasonable for my experiments.

```
# 2452276 is the id in date_dim for the row of Jan 1 2002
create table store_sales_2002_plus as select * from tpcds_orc_500.store_sales where ss_sold_date_sk >= 2452276

select count(*) from store_sales_2002_plus;
278035965
```

Create Cube

I created a cube to store the dimensions and measures I am interested in.

```
create table store_sales_cube as select sum(ss_ext_sales_price) as sum_extended_price,
sum(ss_sales_price) as sum_sales_price, sum(ss_net_profit) as sum_net_profit,
sum(ss_wholesale_cost) as sum_wholesale_cost, sum(ss_coupon_amt) as sum_coupon_amt,
sum(ss_list_price) as sum_list_price,
d_year, d_qoy, d_moy, d_date, s_store_id,
cd_gender, cd_marital_status, cd_education_status, grouping__id
from store_sales_2002_plus join item on ss_item_sk = i_item_sk
join customer on ss_customer_sk = c_customer_sk
join date_dim on ss_sold_date_sk = d_date_sk
join customer_demographics on ss_cdemo_sk = cd_demo_sk
join promotion on ss_promo_sk = p_promo_sk
join household_demographics on ss_hdemo_sk = hd_demo_sk
join store on ss_store_sk = s_store_sk
join time_dim on ss_sold_time_sk = t_time_sk
group by d_year, d_qoy, d_moy, d_date, s_store_id,
cd_gender, cd_marital_status, cd_education_status
with cube;

select count(*) from store_sales_cube;
1586304
```

The above query generates aggregates for all possible combinations of groups by columns.
Schema of store_sales_cube is:

ColumnData Type
sum_extended_pricedouble
sum_sales_pricedouble
sum_net_profitdouble
sum_wholesale_costdouble
sum_coupon_amtdouble
sum_list_pricedouble
d_yearint
d_qoyint
d_moyint
d_datetimestamp
s_store_idstring
cd_genderstring
cd_marital_statusstring
cd_education_statusstring
grouping__idstring

A few example rows are shown below.

d_yeard_qoyd_moyd_dates_store_ids_store_namecd_gendercd_marital_statuscd_educational_statusgrouping__idTotal Sales Price
20023NULLNULLNULLNULLNULLNULLNULL32.712...E9
NULLNULLNULLNULLNULLNULLNULLNULLNULL01.00...E10
NULLNULLNULLNULLNULLNULLMNULLNULL645.01..E9

The first-row stores’ measures for d_year=2002, d_qoy=3 only. It is one of the rows in the result of

```
select d_year, d_qoy, sum(ss_sales_price), other aggregates
from store_sales_2002_plus
join date_dim on s_sold_date_sk = d_date_sk
group by d_year, d_qoy;
```

The second-row stores the measures for the complete data set.

The third-row stores the measures for cd_gender=M only. It is one of the rows in the result of

```
select cd_gender, sum(ss_sales_price), other aggregates
from store_sales_2002_plus
join customer_demographics on ss_cdemo_sk = cd_demo_sk
group by cd_gender;
```

### Grouping ID

Let’s say an analyst is interested in finding sum_sales_price by gender (cd_gender). How does the analyst find the rows that store the measures for cd_gender?
grouping_id is useful to select rows based on the dimensions of interest. grouping_id is a column generated by Hive when the CUBE keyword is used. I specified it in the project list to use in subsequent queries. Grouping ID is a bit vector of the dimensions in a cube and is stored as a base10 integer. It is generated by listing the dimensions from right to left in the same order as the group by column in the cube create SQL. Bit 1 is assigned to the dimension that occurs in a row.
For rows that have measures for cd_gender, the bit vector is 001000000. The table below has a couple of more examples.

Group By Columnsgrouping_idcd_educational_statuscd_marital_statuscd_genders_store_names_store_idd_dated_moyd_qoyd_year
cd_gender64001000000
d_year - d_qoy3000000011
d_year - d_moy5000000101
d_year - d_qoy - cd_marital_status131010000011

Let’s look at an example of rows for the cd_gender dimension.

```
select cd_gender, sum_sales_price
from store_sales_cube where
`grouping__id` = conv("001000000", 2, 10);
```

cd_gendertotal_sales_price
M5.017321159230397E9
F5.01904028465792E9

Conv is a Hive function to convert a number (specified in a string) in a specified base (in this case 2) to an integer in another base (in this case 10). It takes the string, the base of the number in the string, and the base of the result as arguments.

### Total Sales Price for each quarter

Let us look at queries on the raw data and cube to calculate the measures. I will use the number of rows read as a measure of speed.

```
Query on raw data:
select d_year, d_qoy, sum(ss_sales_price)
from store_sales_2002_plus
join date_dim on s_sold_date_sk = d_date_sk
group by d_year, d_qoy;

Query on cube:
select d_year, d_qoy, sum(sum_sales_price)
from store_sales_cube where
`grouping__id` = conv("000000011", 2, 10);
```

Query on the cube scanned 1.5% of the data compared to the query on raw data tables.

### Total Sales for each quarter to married customers

Let’s look at another example. The following query filters the results by another dimension – cd_marital_status

```
select d_year, d_qoy, cd_marital_status, sum_sales_price
from store_sales_cube where
`grouping__id` = conv("010000011", 2, 10) and cd_marital_status = "M";
```

### GROUPING Functions in other databases

Grouping functions is important to choose the right cells in a cube. Other databases have similar functions. For e.g. refer to GROUPING functions in Oracle or GROUPING_ID in SQL Server.

### Summary

In summary, we looked at an example of multidimensional data generated by the CUBE keyword in Apache Hive. We also understood how to use GROUPING_ID to select the right cells in a cube.

• ### Blog Subscription

Get the latest updates on all things big data.

### QUBOLE LIVE DEMO: Google Cloud Platform (GCP) Enables You To Simplify Today and Future Proof for Tomorrow

Jan. 27, 2022 | Global

### Data Lake and Data Warehouse – A modern data strategy discussion

Feb. 2, 2022 | Online

### QUBOLE LIVE DEMO: Stop The Cloud Cost Madness With Graviton and AWS. Switch And Save to Reduce Your Data Lake Costs Today

Feb. 3, 2022 | Global

### CONTINUOUS INTELLIGENCE DAY – Continuous Intelligence in Finance 2022 and beyond

Feb. 24, 2022 | Global

### Data Innovation Summit MEA 2022

Mar. 7, 2022 | Global

### Data2030 Summit 2022 – APAC Edition – Data Strategies For Data And AI-Driven Organisations

May. 24, 2022 | Global