Blog

Qubole Security Model for Authorization in AWS Cloud

October 17, 2017 by and Updated November 29th, 2018

Qubole has defined a new security model to improve enterprise-level security and data governance on the cloud. The model integrates the cloud vendor’s storage authorization with Hive authorization. This improves usability for both cloud-storage administrators and data administrators (DBA), while eliminating errors that arise from end-user authorization problems.

This is an important milestone on the way to Qubole’s goal of building a secure, enterprise-level cloud platform. Qubole is one of the first vendors to add cloud storage-level checks at query compile time, and consequently offers the most secure platform in the cloud.

History of Database Security Model

In traditional databases (RDBMS and NoSQL), the database had complete control over the catalog, compute and storage, so administrators and users used the database as the single source of truth for authentication and authorization.

In early versions of Apache Hive, the catalog (Hive Metastore), compute (M/R) and storage (HDFS) were separate. Seamless authorization was not possible because each system had to administered independently. Apache Hive released SQL Standard Based Hive Authorization in Hive 0.13, which integrated all the systems. This mechanism runs storage-level checks through the Metastore server and checks the GRANT tables during query compile. As a prerequisite, a user must be registered with Hive and HDFS, and both must have a single view of users and their credentials. Similarly, Apache Ranger and Apache Sentry both provide HDFS-level file permission checks when the same user is trying to access tables through SQL.

Cloud Security Model

In the case of public clouds, data is stored in the cloud, and not in HDFS. Authorization on cloud storage is managed by the cloud vendor. SQL Standard Based Hive Authorization in Hive does not work in this case because there is no unified view of users and credentials.

Cloud Data Access Control

Consider the typical representation of a Hive table in the cloud depicted the diagram above:

A Hive table consists of files in the cloud storage, and catalog information in the Hive Metastore. Consequently, users have to be assigned Roles/Keys for cloud storage as well as USER or ROLE with the Hive database. The authorization modules of the cloud storage and the database are separate, and it is very hard to keep them synchronized. A mapping or coordination is needed to manage the pairs of roles and use them effectively.

Qubole Security Model

Qubole has integrated cloud storage authorization with Hive authorization by introducing a new security model. In QDS, users can choose between two models:

L1: User uses cloud storage permissions to control table access. User may also combine SQL statements to define access policy and rules, but will mainly use such statements for error-proofing (e.g. in the case of  a user accidentally dropping a table).

We’ll discuss the two models in detail below, using AWS S3 storage as an example.

L1: Cloud Storage Authorization(AWS)

If the organization has 2 tables:

TableUser AccessS3 location
AUser1s3://org/datawarehouse/tables/A
BUser2s3://org/datawarehouse/tables/B

What administrator needs to define in User1 and User2’s IAM policy:

For user1’s IAM-role:

...

{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::org/datawarehouse/tables/A/*"
},

...

For user2’s IAM-role:

...

{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::org/datawarehouse/tables/B/*"
},

...

Some issues with L1:

Fine-Grained Data Access ControlCannot support because users get access to the entire set of files
Storage Administrator MaintenanceNeed to create IAM roles per user or user group to define data directory access according to table access

 

In the case of any directory restructure, need to update the IAM policies

AuditingAt file level, which could be verbose (imagine a join across multiple tables, multiple partitions)
DBAEither do nothing or redo authorization policies via SQL statement as error-proofing

L2: Qubole SQL Authorization

L2’s design principle is to provide Database Administrators with a unified view of database and storage permissions. No co-ordination with cloud storage administrators is needed.

Cloud Compute Nodes

Qubole Hive implements this design principle by assigning a separate role to the compute nodes (IAM-C). This role provides access to all the data.

Qubole Hive then executes the following checks query compile time:

  • GRANT tables in metastore
  • Storage-level checks with the user’s IAM role (IAM-A) as needed for location-related DDLs only.

The compute role and query compile-time checks provide a seamless way for database administrators to define authorization policies without needing to make  any changes in cloud storage policies.

Advantages L2 vs. L1:

Fine-Grained Data Access ControlAchievable through Hive Views
Storage Administrator MaintenanceNo need to define data directory access according to table access in each IAM role setup

 

No work for any data directory restructure in the protected data warehouse bucket or folder.

AuditingAt table level through Hive logs
DBAUsing Standard SQL to control user access

L2: Limitations

The L2 model  covers data access only through metadata policies in QDS.

Qubole recommends customers also secure their cluster access outside of Qubole (e.g. direct access to clusters or storage) to build a completely secure environment.

Use Case

A big organization has three departments:  Sales, Marketing and Finance. These   need to have different privileges for access to tables stored in the protected data area.

Tables
customerstore_salespromotion
SalesRead/WriteRead/WriteRead
MarketingReadReadRead/Write
FinanceReadReadRead

Storage Configuration

3 tables are stored in S3 data warehouse which is a protected area
Protected Data Area:
s3a://datalake/datawarehouse/customer/          (customer table)
s3a://datalake/datawarehouse/store_sales/       (store_sales table)
s3a://datalake/datawarehouse/promotion/         (promotion table)

IAM Role Creation

IAM-Cluster: has full access to                                         s3a://datalake/datawarehouse
IAM-AccountDefault: has access only to                     s3a://datalake/defloc/
if admin wants to add a temp folder for everyone:          s3a://datalake/temp

Then follow Dual-IAM Role Documentation to set up Dual-IAM role for the account

QDS Account Setup

All users can belong to the same QDS account.  Here’s a user list that the administrator needs to grant permission to.

User Group / DepartmentUser
Finance[email protected]
Marketing[email protected]
Sales[email protected]

On the Account Settings page in the QDS UI, put IAM-AccountDefault role credentials in the setting page.

Access Mode SS

HiveQL DDL Privilege Setup

After Hive Authorization is enabled for the account.

Set role admin;

Use demo_database;

 

Create role sales;

Create role marketing;

Create role finance;

 

Grant select, insert on customer to role sales;

Grant select on customer to role finance;

Grant select on customer to role marketing;

 

Grant select, insert on store_sales to role sales;

Grant select on store_sales to role finance;

Grant select on store_sales to role marketing;

 

Grant select on promotion to role sales;

Grant select on promotion to role finance;

Grant select, insert on promotion to role marketing

 

GRANT role finance to USER john;

GRANT role sales to USER david;

GRANT role marketing to USER mary;

Future Work

Other engines (Spark, Presto, etc.) and applications (e.g. Zeppelin Notebook)  will support the L2 Model over the next couple releases. We’ll write blogs to elaborate on the mechanism and provide use cases for these engines to demonstrate our cross-engine security solution for data authorization.