Share Data Across Accounts with Data Exchange

  • By Xing Quan
  • November 4, 2015
 

This post was written by Vikram Agrawal and Aswin Anand, who are both lead engineers at Qubole.

Qubole has the concept of users and accounts. Customers sign in as individual users, and a user can belong to one or more accounts. This account segregation provides nice logical separation for compute clusters and metadata. By default, a cluster started by Qubole is shared across all the users in the account. Similarly, the metastore that describes the databases, tables, and underlying table structure is shared across all the users in the account and is accessible by all the clusters in the account.

We’ve heard requests from our customers who want to share their data across Qubole accounts. In some cases, customers are okay with making a full copy of the data available across accounts, but that movement of data can be complex, time-consuming, and expensive: it’s not easy to copy petabytes of data. In other cases, customers want to maintain ownership of the data and simply delegate read access to it.

 

Data Exchange on AWS

We’re excited to announce the availability of Data Exchange on AWS for easily sharing your data across Qubole accounts! Data Exchange can be used for use cases such as the following:

  • Publishing a specific data set to a subscriber so they can run self-service analytics. This is common among advertising technology companies where a publisher will make the full results of an advertising campaign available to a customer. The customer can then do analyses such as audience correlation and lookalike modeling using Qubole’s suite of analytics offerings, including Hive, Presto, Spark, Pig, and Hadoop MapReduce.
  • One-off data sharing across accounts within the same company or group. Our customers will sometimes open multiple accounts for administrative purposes across departments. This is helpful for accounting and cost allocation (i.e., they can easily identify how much compute each department is using) as well as for logical data isolation. Data Exchange preserves that isolation while still allowing one-off sharing of specific tables or databases across departments.

The basic building block of Data Exchange is the Space. A Space is a logical container for Hive tables that are meant to be shared. For the rest of this blog post, we’ll refer to the publisher as the account that shares data and the subscriber as the account that consumes that data.

Data Exchange never moves or copies the underlying data. Rather, a copy of the metadata entry is shared with the subscriber account, so ownership of the data stays with the publisher’s account. The data can also be configured as read-only for the subscriber.

 

Walkthrough For Publisher

Getting started is easy. First, you’ll need to create a Space. You can do this by going to the Explore section of our web interface and choosing Data Exchange from the dropdown list.


As the publisher, you’ll create a new Space within My Spaces by hitting the + button. Enter a unique name for the Space (Space names live in a global namespace across all Qubole accounts) and an S3 location where Qubole will store some metadata related to Data Exchange.


Next, you’ll export a Hive table to your Space. Within Explore, select Qubole Hive from the dropdown list. From here, select a database and then a table within that database. You’ll see the properties for that table. In the upper-right corner of this screen, you’ll see a gear icon for settings. From this menu, select Export to Space.


Now, choose the Space that you recently created and hit the Export button.


AWS Console Steps For Publisher

Next, you’ll move over to the AWS console to configure cross-account access for your S3 data. The subscriber needs read access to the metadata saved within the Space, as well as to the data location of the shared Hive table. As the publisher, you’ll need to set the following policy in your AWS account. Here, we’ve named the policy cross-account-access-policy.

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "Stmt1444052180000",
           "Effect": "Allow",
           "Action": "s3:GetObject",
           "Resource": [
               "arn:aws:s3:::data-exchange-blog-post-demo/path-to-space-s3-location/*",
               "arn:aws:s3:::bucket-1/path-to-hive-data/*"
           ]
       },
       {
           "Sid": "Stmt144405218001",
           "Effect": "Allow",
           "Action": [
               "s3:GetBucketLocation",
               "s3:ListBucket"
           ],
           "Resource": [
               "arn:aws:s3:::data-exchange-blog-post-demo*",
               "arn:aws:s3:::bucket-1*"
           ]
       }
   ]
}

 

Once the policy is in place, you’ll need to create a cross-account role that the subscriber can assume. We’ve named the example role below role-for-subscriber.

For the role type, you’ll select the second option, “Allows IAM users from a 3rd party AWS account to access this account”.


Next, you’ll add in the 12-digit AWS account number for the subscriber and a descriptive external ID for easy identification.
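For reference, the trust relationship that ends up on this role should look roughly like the following. This is only a sketch of the standard cross-account pattern: the account number placeholder is the subscriber’s, and we use subscriber1 as the external ID (the name we also use later in this post).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<12-digit AWS account number for subscriber>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "subscriber1"
                }
            }
        }
    ]
}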


Finally, you need to attach the policy we created earlier, cross-account-access-policy, to the role we just created.


Now, you’re ready to share the relevant details with the subscriber. You’ll share the role_arn, which for this example is arn:aws:iam::<12-digit AWS account number>:role/role-for-subscriber. The 12-digit AWS account number is for the publisher’s account. You’ll also share the external ID, which we named subscriber1.

 

Walkthrough For Subscriber

As a subscriber (a different Qubole account from the publisher account), you’ll need to start in the AWS IAM console. You’ll add an IAM policy so your account can subscribe to the publisher’s Space using the role they created for you.

{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::<12-digit AWS account for publisher>:role/role-for-subscriber"
    }
}

 

Next, go to Explore in Qubole and select Data Exchange from the dropdown list. Under Other Spaces, click the + button to add a Space, and enter the information provided to you by the publisher.


Once this Space is added, it becomes visible under the Other Spaces section. Selecting a Space shows the list of Hive tables that can be imported by the subscriber. To import a table, click the dropdown in the right pane in the same row as the Hive table name and click “Import”.

As soon as the table is imported, it becomes visible in the Qubole Hive metastore. If the imported table is partitioned, an alter table recover partitions query should be run from the Analyze screen before the table can be used; otherwise, the table can be queried directly.
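For example, assuming the imported table were a partitioned table named shared_campaign_data (a hypothetical name used only for illustration), the queries run from the Analyze screen would look something like this:

-- Recover partition metadata for the imported table (only needed if the table is partitioned)
ALTER TABLE shared_campaign_data RECOVER PARTITIONS;

-- After that, the table can be queried like any other Hive table
SELECT COUNT(*) FROM shared_campaign_data;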

And now, you’re done subscribing to a table – congratulations!

Data Exchange is generally available now for all Qubole customers. You can read more about it in our documentation. Try it out, and we’d love to hear your feedback!

