About Qubole Open Data Lake Platform
Qubole is an open and secure data lake platform for machine learning, streaming, and ad-hoc analytics.
Qubole enhanced its platform security by adding support for AWS PrivateLink in 2019. Qubole with AWS PrivateLink makes it easy to connect services across different AWS accounts and VPCs, and significantly simplifies network architecture. When a customer configures the Qubole Platform through AWS PrivateLink connectivity, the traffic between the Qubole VPC and the customer’s VPC does not traverse the public internet.
AWS PrivateLink Overview
AWS PrivateLink is supported by a Virtual Private Cloud (VPC) endpoint, a logical networking component provided within the VPC.
A VPC is a virtual network dedicated to your AWS account. It is logically isolated from other virtual networks in the AWS Cloud.
Figure 1.1: source https://aws.amazon.com/privatelink
There are two types of VPC endpoints as per the AWS console: gateway endpoints and interface endpoints.
A VPC gateway endpoint is a gateway that you specify as a target for a route in your route table for traffic destined to a supported AWS service such as Amazon S3 and DynamoDB.
A VPC interface endpoint is an Elastic Network Interface (ENI), which is a logical networking component in a VPC that represents a virtual network card with a private IP address from the IP address range of your subnet. The ENI serves as an entry point for traffic destined for a supported service within the VPC. Interface endpoints are powered by AWS PrivateLink, a technology that enables you to privately access services by using private IP addresses.
VPC Endpoints are virtual devices. They are horizontally scaled, redundant, and highly available VPC components. They allow communication between instances in your VPC and services without imposing availability risks or bandwidth constraints on your network traffic.
The two VPCs can be in the same AWS account or in different AWS accounts. Connecting VPC endpoints via AWS PrivateLink does not require an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. When VPC endpoints are connected via AWS PrivateLink, the traffic between the two VPCs does not leave the Amazon network. This helps simplify your network architecture.
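As a rough illustration of the consumer side, the sketch below builds the request for an interface endpoint and submits it through the boto3 EC2 `create_vpc_endpoint` call. The helper names and all IDs are hypothetical, and the client is passed in as an argument so the snippet makes no assumption about how credentials or regions are configured.

```python
def interface_endpoint_request(vpc_id, service_name, subnet_ids, security_group_ids):
    """Build keyword arguments for an interface-type VPC endpoint."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Service name published by the provider of the endpoint service.
        "ServiceName": service_name,
        # One ENI with a private IP address is placed in each listed subnet.
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }


def create_interface_endpoint(ec2_client, vpc_id, service_name,
                              subnet_ids, security_group_ids):
    """Create the endpoint and return its ID.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    request = interface_endpoint_request(
        vpc_id, service_name, subnet_ids, security_group_ids
    )
    response = ec2_client.create_vpc_endpoint(**request)
    return response["VpcEndpoint"]["VpcEndpointId"]
```

With boto3 installed, a call might look like `create_interface_endpoint(boto3.client("ec2"), vpc_id="vpc-...", service_name="...", subnet_ids=[...], security_group_ids=[...])`, with the placeholders replaced by real values from the consumer's account.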
Importance of Security at Qubole
Qubole is committed to providing a secure, reliable, and performant cloud-native platform for customers.
Qubole uses a multi-layer approach to protect the confidentiality, integrity, and availability of customer data. We follow best practices in security and governance to deliver enterprise-ready capabilities. Qubole provides secure access and protects data artifacts with encryption, Role-Based Access Controls (RBAC), auditing, and compliance with industry and governmental regulations such as SOC2 Type2, HIPAA, and GDPR.
Qubole protects the data at rest, in transit, and in use, by providing secure access to the platform and encrypting all data and metadata.
PrivateLink Technical Details
Qubole, from its AWS account, creates and manages the cluster’s lifecycle. The customer’s AWS account becomes a service provider of the AWS services (APIs) consumed by Qubole.
The following figure represents a typical setup without PrivateLink, where a bastion host is configured in the public subnet and traffic between the two accounts goes through the internet. Even though the bastion host is hardened to withstand attacks, the traffic still crosses the internet, and some businesses may not want to take that risk.
When Qubole’s VPC and the customer’s VPC are connected via PrivateLink, Qubole’s VPC sends traffic to the customer’s VPC over a private and secured connection without going through the internet. This helps increase the security of the architecture.
In addition, Qubole recommends an S3 endpoint so that clusters can access Amazon S3 (in the same region) over a private and secured connection (Amazon’s network) without going through the internet.
Though the complete setup requires the customer account to have a VPC with private and public subnets, the PrivateLink connection between Qubole’s VPC and the customer’s VPC is between two private subnets. Therefore, the traffic between the two accounts does not leave the Amazon network, as shown in the diagram above.
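The S3 endpoint recommended above is a gateway endpoint, which is attached to route tables rather than to subnets. The following is a minimal sketch of how such an endpoint could be requested with the boto3 EC2 `create_vpc_endpoint` call. The function names and IDs are hypothetical, and the `com.amazonaws.<region>.s3` service name format is assumed to apply to the region in use (it does for standard commercial regions).

```python
def s3_gateway_endpoint_request(vpc_id, region, route_table_ids):
    """Build keyword arguments for an S3 gateway endpoint."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Assumed service name format for standard commercial regions.
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Routes to S3 are added to these route tables.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(ec2_client, vpc_id, region, route_table_ids):
    """Create the endpoint and return its ID.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    request = s3_gateway_endpoint_request(vpc_id, region, route_table_ids)
    response = ec2_client.create_vpc_endpoint(**request)
    return response["VpcEndpoint"]["VpcEndpointId"]
```

Passing the route tables of the private subnets that host the clusters keeps S3 traffic from those subnets on the Amazon network.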
Here are the high-level steps to set up PrivateLink between Qubole’s VPC and the customer’s VPC:
- Create a VPC with public and private subnets (ensure multiple availability zones)
- Create a bastion host (an EC2 instance with inbound and outbound rules restricted to specific ports) within one of the private subnets
- Create and attach a security group to the bastion host allowing inbound traffic on port 22 from the private subnets of the VPC
- Create a Network Load Balancer (NLB) with the internal scheme and associate it with the VPC and (preferably more than one) private subnets of the VPC, allowing communication over port 22
- Create a new target group and add the newly created bastion host as one of the targets
- Add a listener on the NLB that forwards to the newly created target group
- Create a new VPC “Endpoint Service” and associate the newly created NLB with it
- Allow Qubole to communicate with the newly created “Endpoint Service” by whitelisting Qubole as one of the trusted principals
- Provide the name of the “Endpoint Service” (that is, the “Service Name”) to Qubole
- When Qubole sends a connection request to the “Endpoint Service”, accept the request
These steps make Qubole’s VPC endpoint (ID) an accepted entry under “Endpoint Connections” of the customer-created “Endpoint Service”. Qubole then consumes AWS services (APIs) via PrivateLink to create and manage clusters within the customer’s private subnet; that is, all traffic between the Qubole and customer VPCs remains within the AWS network.
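The NLB and endpoint service steps above can be sketched with boto3 roughly as follows. This is an illustration under stated assumptions, not Qubole’s official setup script: the function names, the `name` prefix, and every ID are hypothetical, the VPC, subnets, bastion host, and security group are assumed to exist already, and the clients are passed in so that they can be regular `boto3.client("elbv2")` and `boto3.client("ec2")` objects. A real script would probably also wait for the NLB to become active (for example with the elbv2 `load_balancer_available` waiter) before creating the endpoint service.

```python
def expose_bastion_via_privatelink(elbv2, ec2, *, vpc_id, private_subnet_ids,
                                   bastion_instance_id, allowed_principal_arn,
                                   name="qubole-bastion"):
    """Put an internal NLB in front of the bastion and publish it as an endpoint service."""
    # Internal Network Load Balancer spanning the private subnets.
    nlb_arn = elbv2.create_load_balancer(
        Name=f"{name}-nlb", Type="network", Scheme="internal",
        Subnets=list(private_subnet_ids),
    )["LoadBalancers"][0]["LoadBalancerArn"]

    # Target group for SSH (TCP port 22) with the bastion host as its target.
    tg_arn = elbv2.create_target_group(
        Name=f"{name}-tg", Protocol="TCP", Port=22,
        VpcId=vpc_id, TargetType="instance",
    )["TargetGroups"][0]["TargetGroupArn"]
    elbv2.register_targets(TargetGroupArn=tg_arn,
                           Targets=[{"Id": bastion_instance_id}])

    # Listener on port 22 that forwards to the target group.
    elbv2.create_listener(
        LoadBalancerArn=nlb_arn, Protocol="TCP", Port=22,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
    )

    # Endpoint service backed by the NLB; each connection must be accepted.
    service = ec2.create_vpc_endpoint_service_configuration(
        NetworkLoadBalancerArns=[nlb_arn], AcceptanceRequired=True,
    )["ServiceConfiguration"]

    # Allow the consumer's principal (supplied by the caller) to connect.
    ec2.modify_vpc_endpoint_service_permissions(
        ServiceId=service["ServiceId"],
        AddAllowedPrincipals=[allowed_principal_arn],
    )
    # The service name is what gets shared with the consumer.
    return service["ServiceId"], service["ServiceName"]


def accept_pending_connections(ec2, service_id):
    """Accept every connection request that is still pending; return their endpoint IDs."""
    connections = ec2.describe_vpc_endpoint_connections(
        Filters=[{"Name": "service-id", "Values": [service_id]}]
    )["VpcEndpointConnections"]
    pending = [c["VpcEndpointId"] for c in connections
               if c["VpcEndpointState"] == "pendingAcceptance"]
    if pending:
        ec2.accept_vpc_endpoint_connections(ServiceId=service_id,
                                            VpcEndpointIds=pending)
    return pending
```

The first function returns the service name to hand over to Qubole; the second would be run once the connection request arrives. The principal ARN to allow is a value the customer would obtain from Qubole, so it is left as a parameter here.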
PrivateLink provides an easy-to-establish, secure connection between VPCs across AWS accounts that keeps traffic and data private.
By leveraging PrivateLink, Qubole provides the following benefits:
- Simplify your architecture
- Securely consume cloud-based services (SaaS offerings)
- Secure your traffic and data as it does not traverse via the public internet
- Save time and reduce the possibility of network/security misconfiguration (no firewall rules, route tables, etc. to maintain)
- Maintain compliance
- Simplify your network management between services across VPCs (between different accounts)
- Remove the need to set up an IGW, a NAT device, public VPC peering, or a VPN connection
- AWS Direct Connect is not required, but its use is supported