Introducing Qubole Release 57

October 29, 2019; updated February 28, 2021

Each month, about an exabyte of data is processed using Qubole’s data platform on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, for a variety of use cases spanning data engineering pipelines, data science/machine learning, and advanced analytics. This not only requires the vast computing power of the cloud and the proven autoscaling and automation of Qubole, but it also demands a continuous flow of enhancements, fixes, and new features to support the ever-increasing data processing needs of our customers.

To that end, release 57 (R57) brings many new capabilities and enhancements that help simplify and improve the efficiency and performance of your data processing projects. This blog provides highlights within each category—administration, data engineering, and data science—and links to further details.


Infrastructure Administration

Release 57 brings improvements in administration productivity, cost control and reliability of the cloud infrastructure:

  • Qubole’s advanced cluster management and low-cost compute capabilities have enabled customers to operate clusters at high ratios of low-cost compute instances (AWS Spot instances, GCP preemptible VMs, or Azure low-cost VMs). To support these configurations reliably, Qubole now allows separate master and worker node configurations for a cluster. For example, you can now run the master on an On-Demand instance while the workers run entirely on low-cost instances.
  • R57 improves support for AWS EC2 Spot Blocks, which provide higher reliability than Spot nodes, albeit at a slightly higher cost. To minimize any adverse impact on workloads or queries, Qubole can now proactively and automatically replace Spot Blocks that are close to expiring with new instances that have more time left in their blocks.
  • When operating on a shared infrastructure, query performance varies with the demand on, and health of, that infrastructure. To enable better planning, R57 gives administrators live cluster health metrics and cleanup activity from within the cluster details user interface, as well as from end-user applications such as Workbench.
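
The proactive Spot Block replacement described above can be pictured as a simple selection step: find the nodes whose block expires soon, and replace those first. The sketch below is only an illustration under stated assumptions, not Qubole's implementation. The `SpotBlockNode` class, the `nodes_to_replace` function, and the lead-time parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class SpotBlockNode:
    """Hypothetical model of a worker node backed by a fixed-duration Spot Block."""
    node_id: str
    block_expires_at: datetime


def nodes_to_replace(
    nodes: List[SpotBlockNode], now: datetime, lead_time: timedelta
) -> List[SpotBlockNode]:
    """Return nodes whose block expires within `lead_time` of `now`, soonest first.

    `lead_time` is an assumed tuning knob: how far ahead of expiry a
    replacement should be requested so running work is not interrupted.
    """
    cutoff = now + lead_time
    expiring = [n for n in nodes if n.block_expires_at <= cutoff]
    return sorted(expiring, key=lambda n: n.block_expires_at)


if __name__ == "__main__":
    now = datetime(2019, 10, 29, 12, 0)
    cluster = [
        SpotBlockNode("worker-1", now + timedelta(minutes=50)),
        SpotBlockNode("worker-2", now + timedelta(minutes=10)),
        SpotBlockNode("worker-3", now + timedelta(hours=3)),
    ]
    for node in nodes_to_replace(cluster, now, timedelta(hours=1)):
        print(node.node_id, "expires at", node.block_expires_at)
```

A real scheduler would also have to account for how long a new instance takes to join the cluster and for work still running on the expiring node. Those concerns are left out of this sketch.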

Security Administration

R57 provides data access controls, role-based access controls, and governance controls:

  • New Apache Ranger integration with Apache Spark for row- and column-level data access controls, in addition to the existing Ranger support for Hive and Presto.
  • Enhanced security for cluster orchestration through AWS PrivateLink and restrictive user privileges.
  • Simplified user governance with directory services integration.

Data Engineering

R57 further simplifies data engineering tasks to support data restatements and achieve regulatory compliance:

  • ACID transaction support on data lakes using Hive 3.1.1 (a project that we open-sourced). Read our technical blog and product blog for further details on reads using Presto and Spark, and on row-level updates and deletes to address GDPR / CCPA requirements for Right To Erasure (RTE) and Right To Be Forgotten (RTBF).
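
Row-level erasure of the kind mentioned above comes down to a transactional delete keyed on the data subject. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a transactional table, so it shows the pattern rather than the Hive or Qubole API. The `events` table, its columns, and the `erase_user` helper are hypothetical.

```python
import sqlite3


def erase_user(conn: sqlite3.Connection, user_id: str) -> int:
    """Delete every row belonging to `user_id` atomically; return rows removed."""
    # The connection context manager commits on success and rolls back
    # if an exception is raised, so the erasure is all-or-nothing.
    with conn:
        cur = conn.execute("DELETE FROM events WHERE user_id = ?", (user_id,))
    return cur.rowcount


def demo() -> sqlite3.Connection:
    """Build a small in-memory table (hypothetical schema) to erase from."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("u1", "login"), ("u2", "login"), ("u1", "purchase")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = demo()
    print("rows erased:", erase_user(conn, "u1"))
    print("remaining:", conn.execute("SELECT * FROM events").fetchall())
```

On a transactional Hive table, the corresponding operation would be a `DELETE FROM ... WHERE ...` statement issued through the engine. The point of ACID support is that readers never observe a half-applied erasure.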

Data Science

Release 57 delivers new applications, new debugging tools, and incremental performance improvements:

  • Integration of RStudio Pro with Spark clusters (versions 2.2 and 2.3), in partnership with RStudio. This integration allows data scientists to use RStudio Pro hosted on the cluster master, which is accessible from the Resources Link section on the clusters page.
  • Support for public and private installations of Enterprise GitHub and GitLab with Qubole Notebooks for development and for continuous integration and continuous delivery (CI/CD).
  • Qubole’s Spark tuning tool, SparkLens, is now available as open source and is also integrated into Workbench for profiling and optimizing Spark jobs. Read more here.
  • Revised troubleshooting guides with updated tips, techniques to fix common issues, and new articles based on customer feedback. Read more here.

Cloud-specific Capabilities

Google Cloud Platform

  • Custom Commit Plan on the GCP Marketplace, which allows you to purchase Qubole using private quotes, with customized terms of service and integrated billing with Google.
  • Enhancements to Qubole’s support for Preemptible VMs, including the option to fall back to On-Demand instances, rebalancing, and automatic Preemptible node rotation for better loss handling.
  • Support for Presto 0.208 for data discovery and petabyte-scale queries, with full support for Qubole’s native workload-aware autoscaling capabilities. Presto is now available through Workbench or JDBC drivers.
  • Encryption support for data at rest based on Google’s Cloud Key Management Service; plus in-flight encryption between cluster nodes.
  • The Qubole Control Tier is now available in the European Union to address data locality needs.

Microsoft Azure

  • Support for Azure Data Lake Storage Gen2 from Presto, Spark, and Hive, including access controls based on each user’s Azure Active Directory permissions.
  • The Qubole Control Tier is now available in the European Union to address data locality needs.

Congratulations to the Engineering and Product teams that made R57 possible, and to our customers, who provided invaluable feedback during the testing phases.

To learn more about R57, please refer to the Release Notes in our product documentation, and let us know what you think via the “Send Feedback” button at the top right of the Qubole user interface.
