Introducing Qubole Release 59

May 26, 2020. Updated March 22nd, 2024

Qubole ships a major release of its software for processing petabytes of data in the cloud once a quarter. These releases are in addition to the hotfixes and quick fixes that go out frequently to fix production bugs and deliver smaller features. Qubole Release 59 (R59) is our second major release of 2020.

This major release comprises several big features, bug fixes, and performance enhancements across Qubole’s open data lake platform, such as:

  • Workbench improvements and General Availability (GA) across Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) for Data Analysts.
  • Airflow enhancements and Qubole Pipelines Service for Data Engineers.
  • Package management and Jupyter enhancements for Data Scientists.
  • Several cluster administration enhancements, including integration with Azure Pre-Provisioned VM Service, Spot Block for autoscaling nodes, and changes to master node composition.
  • Several data processing engine enhancements, including:
    • ACID Capabilities: Read support added for all data processing engines.
    • Spark: Spark ACID DataSource 0.5.0 release (read/write).
    • Hive: Hive 1.2 deprecation; a more robust and performant Hive 3.1.1; Hive 2.3.6 upgrade with support for Tez 0.9; and Hive ACID support for read/write.
    • Presto: Presto 317 GA; enhancements for next-generation cloud-agnostic ODBC/JDBC drivers; multiple automated workload management and operational governance capabilities; improvements to dynamic filtering; and ACID read support.

The sections below describe these features in more detail.

For Data Analysts

We are proud to announce the general availability of Workbench in AWS, Azure, and GCP environments. Key highlights include:

  • Gradual rollout of the collections and live cluster health metrics features is in progress.
  • Ability to list contents of AWS S3 buckets from any region in the Workbench storage tab.

For Data Engineers

Qubole Pipelines Service is now available as an open beta to all users on AWS and GCP. We have made several additions since the last ‘closed beta’ release:

  • Ease-of-management: Ability to edit or upgrade pipelines.
  • Lower TCO: SLA-aware autoscaling recommendations to either lower or increase the size of the cluster.
  • Dedicated Apache Spark Streaming cluster to run Pipelines Service. This provides better isolation between batch and streaming workloads, as well as easier cluster and user management.
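
To make the idea of an SLA-aware sizing recommendation concrete, here is a toy heuristic in Python. It is not Qubole's algorithm: the function name `recommend_cluster_size`, the `headroom` threshold of 0.5, and the assumption that batch processing time scales inversely with node count are all illustrative.

```python
import math


def recommend_cluster_size(current_nodes, batch_seconds, sla_seconds, headroom=0.5):
    """Suggest a node count from observed batch time versus the SLA.

    Hypothetical heuristic: assumes processing time scales inversely
    with the number of nodes, which real workloads only approximate.
    """
    if current_nodes < 1 or batch_seconds <= 0 or sla_seconds <= 0:
        raise ValueError("inputs must be positive")

    utilization = batch_seconds / sla_seconds
    if utilization > 1.0:
        # SLA is being missed: grow enough to bring batch time under the SLA.
        return math.ceil(current_nodes * utilization)
    if utilization < headroom:
        # Well under the SLA: shrink so projected utilization lands
        # near the headroom threshold.
        return max(1, math.ceil(current_nodes * utilization / headroom))
    # Inside the comfortable band: keep the current size.
    return current_nodes
```

With these assumptions, a 10-node cluster taking 90 seconds per batch against a 60-second SLA would be advised to grow, while an 8-node cluster taking 15 seconds would be advised to shrink.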

We are proud to announce the beta release of Apache Airflow 1.10.9, which is compatible with Python 3.7. The latest release contains an important bug fix; please refer to the Changelog for details.

The Airflow CLI tool is now available as Open Source Software (OSS). This tool makes managing and deploying Airflow projects faster and smoother. Users can generate a project structure and boilerplate code, and even deploy the entire project, using this CLI. For complete documentation and to download the tool, click here.

Git is now integrated with Airflow clusters through DAG explorer. This brings Continuous Integration / Continuous Deployment (CI/CD) of Airflow DAGs together with your version control. In the Airflow cluster’s DAG explorer view, just specify the Git project and folder for your DAGs. All changes will be synced with the Airflow cluster associated with the Git location.

For Data Scientists

Jupyter V2 Enhancements:

New out-of-the-box Visualizations: QViz (Qubole Visualization Library) is a front-end extension exclusively built by Qubole to allow data from Spark to be rendered as various charts and plots in the User Interface (UI).

Intelligent Code Complete with contextual docstring help: Intelligent Code Complete offers intelligent code suggestions directly in Qubole’s Jupyter-based Notebooks. Contextual docstring help completes this experience by providing guidance while you author Notebooks.

Notebook workflows with %run: The %run command allows chaining of Notebooks to stitch together a sequential Extract-Transform-Load (ETL) workflow in a wrapper Notebook that can be scheduled. It also allows the inclusion/concatenation of Notebooks.

Package Management Enhancements

User experience (UX) improvements: Qubole Package Management provides improved UX with the following capabilities:

  • “View live logs” for a better debugging experience.
  • “Download logs from activity history” provides access to archived logs and improves the debugging experience.
  • “Restore package from activity history” provides the ability to restore previously installed versions of the packages.

Private PyPI / Conda Channels and upload of Wheel / Egg packages in Package Management: Qubole Package Management now supports management and use of private Conda and PyPI Channels and provisioning of Wheel / Egg files in Spark clusters.

Cluster Administration Enhancements

In Release 59, we have added a number of features for cluster administration that reduce Total Cost of Ownership (TCO) and improve overall ease of use. Major highlights include:

Capacity-optimized Spot allocation: Qubole heterogeneous clusters will automatically launch Spot instances into the {instance_type, AZ} pool with the highest available capacity, based on real-time capacity data.
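
As a rough illustration of the selection rule, the sketch below picks the pool with the highest capacity score from a set of candidates. The function name `pick_spot_pool` and the capacity numbers are made up for the example; how capacity data is actually obtained and scored is not covered here.

```python
def pick_spot_pool(capacity_by_pool):
    """Return the (instance_type, availability_zone) pool with the most capacity.

    `capacity_by_pool` maps (instance_type, az) tuples to a capacity score.
    The scores are hypothetical; a real system would derive them from
    the cloud provider's capacity data.
    """
    if not capacity_by_pool:
        raise ValueError("no candidate pools")
    # Rank the pools (dict keys) by their capacity score.
    return max(capacity_by_pool, key=capacity_by_pool.get)


# Illustrative candidates for a heterogeneous cluster.
pools = {
    ("r5.xlarge", "us-east-1a"): 40,
    ("r5.xlarge", "us-east-1b"): 75,
    ("r4.xlarge", "us-east-1a"): 60,
}
best = pick_spot_pool(pools)
```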

Spot Block for autoscaling nodes: Qubole now allows you to choose Spot Block nodes for autoscaling nodes. Customers can replace On-Demand nodes with Spot Block nodes in autoscaling nodes without changing the master and minimum worker node composition.

Master node composition: Qubole now allows users to configure master node composition (On-Demand, Spot) separately from Minimum Worker Nodes.

Multi-cloud Enhancements

Integration with Azure Pre-Provisioned VM Service: This new service from Azure pre-provisions a pool of VMs per region based on usage. Qubole is now integrated with this service to reduce variability in cluster start and autoscale times.

Data Processing Engines

Spark

  • Spark ACID DataSource 0.5.0, compatible with Spark 2.4.3. With the new release, the data source uses Spark’s native reader, adds support for streaming writes, adds SQL support for Update and Delete, and includes improvements in transactionality.
  • Improvements and fixes to dynamic partition pruning and adaptive operators in Spark 2.4.3, which lead to significant performance improvements.

Hive

  • Hive 1.2 is marked deprecated in the UI from R59 onwards. Customers have the option to upgrade to Hive 2.3.6 or Hive 3.1.1. We also encourage customers to use Tez, as MR is deprecated in OSS Hive.
  • All Qubole Managed Metastore database schemas are upgraded to Hive 2.3.
  • Hive 3.1.1 is now more robust and performant. A number of patches from OSS Hive (4.x) are backported to boost performance by 10-15% and improve the overall experience in terms of reliability and stability.
  • Hive 2.3.6 upgrade and support for Tez 0.9: Hive 2.3 is updated with the latest patches from OSS and now supports the use of Tez 0.9 with improved performance, reduced cost, and higher stability.

Presto

  • Presto 317 is now generally available in all Qubole environments. In our benchmarks, Presto 317 provided up to 15% out-of-the-box performance improvement over Presto 0.208 for TPC-DS (Scale: 1000) queries.
  • Enhancements in Presto for next-generation ODBC/JDBC drivers.
  • Multiple automated workload management and operational governance capabilities in Presto for improved performance, reliability, and TCO reduction. These include enhanced admission control based on embedded soft concurrency limits to prevent overwhelming small clusters, as well as fundamental changes to Qubole’s workload-aware autoscaling.
  • Many improvements to Dynamic Filtering, including:
    • The ability to limit max partitions per table scan at runtime, which allows relaxing the compile-time setting (hive.max-partitions-per-scan).
    • Extension of the dynamic filtering optimization to semi joins, to take advantage of a selective build side in queries with an IN clause.
    • Support for predicate push-down of dynamic filters in the ORC and Parquet readers to reduce data scanned.
    • Efficiency improvements to dynamic row filtering just after the table scan to save on network IOPS and memory.
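
The general idea behind dynamic filtering can be sketched in a few lines of Python. This is a simplified illustration, not Presto's implementation: the function `dynamic_filter_join`, the dictionary standing in for a partitioned table, and the sample data are all hypothetical. The point is that join keys collected from the small (build) side at runtime let the engine skip probe-side partitions that cannot produce a match.

```python
def dynamic_filter_join(build_rows, probe_partitions, key):
    """Join probe rows to build rows, skipping probe data that cannot match.

    build_rows: list of dicts from the small (build) side.
    probe_partitions: dict mapping a partition's key value to its rows,
        standing in for a table partitioned on `key`.
    Returns (joined_rows, partitions_scanned).
    """
    # Step 1: collect the join keys that actually occur on the build side.
    build_index = {}
    for row in build_rows:
        build_index.setdefault(row[key], []).append(row)

    joined, scanned = [], 0
    for part_value, rows in probe_partitions.items():
        # Step 2: the dynamic filter. Partitions whose key is absent
        # from the build side are never read.
        if part_value not in build_index:
            continue
        scanned += 1
        # Step 3: ordinary hash join on the surviving rows.
        for probe_row in rows:
            for build_row in build_index[probe_row[key]]:
                joined.append({**build_row, **probe_row})
    return joined, scanned


# A selective build side (one store) against three sales partitions.
stores = [{"store_id": 2, "city": "Pune"}]
sales = {
    1: [{"store_id": 1, "amount": 10}],
    2: [{"store_id": 2, "amount": 25}, {"store_id": 2, "amount": 5}],
    3: [{"store_id": 3, "amount": 7}],
}
rows, scanned = dynamic_filter_join(stores, sales, "store_id")
```

In this toy run only one of the three partitions is scanned, which is the kind of saving a selective build side makes possible.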

In summary, Release 59 offers a wide range of new and enhanced capabilities for improved productivity and reduced TCO. To learn more about R59, please refer to What’s New on discover.qubole.com and the Release Notes in our product documentation. Let us know what you think via the “Send Feedback” button on the top right of the Qubole user interface.
