Spark

In this release, Spark on Qubole provides various new features and enhancements in terms of performance, usability, debuggability, and security.

Some of the key features are listed below:

Note

On the QDS UI, Spark 2.3-latest now points to Spark 2.3.2. QDS clusters running the 2.3-latest version start running Spark 2.3.2 after a cluster restart.

New Features

Amazon S3 Select Integration

SPAR-2932: Amazon S3 Select is now integrated with Spark on Qubole, and works for CSV and JSON data sources and tables.

With this integration, S3 Select can read S3-backed tables created on CSV or JSON files. CSV or JSON tables and file formats are automatically converted to an S3 Select optimized format for faster and more efficient data access, which improves the performance of queries on CSV and JSON data sources.

This feature is supported on Spark 2.4 and later versions. It is disabled by default.
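As a rough illustration of why storage-side selection can reduce data transfer, the sketch below simulates the idea locally with the Python standard library. It does not call S3 or Spark. The CSV contents, the `select_csv` helper, and the column names are all hypothetical, and the byte counts only show the general effect of returning a filtered subset instead of the whole object.

```python
import csv
import io

# Hypothetical CSV object standing in for a file stored in S3.
rows = [("id", "country", "amount")] + [
    (str(i), "US" if i % 10 == 0 else "DE", str(i * 3)) for i in range(1000)
]
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
full_object = buffer.getvalue()


def select_csv(text, columns, predicate):
    """Mimic a storage-side SELECT <columns> ... WHERE <predicate> on CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in csv.DictReader(io.StringIO(text)):
        if predicate(record):
            writer.writerow([record[c] for c in columns])
    return out.getvalue()


# Without selection, the reader receives the whole object and filters locally.
bytes_without = len(full_object.encode("utf-8"))

# With selection, only the requested column of the matching rows is returned.
selected = select_csv(full_object, ["amount"], lambda r: r["country"] == "US")
bytes_with = len(selected.encode("utf-8"))

print(bytes_without, bytes_with)
```

The gain depends on how selective the query is: a query that needs most rows and columns benefits little from this kind of filtering.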

Python UDF Pushdown

SPAR-3106: Python UDF pushdown is optimized to improve join performance by pushing the UDF down below the join when the joined output is larger than the individual tables. This feature is supported on Spark 2.4 and later versions.
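The effect can be sketched with a small pure-Python simulation, shown below. It is not Spark code and does not reflect how the optimizer is implemented. The tables, the `slow_udf` stand-in, and the `inner_join` helper are hypothetical. The sketch only counts how often a UDF runs when it is applied after a join that fans out, compared with applying it to one input table before the join.

```python
from collections import defaultdict


def slow_udf(value, counter):
    """Stand-in for an expensive Python UDF; counter records how often it runs."""
    counter["calls"] += 1
    return value.upper()


def inner_join(left, right):
    """Plain hash join on the first field of each tuple."""
    index = defaultdict(list)
    for key, payload in right:
        index[key].append(payload)
    return [(key, lval, rval) for key, lval in left for rval in index[key]]


# Hypothetical tables: every key repeats on both sides, so the join fans out.
left = [(k, f"l{k}-{i}") for k in range(3) for i in range(4)]   # 12 rows
right = [(k, f"r{k}-{i}") for k in range(3) for i in range(5)]  # 15 rows

# Plan A: join first, then apply the UDF to the left column of every joined row.
after_join = {"calls": 0}
plan_a = [(k, slow_udf(l, after_join), r) for k, l, r in inner_join(left, right)]

# Plan B: apply the UDF to the left table first (pushdown), then join.
pushed_down = {"calls": 0}
left_with_udf = [(k, slow_udf(l, pushed_down)) for k, l in left]
plan_b = inner_join(left_with_udf, right)

print(len(plan_a), after_join["calls"], pushed_down["calls"])
```

Both plans produce the same rows, but the pushed-down plan calls the UDF once per input row rather than once per joined row.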

Qubole Job History Server Upgrade

SPAR-3053: The multitenant Qubole Job History Server, which serves the logs and history of Spark jobs run on terminated clusters, is now upgraded to Spark 2.3. By default, the offline Spark History Server (SHS) is set to 2.3.1.

Support for Hive Authorization Admin Commands

SPAR-2786: Spark on Qubole now supports Hive admin commands that let users grant privileges such as SELECT, UPDATE, INSERT, and DELETE to other users or roles. This feature is available via Support and is disabled by default.

The following commands are supported:

  • Set role
  • Grant privilege (SELECT, INSERT, DELETE, UPDATE or ALL)
  • Revoke privilege (SELECT, INSERT, DELETE, UPDATE or ALL)
  • Grant role
  • Revoke role
  • Show Grant
  • Show current roles
  • Show roles
  • Show role grant
  • Show principals for role

This feature is supported on Spark 2.4 and later versions.
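For orientation, the supported commands listed above correspond to statements of the following shape. This is a minimal sketch based on the syntax of Apache Hive's SQL standard-based authorization. The exact forms accepted by Spark on Qubole are an assumption here, and the role, table, and user names are placeholders. The statements follow the order of the list, not a recommended execution order.

```python
# Hypothetical names used only for illustration.
ROLE = "analyst_role"
TABLE = "orders"
USER = "alice"

# One example statement per supported command, in the order listed above.
statements = [
    f"SET ROLE {ROLE}",
    f"GRANT SELECT ON TABLE {TABLE} TO USER {USER}",
    f"REVOKE SELECT ON TABLE {TABLE} FROM USER {USER}",
    f"GRANT {ROLE} TO USER {USER}",
    f"REVOKE {ROLE} FROM USER {USER}",
    f"SHOW GRANT USER {USER} ON TABLE {TABLE}",
    "SHOW CURRENT ROLES",
    "SHOW ROLES",
    f"SHOW ROLE GRANT USER {USER}",
    f"SHOW PRINCIPALS {ROLE}",
]

for statement in statements:
    # Assumption: in a Spark session each string would be submitted as SQL.
    print(statement)
```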

Enhancements

  • SPAR-3060: If you are using custom packages, you can remove the package from the Edit Cluster settings page for a Spark cluster and select any supported Spark version. By default, version 2.3 is selected.
  • SPAR-3003: The cluster AMI now includes the PyArrow package to support Pandas UDFs and the related performance improvements in Spark 2.3.1. This enhancement is available via Support and is disabled by default for Spark 2.3.1; it is enabled by default for Spark 2.4 and later versions.
  • SPAR-2649: You can dynamically change the minimum and maximum executor values of a running Spark application from the Executors tab of the Spark Application UI. This feature is supported on Spark 2.3.1 and later versions.

Bug Fixes

  • SPAR-2989: Executor log links now work in all scenarios: a running application, a completed application that ran on the same or a different cluster, and the offline SHS. This issue is fixed in Spark 2.3.1, 2.3.2, and 2.4.0.
  • SPAR-3059: For native ORC with DirectFileOutputCommitter, if a task failed after writing partial files, the reattempt also failed with FileAlreadyExistsException and the job failed. This issue is fixed in Spark 2.4.
  • SPAR-3191: Hive authorization failed when a paragraph was run on a Spark 2.3 cluster. This issue is now fixed in Spark 2.3.

For a list of bug fixes between versions R54 and R55, see Changelog for api.qubole.com.