Apache Spark

Apache Spark is a fast, in-memory data processing engine that allows data teams to run a range of workload types, such as streaming, machine learning or interactive data exploration, that require fast iterative access to datasets.

A self-managing and self-optimizing implementation of Spark

Qubole offers the first Autonomous Data Platform implementation of the Apache Spark open source project.

Runs on your choice of popular public cloud infrastructure

Leverages the platform’s AIR (Alerts, Insights, Recommendations) capabilities to help data teams focus on the outcome, instead of the platform

Cloud Agent technology augments original Spark with a self-managing and self-optimizing platform:

Cloud-optimized for faster workload performance

Smarter object storage access: split computation, batched writes, pre-fetching, and multiple caching layers, including SSD caching

Easier to integrate with existing data sets and tools

  • ODBC/JDBC drivers
  • Database connectors (MySQL, SQL Server, Oracle DB, RDS, Redshift, Kinesis and many others)
  • Comprehensive dictionary of REST APIs for application integration
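
As a rough sketch of what reading from one of these databases can look like in a notebook, the Python snippet below builds the options for Spark's built-in JDBC data source. The host, database, table and credentials are hypothetical placeholders, the MySQL driver class is an assumption about which connector jar is on the cluster's classpath, and nothing here is specific to Qubole's own connectors.

```python
from urllib.parse import quote


def mysql_jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's built-in JDBC data source."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{quote(database)}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumed driver class (MySQL Connector/J 8); adjust to the installed jar.
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def load_table(spark, options):
    """Read the table into a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
opts = mysql_jdbc_options("db.example.com", 3306, "sales", "orders", "reader", "secret")
print(opts["url"])
```

In a notebook that already provides a `spark` session, `load_table(spark, opts)` would return the table as a DataFrame. Keeping the option map separate from the read call makes it easy to swap in another database by changing only the URL and driver class.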

Built-in Notebooks-as-a-service support

  • Spark History Server allows easy debugging even after the cluster is shut down
  • Spark Job Server for caching data and reusing Spark applications
  • Easy configuration of Interpreters
  • Easy sharing of Notebooks among users, with access rights control
  • GitHub integration

Extensive range of libraries supported

  • MLlib (machine learning)
  • Spark AMI with deep learning libraries
  • GraphX (graph processing)
  • Spark SQL
  • Spark Streaming
  • SparkR (for using Spark from R)

Best-in-class security

  • HDFS and SSL encryption
  • SAML Authentication
  • VPC support
  • Dual IAM roles

AWS

Microsoft Azure

Oracle Cloud

Supported Versions

AWS: 1.6.2, 2.0.0, 2.0.2

Azure, Oracle: 2.0.2, 2.1.0

Spark in Qubole Documentation