APACHE SPARK

NEW! Spark 3.3 is now available on Qubole. Qubole’s multi-engine data lake combines ease of use with cost savings. Now powered by Spark 3.3, it’s faster and more scalable than ever.

Apache Spark is a high-performance, distributed data processing engine that has become a widely adopted framework for machine learning, stream processing, batch processing, ETL, complex analytics, and other big data projects. Qubole has supported Apache Spark-as-a-Service since 2014 and has contributed several major projects (SparkLens) and optimizations (RubiX) back to the open-source community.

Spark Developer Guide

SPARK DATA LAKE

Qubole combines the biggest benefits of Spark (scalability, processing speed, and a choice of languages) with an enterprise-ready data platform built to handle petabyte scale. With Qubole you can use your interface of choice (Notebooks, Web Console, SDK, or API) to build applications using Scala, Java, Python, or R. Qubole Spark runs some of the largest and most efficient clusters in the cloud, scaling from 10 to 1000 nodes and back down in minutes.

BENEFITS OF SPARK ON QUBOLE

Apache Spark Costs

Advanced cost controls result in up to a 50% reduction in costs with Qubole

Apache Spark Performance

Performance optimizations and smart management tools that increase Spark processing efficiency

Learning Spark

Qubole makes Spark easier to use by automating back-end configuration and other day-to-day processes

Enterprise Spark

Enterprise-grade security, JDBC/ODBC connectors to enterprise data sources, and third-party integrations

Apache Spark on Qubole vs. Open Source Apache Spark

Spark Scalability

	Apache Spark on Qubole	Apache Spark
Spot Bidding
Graceful Spot Shutdown
Spot Rebalancing
Workload-Aware Autoscaling
Aggressive Downscaling with graceful decommissioning
Container Packing
Heterogeneous Clusters
Per-second billing
Advanced Multi-tenancy

Spark Performance

	Apache Spark on Qubole	Apache Spark
Faster Reads
Faster Writes
Compute Optimization for joins and filters
Fault isolation of compute resources
S3 Direct writes optimization
S3 listing optimization
Metadata Caching
RubiX (distributed caching)

Spark Workspaces

	Apache Spark on Qubole	Apache Spark
Multiple languages (PySpark, Spark SQL, Scala, etc.)
Multiple data sources (S3, Redshift, Snowflake)
Versioning
Scheduling
Dashboarding
Collaboration and sharing

Spark Debugging and Profiling

	Apache Spark on Qubole	Apache Spark
Profiling (SparkLens)
Monitoring (Ganglia, Datadog, etc.)
Intelligent Log Access

Spark Security

	Apache Spark on Qubole	Apache Spark
Access control for notebooks, clusters, jobs, structured data
Audit end-user activity logs
SSO with SAML 2.0 support
Data encryption (at rest and in motion)
HIPAA, SOC 2 Type 2, and ISO 27001 compliant environments

Spark Integrations

	Apache Spark on Qubole	Apache Spark
Connect to BI tools over authenticated ODBC/JDBC (Tableau, Looker, etc.)
REST API (Talend, Informatica, RStudio, Airflow, Oozie, etc.)
Data Source Connectors (Snowflake, Redshift, Kafka, Kinesis)

Service & Support

	Apache Spark on Qubole	Apache Spark
24/7 support from our Spark experts
Runs multiple versions of Apache Spark
