Apache Spark on Qubole

Apache Spark is a high-performance, distributed data processing engine that has become a widely adopted framework for machine learning, stream processing, batch processing, ETL, complex analytics, and other big data projects. Qubole has offered Apache Spark as a service since 2014 and has contributed projects such as SparkLens and optimizations such as RubiX back to the open-source community.


Apache Spark on Qubole: Built for the Cloud

Qubole combines the biggest benefits of Spark (scalability, processing speed, and a choice of languages) with an enterprise-ready data platform built to handle petabyte scale. With Qubole you can use your interface of choice (Notebooks, Web Console, SDK, or API) to build applications in Scala, Java, Python, or R. Qubole Spark runs some of the largest and most efficient clusters in the cloud, scaling from 10 to 1000 nodes and back down in minutes.

Four Key Benefits of Apache Spark on Qubole

Cost-Efficiency

Advanced cost controls result in up to a 50% reduction in costs with Qubole

Improved Performance

Performance optimizations and smart management tools that increase Spark processing efficiency

Ease of Use

Qubole makes Spark easier to use by automating back-end configuration and other day-to-day processes

Enterprise-Ready

Enterprise-grade security, JDBC/ODBC connectors to enterprise data sources, and 3rd party integrations.

Apache Spark on Qubole vs. Open Source Apache Spark

Scalability

Apache Spark on Qubole | Apache Spark
Spot Bidding
Graceful Spot Shutdown
Spot Rebalancing
Workload-Aware Autoscaling
Aggressive Downscaling with graceful decommissioning
Container Packing
Heterogeneous Clusters
Per-second billing
Advanced Multi-tenancy
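To make the per-second billing item concrete, here is a minimal sketch of how billing granularity changes the cost of a short-lived node. The hourly rate and runtime are hypothetical numbers chosen for illustration, and `billed_cost` is an illustrative helper, not part of any Qubole or cloud-provider API.

```python
import math


def billed_cost(runtime_seconds: int, hourly_rate: float, granularity_seconds: int) -> float:
    """Cost of one node when usage is rounded up to the billing granularity."""
    billed_units = math.ceil(runtime_seconds / granularity_seconds)
    billed_seconds = billed_units * granularity_seconds
    return billed_seconds * hourly_rate / 3600


# Hypothetical example: a node priced at $0.40/hour that runs for 10 minutes.
runtime = 10 * 60
rate = 0.40

per_hour = billed_cost(runtime, rate, 3600)   # usage rounded up to a full hour
per_second = billed_cost(runtime, rate, 1)    # usage billed for the seconds used

print(f"hourly billing:     ${per_hour:.4f}")
print(f"per-second billing: ${per_second:.4f}")
```

With coarse granularity, a node that is released after a few minutes is still charged for the full unit. This is why billing granularity matters most for clusters that scale down often.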

Performance

Apache Spark on Qubole | Apache Spark
Faster Reads
Faster Writes
Compute Optimization for joins and filters
Fault isolation of compute resources
S3 Direct writes optimization
S3 listing optimization
Metadata Caching
RubiX (distributed caching)

Workspaces

Apache Spark on Qubole | Apache Spark
Multiple languages (PySpark, Spark SQL, Scala, etc.)
Multiple data sources (S3, Redshift, Snowflake)
Versioning
Scheduling
Dashboarding
Collaboration and sharing

Debugging and Profiling

Apache Spark on Qubole | Apache Spark
Profiling (SparkLens)
Monitoring (Ganglia, Datadog, etc.)
Intelligent Log Access

Security

Apache Spark on Qubole | Apache Spark
Access control for notebooks, clusters, jobs, structured data
Audit end-user activity logs
SSO with SAML 2.0 support
Data encryption (at rest and in motion)
HIPAA, SOC 2 Type 2, and ISO 27001 compliant environments

Integrations

Apache Spark on Qubole | Apache Spark
Connect with BI tools with authenticated ODBC/JDBC (Tableau, Looker, etc.)
REST API (Talend, Informatica, RStudio, Airflow, Oozie, etc.)
Data Source Connectors (Snowflake, Redshift, Kafka, Kinesis)

Service & Support

Apache Spark on Qubole | Apache Spark
24/7 support from our Spark experts
Runs multiple versions of Apache Spark