What is Apache Spark?

NEW! Spark 3.3 is now available on Qubole. Qubole’s multi-engine data lake fuses ease of use with cost-savings. Now powered by Spark 3.3, it’s faster and more scalable than ever.

Apache Spark is an open-source cluster computing framework for fast and flexible large-scale data analysis. UC Berkeley’s AMPLab developed Spark in 2009 and open-sourced it in 2010. Since then, it has grown into one of the largest open-source communities in big data, with over 200 contributors from more than 50 organizations. Because enterprises are adopting Spark quickly, Qubole is delivering Apache Spark as a Service to make the framework easy and fast to deploy. Before we dive into the details of our new service, we’d like to explain how Apache Spark projects work and where they add the most business value.

Try Qubole for Spark

What is Spark in Big Data?

Spark commonly runs on top of the Hadoop Distributed File System (HDFS), but rather than using Hadoop MapReduce, it relies on its own parallel data processing framework. That framework starts by placing data in Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets calculations run on large Spark clusters in a fault-tolerant manner. Because data is persisted in memory (and on disk when needed), Apache Spark can be significantly faster and more flexible than Hadoop MapReduce for the kinds of applications described below. Spark adds flexibility to this speed by offering APIs that let developers write queries in Java, Python, or Scala.
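To make the RDD idea more concrete, here is a minimal plain-Python sketch of two properties described above: partitions kept in memory, and lost partitions rebuilt from a recorded recipe (lineage). The `ToyRDD` class and all of its methods are hypothetical names invented for this illustration. This is not Spark's API or implementation, and everything runs in a single process.

```python
from typing import Callable, Dict, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus a recipe to rebuild them."""

    def __init__(self, num_partitions: int, compute: Callable[[int], list], name: str):
        self.num_partitions = num_partitions
        self._compute = compute            # partition index -> list of records
        self.name = name
        self._cached = False
        self._memory: Dict[int, list] = {}
        self.computations = 0              # how many times a partition was (re)built

    @classmethod
    def from_list(cls, data: List, num_partitions: int) -> "ToyRDD":
        # Split the source data into one slice per partition.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]), "source")

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation only records how to derive the child from its parent.
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
            f"map({self.name})",
        )

    def cache(self) -> "ToyRDD":
        # Ask for computed partitions to be kept in memory.
        self._cached = True
        return self

    def partition(self, i: int) -> list:
        if self._cached and i in self._memory:
            return self._memory[i]         # served from memory
        self.computations += 1
        records = self._compute(i)         # built (or rebuilt) from lineage
        if self._cached:
            self._memory[i] = records
        return records

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        self._memory.pop(i, None)

    def collect(self) -> list:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


squares = ToyRDD.from_list(list(range(8)), num_partitions=4).map(lambda x: x * x).cache()
first = sorted(squares.collect())    # builds all four partitions
second = sorted(squares.collect())   # reuses the in-memory partitions
squares.lose_partition(2)            # simulated failure
third = sorted(squares.collect())    # rebuilds only the lost partition
print(first, squares.computations)
```

The second `collect()` does no recomputation, and the simulated failure costs one partition rebuild rather than a full rerun. That is the intuition behind calling the abstraction both in-memory and fault-tolerant.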

Runs on your choice of popular public Cloud infrastructure

Leverages the platform’s AIR (Alerts, Insights, Recommendations) capabilities to help data teams focus on the outcome, instead of the platform

Cloud Agent technology augments original Spark with a self-managing and self-optimizing platform:

Cloud-optimized for faster workload performance

Smarter object storage access: split computation, batching of writes, pre-fetching, multiple caching layers, and SSD caching

Easier to integrate with existing data sets and tools

  • ODBC/JDBC drivers
  • Database connectors (MySQL, SQL Server, Oracle DB, RDS, Redshift, Kinesis and many others)
  • Comprehensive dictionary of REST APIs for application integration

Built-in Notebooks-as-a-Service support

  • Spark History Server allows easy debugging even after the cluster is shut down
  • Spark Job Server for caching data and reusing Spark applications
  • Easy configuration of Interpreters
  • Easy sharing of Notebooks among users, with access rights control
  • GitHub integration

Extensive range of libraries supported

  • MLlib (machine learning)
  • Spark AMI with deep learning libraries
  • GraphX (graph processing)
  • Spark SQL
  • Spark Streaming
  • SparkR (for using Spark from R)

Best-in-class security

  • HDFS and SSL encryption
  • SAML Authentication
  • VPC support
  • Dual IAM roles

Supported clouds: AWS, Microsoft Azure, and Oracle Cloud

What is Spark good at?

Spark is good at most applications that require fast processing: iterative processing, interactive processing, streaming, graph processing, and batch computations. It is also good at unifying these historically distinct workloads.

Iterative Algorithms and Interactive Data Mining

Keeping data in memory for better access time can improve performance for iterative algorithms and data mining by an order of magnitude.  Common examples include:

1. Real-Time Queries – Spark can run low-latency queries against data in Hive, HDFS, HBase, and Amazon S3.

2. Event Stream Processing – Alerting, aggregation, and analytics for event-intensive applications such as algorithmic trading, fraud detection, process monitoring, location-based services, sensor data, social media, and log and click stream processing.

3. Iterative Algorithms – Spark is ideal for speeding up repetitive processing required by iterative algorithms such as clustering and classification.

4. Complex Operations – Spark supports operators such as joins, group-by, or reduce-by operations for quickly modeling and executing complex data flows.

5. Machine Learning – Built on top of Spark, MLlib is a scalable machine learning library that supplements Spark’s processing speed with high-quality algorithms.

6. Big Data Graphs – Spark also includes a distributed graph processing system called GraphX. Social networks, targeted advertising, and geo-location are just a few of the many applications that need big graphs. These can be computation-intensive without the power of Spark.

7. Faster Batch Processing – A batch job takes a large dataset as input all at once, processes it, and writes a large output. While Hadoop MapReduce handles batch processing, Spark can process batch jobs even faster. By reducing the number of reads and writes to disk, Spark can execute batch-processing jobs 10 to 100 times faster than the Hadoop MapReduce engine.

8. Unified Big Data Analytics – Once in memory, data can be shared among iterative processing, interactive processing, streaming, graph processing, and batch computations. This unification opens up new Big Data applications that bridge previously separate workloads, such as real-time and historical data analytics, or top-down and bottom-up data exploration. Big data graphs are also great complements to machine learning and data mining applications. Unified Big Data Analytics has the added advantage of reducing the need to build, manage, and maintain separate processing systems for different computational needs.
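The join, group-by, and reduce-by operators mentioned in item 4 can be illustrated with a small plain-Python sketch of their semantics. The helper functions and the sample click data below are hypothetical and run in a single process. They are not Spark code, although Spark's RDD API offers similarly named operations (`join`, `groupBy`, `reduceByKey`) that apply the same ideas across a cluster.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn (reduce-by semantics)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}


def group_by(records, key_fn):
    """Group records under the key computed by key_fn (group-by semantics)."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


def join(left, right):
    """Inner join of two lists of (key, value) pairs on their keys."""
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]


# Hypothetical sample data: page clicks per user and each user's country.
clicks = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]
countries = [("alice", "US"), ("bob", "DE")]

# A small data flow: join clicks with countries, then count clicks per country.
joined = join(clicks, countries)
clicks_per_country = reduce_by_key(
    [(country, 1) for _user, (_page, country) in joined], add
)
pages_by_user = group_by(clicks, key_fn=lambda click: click[0])
print(clicks_per_country)
print(pages_by_user)
```

Chaining a few such operators is enough to express a multi-step data flow. A cluster engine adds distribution and in-memory reuse on top of the same semantics.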

Why Apache Spark as a Service From Qubole?

Because Apache Spark projects bring so much value to Big Data analytics, Qubole’s goal is to deliver the power of Spark to both technical and business Hadoop users. Qubole is offering Spark as a Service to help organizations run Spark on AWS. With this service, we have integrated Spark into our Qubole Data Service (QDS) platform, allowing users to launch and provision Spark clusters and start running queries in minutes. Spark as a Service makes it easy to process and query data stored in Hive, HDFS, HBase, and Amazon S3.

Importantly, in the future, any number of data sources will be accessible, and their data can easily be combined with Spark. For example, various SQL stores, NoSQL stores, and data sinks can be accessed from one interface, and their data can be combined and loaded into any of them (the loading capability is in development at the moment). QDS’ query editor and visual query builder give developers and data scientists an easy way to access Spark data with no specialized coding skills.

Spark as a Service also offers lower cost and ease of use. It reduces the cloud-compute cost of running Spark on AWS by using self-service auto-scaling, which scales capacity up and down as needed without manual reconfiguration of resources.

If you have any questions about our future offering, please contact us.
