A 5-Minute Guide to Apache Spark
- By Jonathan Buckley
- October 14, 2015
When it comes to big data tools, more than a few have peculiar names. You’ve got Hadoop, Hive, MongoDB, Pig, Presto, and the list of quirky words goes on. And then there’s Apache Spark, which sounds a lot like the name of a ’60s rock band. In reality, Spark is a formidable big data processing engine that’s gaining rock star status in the big data world. Major big data players are among its biggest fans. Case in point: IBM is pouring massive resources into Apache Spark project development through its newly created Spark Technology Center in San Francisco.
If the predictions of industry experts are correct, Apache Spark is on the verge of revolutionizing big data analytics. So if you’re in the dark as to what Apache Spark is and what it does, here’s a 5-minute guide to shed some light on this powerful big data tool.
What Spark Is
Simply stated, Spark is a scalable, open source processing engine designed for fast and flexible analysis of large datasets. Developed in 2009 at UC Berkeley’s AMPLab, Spark was open sourced in March 2010 and donated to the Apache Software Foundation in 2013, where it became a top-level project in 2014.
Why Spark Is Special
What sets Spark apart from other tools in the Hadoop herd is its ability to handle both batch and streaming workloads at lightning-fast speeds.
Back in 2009, AMPLab researchers recognized that Hadoop MapReduce was too slow and inefficient to keep up with new and rapidly growing data needs, such as real-time data analysis. MapReduce writes intermediate results to disk between steps, which penalizes jobs that revisit the same data many times. Spark was purpose-built to handle these iterative and interactive computing jobs by keeping working data in memory.
By the Spark project’s own figures, Spark can run programs up to 100 times faster than MapReduce on Hadoop 2.0 when the data fits in memory, and up to 10 times faster for complex applications running on disk.
Spark’s Big Advantage
Along with speed, Spark’s capabilities add up to one big advantage for big data analysts:
Ease of Use: MapReduce is demanding to program, because every job has to be expressed as map and reduce steps. In contrast, Spark allows users to quickly write applications in Java, Scala, or Python and build parallel applications that take full advantage of Hadoop’s distributed environment. In addition, Spark supports SQL queries, streaming data, and complex capabilities like iterative machine learning algorithms and interactive analytics right “out of the box.” Spark also allows users to combine these processing types seamlessly in a single workflow. Finally, Spark is compatible with the Hadoop Distributed File System (HDFS), HBase, and other Hadoop-supported storage systems. That’s a critical advantage for organizations, as it makes virtually all of their existing Hadoop data usable in Spark without migration.
Spark’s Key Components
Spark Core: The general execution engine of the Spark platform, Spark Core handles functions such as task scheduling, memory management, and fault recovery. The application programming interface (API) that defines Resilient Distributed Datasets (RDDs) also resides in Spark Core. An RDD can be thought of as a collection of items distributed across many compute nodes that operate on it in parallel. Thanks to RDDs, Spark is able to draw on Hadoop clusters for stored data and process that data in memory, which is fast enough to let analysts explore data interactively.
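To make the RDD idea concrete, here is a minimal single-process sketch in plain Python. It is not Spark’s API: the ToyRDD class is hypothetical, and its method names are only chosen to echo common RDD operations. It illustrates three ideas, under those assumptions: data split into partitions, transformations that are recorded rather than run immediately, and the ability to rebuild any one partition by replaying the recorded steps.

```python
import functools


class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions    # source data, split into chunks
        self._lineage = tuple(lineage)   # recorded transformations, not yet run

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        # Deal the items round-robin into num_partitions chunks.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Lazy: only record the step; nothing is computed here.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def compute_partition(self, index):
        # Replay the recorded steps on one partition. A lost partition could
        # be rebuilt the same way from its source chunk.
        data = self._partitions[index]
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):
        # An action: triggers computation of every partition.
        result = []
        for index in range(len(self._partitions)):
            result.extend(self.compute_partition(index))
        return result

    def reduce(self, fn):
        return functools.reduce(fn, self.collect())


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = even_squares.reduce(lambda a, b: a + b)
print(total)
```

In a real cluster each partition would live on a different node and be processed in parallel; the loop in collect stands in for that.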
Spark SQL: Big data consists of structured and unstructured data, each of which is queried differently. Spark SQL provides a SQL interface to Spark that allows developers to combine SQL queries over structured data with the programmatic manipulation of unstructured data supported by RDDs, all within a single application. This ability to mix SQL with complex analytics makes Spark SQL a powerful open source tool for the data warehouse.
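The pattern of mixing SQL with ordinary code in one program can be sketched with Python’s built-in sqlite3 module. This is only an analogy and does not use Spark SQL: the reviews table, its rows, and the mentions helper are all made up for illustration. SQL filters the structured columns, then regular code handles the free-text column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product TEXT, stars INTEGER, comment TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("kettle", 5, "Boils fast, great value"),
        ("kettle", 2, "Lid broke after a week"),
        ("toaster", 4, "Great toast, slow timer"),
    ],
)

# Declarative step: SQL selects and filters on the structured columns.
rows = conn.execute(
    "SELECT product, comment FROM reviews WHERE stars >= 4 ORDER BY product"
).fetchall()


def mentions(comment, word):
    """Programmatic step: a simple word check on the free-text column."""
    return word in comment.lower().split()


great_products = [product for product, comment in rows if mentions(comment, "great")]
conn.close()
print(great_products)
```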
Spark Streaming: This Spark component enables analysts to process live streams of data, such as log files generated by production web servers, live video, and stock market feeds. By providing an API for manipulating data streams that closely matches Spark Core’s RDD API, Spark Streaming makes it easy for programmers to move between applications that process data stored in memory, on disk, or as it arrives in real time.
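One way to picture why a shared API matters: if a stream is handled as a series of small batches, the same function can serve both stored and live data. The sketch below is plain Python, not Spark Streaming; the log format and the names count_status_codes and micro_batches are assumptions made for illustration.

```python
from collections import Counter
from itertools import islice


def count_status_codes(log_lines):
    """Batch logic: count HTTP status codes in a collection of log lines."""
    return Counter(line.split()[-1] for line in log_lines)


def micro_batches(stream, batch_size):
    """Group an iterator of records into small lists (micro-batches)."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical log lines in the form "<method> <path> <status>".
logs = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
    "GET /admin 500",
]

# Batch: process the stored data set in one pass.
batch_result = count_status_codes(logs)

# Streaming: the same function runs on each micro-batch as it arrives,
# and a running total is kept across batches.
running_total = Counter()
for batch in micro_batches(iter(logs), batch_size=2):
    running_total.update(count_status_codes(batch))

print(batch_result == running_total)
```

Here iter(logs) stands in for a live source; in practice the records would arrive over time rather than from a list.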
MLlib: Spark comes with an integrated framework for performing advanced analytics. Among the components found in this framework is Spark’s scalable Machine Learning Library (MLlib). MLlib provides common machine learning (ML) functionality, including algorithms for classification, regression, clustering, and collaborative filtering, as well as tools for model evaluation. Spark and MLlib can run on an existing Hadoop 2.0 cluster without any pre-installation.
GraphX: Also found in Spark’s integrated framework is GraphX, a library of common graph algorithms and operators for manipulating graphs and performing graph-parallel computations. Extending the Spark RDD API, GraphX allows users to create directed graphs with arbitrary properties attached to each vertex and edge. GraphX is best suited to analytics on static graphs, such as Facebook’s friend graph, where it can help uncover patterns in social network connections.
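A directed graph with properties on every vertex and edge can be modeled in a few lines of plain Python. The sketch below is not GraphX; the sample people, the relation property, and the helper functions are hypothetical. It shows the data shape and two simple queries of the kind a graph library would run in parallel across a cluster.

```python
from collections import defaultdict

# Vertices: id -> arbitrary properties (hypothetical sample data).
vertices = {
    1: {"name": "Ana", "age": 31},
    2: {"name": "Ben", "age": 27},
    3: {"name": "Cy", "age": 45},
}

# Directed edges: (source id, destination id, properties).
edges = [
    (1, 2, {"relation": "follows"}),
    (3, 2, {"relation": "follows"}),
    (2, 1, {"relation": "friend"}),
]


def in_degrees(edge_list):
    """Count incoming edges per vertex id."""
    counts = defaultdict(int)
    for _src, dst, _props in edge_list:
        counts[dst] += 1
    return dict(counts)


def triplets(vertex_map, edge_list):
    """Join each edge with the properties of both of its endpoints."""
    return [(vertex_map[s], props, vertex_map[d]) for s, d, props in edge_list]


degrees = in_degrees(edges)
most_connected = max(degrees, key=degrees.get)

followers_of_ben = [
    src["name"]
    for src, props, dst in triplets(vertices, edges)
    if dst["name"] == "Ben" and props["relation"] == "follows"
]
print(vertices[most_connected]["name"], followers_of_ben)
```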
Apache Spark has one of the largest open source communities in big data. With the flexibility and scalability to deliver real-time processing, plus the ability to constantly evolve through open source contributions, Apache Spark is on its way to achieving rock star status as a premier big data tool.