What is Apache Spark?
Apache Spark is an open source cluster computing framework for fast and flexible large-scale data analysis. UC Berkeley’s AMPLab developed Spark in 2009 and open sourced it in 2010. Since then, it has grown into one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. Because enterprises are adopting Spark quickly, Qubole is delivering Spark as a Service to make the framework easy and fast to deploy. Before we dive into the details of our new service, we’d like to start by explaining how Spark works and where it adds the most business value.
How Does Spark Work?
Spark commonly runs on top of the Hadoop Distributed File System (HDFS), but rather than using Hadoop MapReduce, it relies on its own parallel data processing framework. Spark starts by placing data in Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets calculations run on large clusters in a fault-tolerant manner. Because data is persisted in memory (and on disk if needed), Spark can be significantly faster and more flexible than Hadoop MapReduce for the applications described below. Spark pairs this speed with flexibility by offering APIs that allow developers to write queries in Java, Python or Scala.
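To make the RDD idea more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the `ToyRDD` class and its `lose_partition` method are hypothetical names invented for illustration. It shows the two properties described above: partitions are derived lazily from a parent dataset, and a cached partition that is lost can be rebuilt by replaying the recorded transformations.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # list of partitions for a root dataset
        self._parent = parent    # lineage: the dataset this one derives from
        self._fn = fn            # per-partition function applied to the parent
        self._persist = False
        self._cache = {}         # partition index -> materialized rows
        self.computations = 0    # how many partitions were actually computed

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Spread the rows round-robin across partitions.
        parts = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(source=parts)

    def num_partitions(self):
        if self._source is not None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, f):
        # Nothing runs here; we only record how to derive each partition.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def cache(self):
        self._persist = True
        return self

    def compute(self, index):
        if index in self._cache:
            return self._cache[index]
        if self._source is not None:
            result = list(self._source[index])
        else:
            # Replay the lineage: compute the parent partition, then apply fn.
            result = self._fn(self._parent.compute(index))
        self.computations += 1
        if self._persist:
            self._cache[index] = result
        return result

    def collect(self):
        rows = []
        for i in range(self.num_partitions()):
            rows.extend(self.compute(i))
        return rows

    def lose_partition(self, index):
        # Simulate a failed node dropping its in-memory copy.
        self._cache.pop(index, None)


numbers = ToyRDD.parallelize(list(range(10)), num_partitions=2)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()

first = even_squares.collect()      # computes both partitions
second = even_squares.collect()     # served from the in-memory cache
even_squares.lose_partition(0)      # one cached partition disappears
recovered = even_squares.collect()  # only partition 0 is recomputed
```

In this sketch only the lost partition is recomputed after the simulated failure, which is the intuition behind recovering from lineage rather than from replicated copies of the data.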
What is Spark Good At?
Spark is good at most applications that require fast processing: iterative processing, interactive processing, streaming, graph processing and batch computations, as well as unifying these historically distinct workloads.
Iterative Algorithms and Interactive Data Mining
Keeping data in-memory for better access time can improve performance for iterative algorithms and data mining by an order of magnitude. Common examples include:
1. Real-Time Queries – Spark’s super-fast queries can be executed against data in Hive, HDFS, HBase and Amazon S3.
2. Event Stream Processing – Alerting, aggregation and analytics for event-intensive applications such as algorithmic trading, fraud detection, process monitoring, location-based services, sensor data, social media, and log and click stream processing.
3. Iterative Algorithms – Spark is ideal for speeding up repetitive processing required by iterative algorithms such as clustering and classification.
4. Complex Operations – Spark supports operators such as join, group-by and reduce-by for quickly modeling and executing complex data flows.
5. Machine Learning – Built on top of Spark, MLlib is a scalable machine learning library that supplements Spark’s processing speed with high-quality algorithms.
6. Big Data Graphs – Spark also includes a distributed graph system called GraphX. Social networks, targeted advertising, and geo-location are just a few of the many applications that need big graphs. Without the power of Spark, these can be computationally intensive.
7. Faster Batch Processing – Batch processing takes a large dataset as input all at once, processes it, and writes a large output. While Hadoop MapReduce handles batch processing, Spark can process batch jobs even faster. By reducing the number of reads and writes to disk, Spark can execute some batch-processing jobs 10 to 100 times faster than the Hadoop MapReduce engine.
8. Unified Big Data Analytics – Once in memory, data can be shared among iterative processing, interactive processing, streaming, graph and batch computations. This unification presents a range of exciting possibilities for new and innovative Big Data applications that bridge previously separate workloads, such as real-time and historical data analytics, and top-down and bottom-up data exploration. Big data graphs are also great complements to machine learning and data mining applications. Unified Big Data Analytics also has the added advantage of reducing the need to build, manage and maintain separate processing systems for different computational needs.
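The benefit of keeping data in memory for iterative algorithms can be illustrated with a small plain-Python sketch. It does not use Spark: the `CountingSource` class is a hypothetical stand-in for a slow store such as disk, and `kmeans_1d` is a deliberately simple one-dimensional clustering loop. The point is only to count how often the input is read when every pass reloads it, compared with loading it once and reusing the in-memory copy.

```python
from statistics import mean


class CountingSource:
    """Hypothetical data source that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def kmeans_1d(get_points, centers, iterations):
    """Run a fixed number of 1-D k-means steps.

    get_points() is called once per iteration, mimicking an iterative
    job that needs the full dataset on every pass.
    """
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in get_points():
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Every iteration goes back to the source for its input.
reload_source = CountingSource(data)
centers_reload = kmeans_1d(reload_source.load, [0.0, 5.0], iterations=5)

# The input is loaded once and the in-memory copy is reused on every pass.
cached_source = CountingSource(data)
in_memory = cached_source.load()
centers_cached = kmeans_1d(lambda: in_memory, [0.0, 5.0], iterations=5)

print(centers_reload, reload_source.reads)   # [2.0, 11.0] 5
print(centers_cached, cached_source.reads)   # [2.0, 11.0] 1
```

Both runs reach the same cluster centers, but the cached run reads the source once instead of once per iteration. On a real cluster the saving depends on data size, available memory and the storage layer, so treat the numbers here as an illustration of the access pattern rather than a performance measurement.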
Why Spark as a Service From Qubole?
Because we understand the value of Spark in Big Data analytics, our goal at Qubole is to deliver the power of Spark to both technical and business Hadoop users. Qubole is offering Spark as a Service to help organizations run Spark on AWS. With this service, we have integrated Spark into our Qubole Data Service (QDS) platform, allowing users to launch and provision Spark clusters and start running queries in minutes. Spark as a Service makes it easy to process and query data stored in Hive, HDFS, HBase and Amazon S3.
Importantly, in the future, any number of data sources can be accessed and their data easily combined with Spark. For example, various SQL and NoSQL stores and other data sinks can be accessed from one interface, and their data can be combined and loaded into any of them (the latter is in development at the moment). The QDS query editor and visual query builder give developers and data scientists an easy way to access Spark data with no specialized coding skills.
Spark as a Service also offers lower cost and ease of use. It reduces the cloud-compute cost of running Spark on AWS using self-service auto-scaling to scale capacity up and down as needed without having to manually reconfigure resources. Qubole’s Spark as a Service is currently in development.
If you have any questions about our future offering, please contact us.