Share RDDs Across Jobs with Qubole’s Spark Job Server
When we launched our Spark as a Service offering in February, we designed it to run production workloads. Users would write standalone Spark applications and run them via our UI or API. We then enhanced the offering by adding support for running these standalone Spark applications on a schedule using our scheduler or as part of a complex workflow involving Hive queries and Hadoop jobs. We also launched Spark Notebooks to enable interactive data exploration – something at which Spark really excels.
Introducing the Spark Job Server
Today, we are announcing the launch of the Spark Job Server API. The Job Server lets you share Spark RDDs (Resilient Distributed Datasets) among multiple jobs within a single Spark application. This enables use cases where you spin up a Spark application, run a job to load the RDDs, then use those RDDs for low-latency data access across multiple query jobs. For example, you can cache multiple data tables in memory, then run Spark SQL queries against those cached datasets for interactive ad-hoc analysis.
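The load-once, query-many pattern described above can be sketched without a cluster. The following is a toy analogy in plain Python, not the Job Server API: an in-memory SQLite connection stands in for the long-running Spark application, and the `ToyJobServer` class, its `submit` method, and the `orders` table are hypothetical names made up for this illustration.

```python
import sqlite3


class ToyJobServer:
    """Hypothetical stand-in for one long-running application.

    A single in-memory SQLite connection plays the role of the shared
    Spark context: data loaded by one job stays available to later jobs.
    """

    def __init__(self):
        # Created once and reused by every job submitted to this server.
        self._conn = sqlite3.connect(":memory:")

    def submit(self, sql, rows=None):
        """Run one job: a single SQL statement, optionally with bulk rows."""
        if rows is not None:
            self._conn.executemany(sql, rows)
            self._conn.commit()
            return []
        return self._conn.execute(sql).fetchall()


server = ToyJobServer()

# Job 1: the "load" job populates the shared in-memory table once.
server.submit("CREATE TABLE orders (region TEXT, amount INTEGER)")
server.submit(
    "INSERT INTO orders VALUES (?, ?)",
    rows=[("east", 10), ("west", 25), ("east", 5)],
)

# Jobs 2 and 3: independent query jobs reuse the data without reloading it.
totals = server.submit(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
)
largest = server.submit("SELECT MAX(amount) FROM orders")

print(totals)
print(largest)
```

In a real deployment the shared state would be cached RDDs or tables inside a Spark application rather than SQLite tables, but the shape is the same: the loading cost is paid by one job, and every later job reads the data from memory.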
You can also use the Job Server to reduce the end-to-end latency of small, unrelated Spark jobs. In our tests, the Job Server brought the end-to-end latency of very small Spark jobs down from more than a minute to single-digit seconds. The main reason for this improvement is that, with the Job Server, a Spark application is already running when you submit your SQL query or Scala/Python snippet. Without the Job Server, each SQL query or Scala/Python snippet submitted to our API starts its own application and pays that startup overhead, because that API was designed to run standalone applications.
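The latency argument is simple arithmetic, and a rough model makes it concrete. The sketch below assumes every job pays a fixed application startup cost unless it runs inside an application that is already up. The function name and the numbers (60 seconds of startup, 5 seconds of work) are illustrative assumptions chosen to match the order of magnitude above, not measurements.

```python
def total_latency(num_jobs, startup_s, run_s, shared_app):
    """Total seconds to run num_jobs small jobs one after another.

    shared_app=True models one long-running application: startup is paid
    once. shared_app=False models a standalone application per job.
    """
    if num_jobs == 0:
        return 0
    if shared_app:
        return startup_s + num_jobs * run_s
    return num_jobs * (startup_s + run_s)


# Illustrative values only (assumed, not measured).
STARTUP_S = 60
RUN_S = 5
NUM_JOBS = 10

standalone_total = total_latency(NUM_JOBS, STARTUP_S, RUN_S, shared_app=False)
shared_total = total_latency(NUM_JOBS, STARTUP_S, RUN_S, shared_app=True)

print(standalone_total)  # 10 * (60 + 5) = 650
print(shared_total)      # 60 + 10 * 5 = 110
```

Under these assumptions, each job after the first sees only its own run time, which is why the saving is largest for very small jobs where startup dominates.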
There are multiple open source implementations of Job Server-like functionality for Spark, including spark-jobserver, which was open sourced by Ooyala, and the Livy REST Spark Server from the Hue team at Cloudera.
At Qubole, we decided to use Apache Zeppelin as the backend for our Job Server API. Apache Zeppelin already powers our Spark Notebooks, and we have built in-house operational expertise in managing it. It also already had most of the features we needed for the Job Server: the ability to start a long-running application, to submit snippets in multiple languages to that application, and to monitor the status of those snippets. Using the same backend for both Spark Notebooks and the Spark Job Server means that you can even use the RDDs created by the Spark Job Server inside Notebooks.