Connecting Jupyter with Remote Qubole Spark Cluster on AWS, MS Azure, and Oracle BMC

Jupyter™ notebooks are one of the most popular IDEs among Python users. Traditionally, most Jupyter users work with small or sampled datasets that do not require distributed computing. However, as data volumes grow and enterprises move toward a unified data lake (for example, Amazon S3 or Azure Blob storage), accessing this data from their IDE and powering their analyses with parallel computing frameworks such as Spark has become essential.

In this post, we will see how Jupyter users can leverage the sparkmagic package to connect to a Qubole Spark cluster running the Livy server. With this integration, Jupyter users can continue to use their IDE of choice while achieving distributed computing through Spark on large datasets that reside in their enterprise data lake.

First, let’s look at the sparkmagic package and the Livy server, and the installation procedure that makes this integration possible.

What is Livy?

Livy is an open source RESTful service for Apache Spark. Livy enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web and mobile apps, with no Spark client needed. Multiple users can interact with the Spark cluster concurrently and reliably. For more information, visit http://livy.io.
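To make this concrete, here is a hedged sketch of the two REST calls at the heart of Livy's interactive API: creating a session and submitting a statement. The localhost URL and session ID are illustrative assumptions (Livy listens on port 8998 by default; on Qubole the endpoint is proxied through the cluster URL), and the requests are only constructed here, not sent:

```python
import json
from urllib.request import Request

# Assumed local Livy endpoint for illustration; on a Qubole cluster the
# endpoint is instead https://api.qubole.com/livy-spark-<cluster_id>.
LIVY_URL = "http://localhost:8998"

def create_session_request(kind: str = "pyspark") -> Request:
    """Build (but do not send) the POST that starts an interactive session."""
    body = json.dumps({"kind": kind}).encode()
    return Request(f"{LIVY_URL}/sessions", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

def submit_statement_request(session_id: int, code: str) -> Request:
    """Build the POST that runs a code snippet in an existing session."""
    body = json.dumps({"code": code}).encode()
    return Request(f"{LIVY_URL}/sessions/{session_id}/statements", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

req = submit_statement_request(0, "sc.parallelize(range(100)).count()")
print(req.full_url)
```

Sparkmagic issues equivalent requests on your behalf, so you never have to write them yourself; this sketch only shows what travels over the wire.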

What is Sparkmagic?

Sparkmagic is a set of tools that enables Jupyter notebooks to interactively communicate with remote Spark clusters that are running Livy. The Sparkmagic project includes a set of `magics` for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

How it all works together

Installing Livy

The following bootstrap script installs and runs Livy on the master node of a Qubole Spark cluster.

wget http://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip
unzip livy-server-0.3.0.zip
export SPARK_HOME=/usr/lib/spark 
export HADOOP_CONF_DIR=/etc/hadoop/conf
cd livy-server-0.3.0
mkdir logs
nohup ./bin/livy-server > ./logs/livy.out 2> ./logs/livy.err < /dev/null &

Installing Sparkmagic

Jupyter Notebook, along with sparkmagic, will reside on your local computer.

We assume that Jupyter Notebook is already installed. If not, it is easy to install via the Anaconda distribution from https://www.continuum.io/downloads.

Once Jupyter Notebook is installed, run the following from the command line:

pip install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
pip show sparkmagic

Change directory to the sparkmagic `Location` reported by `pip show sparkmagic`.
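If you script your setup, the `Location` line can be parsed out of the `pip show` output. This is a sketch against sample output; your actual path will differ:

```shell
# Sample `pip show sparkmagic` output -- your actual output will differ.
pip_show_output='Name: sparkmagic
Version: 0.12.0
Location: /usr/local/lib/python3.9/site-packages'

# In a live shell you would pipe the real command instead:
#   location=$(pip show sparkmagic | awk '/^Location:/ {print $2}')
location=$(printf '%s\n' "$pip_show_output" | awk '/^Location:/ {print $2}')
echo "$location"
```

You can then `cd "$location"` before running the kernel installation commands below.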

Choose and install kernels:

jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

Finally, run this command:

jupyter serverextension enable --py sparkmagic

Configuring sparkmagic

Check whether the ~/.sparkmagic folder exists and contains config.json. If it doesn't, create the folder. Copy the config.json shown below into ~/.sparkmagic.

Open config.json from ~/.sparkmagic. Update the url field in the kernel_python_credentials and kernel_scala_credentials sections to reflect the ID of the cluster you want to use. Update X-AUTH-TOKEN with the API token that you can find on your Qubole My Accounts page. You can find more information on the API token at http://docs.qubole.com/en/latest/rest-api/api_overview.html. Review the other fields as well.

Note: To access the API token, navigate to the Control Panel in the UI at https://api.qubole.com/v2/control-panel and click the My Accounts tab in the left pane. Click Show for the account and copy the API token that is displayed.
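If you automate your setup, the same edits can be applied programmatically. The following is a hedged sketch: `patch_sparkmagic_config` is a hypothetical helper (not part of sparkmagic), and the cluster ID and token values are placeholders:

```python
import json

def patch_sparkmagic_config(config: dict, cluster_id: str, api_token: str) -> dict:
    """Hypothetical helper: point both kernel credential sections at the
    cluster's Livy endpoint and set the Qubole API token header."""
    livy_url = f"https://api.qubole.com/livy-spark-{cluster_id}"
    for section in ("kernel_python_credentials", "kernel_scala_credentials"):
        config[section]["url"] = livy_url
    config.setdefault("custom_headers", {})["X-AUTH-TOKEN"] = api_token
    return config

# Minimal skeleton of config.json; cluster ID and token are placeholders.
config = {
    "kernel_python_credentials": {"username": "", "password": "", "url": ""},
    "kernel_scala_credentials": {"username": "", "password": "", "url": ""},
}
patched = patch_sparkmagic_config(config, "1234", "my-api-token")
```

In practice you would load the full config.json below with `json.load`, patch it, and write it back to ~/.sparkmagic/config.json with `json.dump`.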

Config.json

{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "https://api.qubole.com/livy-spark-<cluster_id>"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "https://api.qubole.com/livy-spark-<cluster_id>"
  },
  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": {
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": {
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": {
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },
  "wait_for_idle_timeout_seconds": 15,
  "status_sleep_seconds": 2,
  "statement_sleep_seconds": 2,
  "livy_session_startup_timeout_seconds": 180,
  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
  "ignore_ssl_errors": false,
  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2
  },
  "use_auto_viz": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",
  "heartbeat_refresh_seconds": 5,
  "livy_server_heartbeat_timeout_seconds": 30,
  "heartbeat_retry_seconds": 2,
  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {
    "X-AUTH-TOKEN": "<API-Token>"
  }
}

Connecting to Qubole Spark Cluster with Authentication

Start the cluster if it is not up yet. Start Jupyter Notebook from your OS or Anaconda menu, or by running "jupyter notebook" from the command line. It will open Jupyter in your default web browser. Choose New, and then Spark or PySpark. The notebook will connect to the Spark cluster to execute your commands, and it will start a Spark application when you run your first command.

Configuring Spark Session

You can specify the Spark session configuration in the session_configs section of config.json, or in the notebook by adding %%configure as the very first cell. Here is an example:

%%configure -f
{
"driverMemory": "1000M",
"executorCores": 2,
"numExecutors" : 2,
"executorMemory" : "4g",
"conf" :
{ "spark.qubole.max.executors":20 }
}
With Jupyter, sparkmagic, and Livy, data scientists and other Python users can keep their existing Python code while benefiting from the capabilities of a Qubole Spark cluster to analyze huge amounts of data stored in the cloud.

Qubole makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets. If you have any questions about using Jupyter and Spark, or would like to share your use cases, please leave us your feedback at [email protected]
