Jupyter™ notebooks are among the most popular IDEs for Python users. Traditionally, most Jupyter users work with small or sampled datasets that do not require distributed computing. However, as data volumes grow and enterprises move toward a unified data lake (for example, Amazon S3 or Azure Blob Storage), it has become essential to power analyses with parallel computing frameworks such as Spark and to access that data directly from the IDE.
In this post, we will see how Jupyter users can leverage the sparkmagic package to connect to a Qubole Spark cluster running the Livy server. With this integration, Jupyter users can continue to use their IDE of choice while achieving distributed computing with Spark on large datasets that reside in their enterprise data lake.
First, let’s look at the sparkmagic package and the Livy server, and the installation procedure that makes this integration possible.
What is Livy?
Livy is an open source RESTful service for Apache Spark. It enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps, with no Spark client needed, so multiple users can interact with the Spark cluster concurrently and reliably. For more information, visit https://livy.io.
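To illustrate what Livy does, here is a minimal sketch of its REST API using the Python requests library; the localhost URL, session kind, and submitted code are assumptions for illustration only.

# Minimal sketch of Livy's REST API: create a PySpark session, run one statement, clean up.
import time
import requests

livy_url = "http://localhost:8998"   # illustrative; Livy's default port

# 1. Create an interactive PySpark session.
session = requests.post(livy_url + "/sessions", json={"kind": "pyspark"}).json()
session_url = "{}/sessions/{}".format(livy_url, session["id"])

# 2. Wait until the session is idle, then submit a statement.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)
statement = requests.post(session_url + "/statements",
                          json={"code": "sc.parallelize(range(100)).sum()"}).json()

# 3. Poll for the result, print it, and delete the session.
statement_url = "{}/statements/{}".format(session_url, statement["id"])
while requests.get(statement_url).json()["state"] != "available":
    time.sleep(2)
print(requests.get(statement_url).json()["output"])
requests.delete(session_url)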
What is Sparkmagic?
Sparkmagic is a set of tools that enables Jupyter notebooks to interactively communicate with remote Spark clusters that are running Livy. The Sparkmagic project includes a set of `magics` for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.
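To make this concrete, here is a hedged sketch of how notebook cells behave once one of the sparkmagic kernels (installed below) is selected; the DataFrame code is illustrative, and the spark session object is created for you by Livy on the cluster.

# Cell 1 (PySpark kernel): plain cells are shipped through Livy and run on the remote Spark cluster.
df = spark.range(0, 1000)
print(df.count())

%%local
# Cell 2: the %%local magic runs this cell in your laptop's Python process instead,
# which is handy for inspecting or plotting small results pulled back from the cluster.
print("running locally")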
How it all works together
Installing Livy
The following bootstrap script installs and runs Livy on the master node of a Qubole Spark cluster.
#!/bin/bash
# Qubole node bootstrap: install and start Livy on the cluster master only.
source /usr/lib/hustler/bin/qubole-bash-lib.sh
is_master=`nodeinfo is_master`
if [[ "$is_master" == "1" ]]; then
    # Download and unpack the Livy server on the ephemeral drive
    cd /media/ephemeral0
    mkdir livy
    cd livy
    wget https://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip
    unzip livy-server-0.3.0.zip

    # Point Livy at the cluster's Spark and Hadoop installations
    export SPARK_HOME=/usr/lib/spark
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    cd livy-server-0.3.0
    mkdir logs

    # Run Spark on YARN in client mode with the Hive context enabled
    echo livy.spark.master=yarn >> conf/livy.conf
    echo livy.spark.deployMode=client >> conf/livy.conf
    echo livy.repl.enableHiveContext=true >> conf/livy.conf

    # Start the Livy server in the background so it keeps running after the bootstrap exits
    nohup ./bin/livy-server > ./logs/livy.out 2> ./logs/livy.err < /dev/null &
fi
Installing Sparkmagic
Jupyter Notebook, along with sparkmagic, will reside on your local computer. We assume that Jupyter Notebook is already installed; if not, it is easy to install by following the instructions on the Jupyter website.
Once Jupyter Notebook is installed, run the following from the command line:
pip install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
pip show sparkmagic
Change directory to the sparkmagic `Location` reported by pip show sparkmagic.
Choose and install kernels:
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
Finally, run this command:
jupyter serverextension enable --py sparkmagic
Configuring sparkmagic
Check whether the ~/.sparkmagic folder exists and contains config.json. If it doesn't, create the folder and copy the config.json shown below into ~/.sparkmagic.
Open config.json in ~/.sparkmagic. Update the url in the kernel_python_credentials and kernel_scala_credentials sections to reflect the cluster ID that you want to use. Update X-AUTH-TOKEN with the API token that you can find on your Qubole My Accounts page. You can find more information on the API token at https://docs.qubole.com/en/latest/rest-api/api_overview.html. Review the other fields.
Note: To access the API token, navigate to the Control Panel in the UI at https://api.qubole.com/v2/control-panel and click the My Accounts tab in the left pane. Click Show for the account and copy the API token that is displayed.
Config.json
{ "kernel_python_credentials" : { "username": "", "password": "", "url": "https://api.qubole.com/livy-spark-<cluster_id>" }, "kernel_scala_credentials" : { "username": "", "password": "", "url": "https://api.qubole.com/livy-spark-<cluster_id>" }, "logging_config": { "version": 1, "formatters": { "magicsFormatter": { "format": "%(asctime)s\t%(levelname)s\t%(message)s", "datefmt": "" } }, "handlers": { "magicsHandler": { "class": "hdijupyterutils.filehandler.MagicsFileHandler", "formatter": "magicsFormatter", "home_path": "~/.sparkmagic" } }, "loggers": { "magicsLogger": { "handlers": ["magicsHandler"], "level": "DEBUG", "propagate": 0 } } }, "wait_for_idle_timeout_seconds": 15, "status_sleep_seconds": 2, "statement_sleep_seconds": 2, "livy_session_startup_timeout_seconds": 180, "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.", "ignore_ssl_errors": false, "session_configs": { "driverMemory": "1000M", "executorCores": 2 }, "use_auto_viz": true, "max_results_sql": 2500, "pyspark_dataframe_encoding": "utf-8", "heartbeat_refresh_seconds": 5, "livy_server_heartbeat_timeout_seconds": 30, "heartbeat_retry_seconds": 2, "server_extension_default_kernel_name": "pysparkkernel", "custom_headers": { "X-AUTH-TOKEN":"<API-Token>" } }
Connecting to Qubole Spark Cluster with Authentication
Start the cluster if it’s not up yet. Launch Jupyter Notebook from your OS or Anaconda menu, or by running “jupyter notebook” from the command line. Jupyter will open in your default web browser. Choose New, and then Spark or PySpark. The notebook will connect to the Spark cluster to execute your commands, and it will start a Spark application with your first command.
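Here is a hedged example of a first cell in a new PySpark notebook; the S3 path and column name are placeholders for data in your own data lake.

# This first cell starts the Spark application on the Qubole cluster and reads
# data directly from the enterprise data lake (placeholder S3 path).
df = spark.read.parquet("s3://your-bucket/path/events/")
df.printSchema()
df.groupBy("event_type").count().show()   # "event_type" is a placeholder column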
Configuring Spark Session
You can specify the Spark session configuration in the session_configs section of config.json, or in the notebook by adding %%configure as the very first cell. Here is an example:
%%configure -f
{
  "driverMemory": "1000M",
  "executorCores": 2,
  "numExecutors": 2,
  "executorMemory": "4g",
  "conf": {
    "spark.qubole.max.executors": 20
  }
}
With Jupyter, sparkmagic, and Livy, data scientists and other Python users can keep their existing Python code while benefiting from the Qubole Spark cluster’s capabilities and analyzing huge amounts of data stored in the cloud.
Qubole makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets. If you have any questions about using Jupyter and Spark, or would like to share your use cases, please leave us your feedback at [email protected]