Deep Learning on Qubole Using BigDL for Apache Spark – Part 1

July 27, 2017 (updated April 15, 2024)

BigDL runs natively on Apache Spark, which makes for a perfect deployment platform because Qubole offers a greatly enhanced and optimized Spark as a service.

In this Part 1 of a two-part series, you will learn how to get started with the distributed Deep Learning library BigDL on Qubole. By the end, you will have BigDL installed on a Spark cluster, with the distributed Deep Learning library readily available for use in your Deep Learning applications running on Qubole.

In Part 2, you will learn how to write a Deep Learning application on Qubole that uses BigDL to identify handwritten digits (0 to 9) using a LeNet-5 (Convolutional Neural Networks) model that you will train and validate using the MNIST database.

Before we get started, here is some introduction and background on the technologies involved.

What is Deep Learning?

Deep learning is a form of Machine Learning that uses a model of computing very much inspired by the structure of the brain. It is a kind of Machine Learning that allows computers to improve with data and achieve great flexibility by learning to represent the world as a nested hierarchy of concepts.

In early talks on Deep Learning, Andrew Ng described it in the context of traditional artificial neural networks. In his talk titled “Deep Learning, Self-Taught Learning, and Unsupervised Feature Learning”, he described the idea of Deep Learning as:

Using brain simulations, hope to:
– Make learning algorithms much better and easier to use.
– Make revolutionary advances in machine learning and AI.

I believe this is our best shot at progress towards real AI.

So, What is BigDL?

BigDL is a distributed deep learning library created and open-sourced by Intel. It was designed from the ground up to run natively on Apache Spark, which enables data engineers and scientists to write deep learning applications as standard Spark programs, without having to explicitly manage distributed computations.

  • Rich Deep Learning Support
    • Modeled after Torch, BigDL provides comprehensive support for deep learning, including numeric computing via Tensor and high-level neural networks.
    • In addition, users can load pre-trained Caffe or Torch models into Spark applications using BigDL.
  • Extremely High Performance
    • BigDL uses Intel MKL (Math Kernel Library) and multi-threaded programming within each Spark task. Consequently, it is orders of magnitude faster than out-of-the-box open-source Caffe, Torch, or TensorFlow on a single-node Xeon, making performance comparable to that of a mainstream GPU instance.
  • Efficient Scaling
    • BigDL can efficiently scale out to perform data analytics at “Big Data scale” by leveraging Apache Spark, along with efficient implementations of synchronous SGD and all-reduce communication on Spark.

For more details on BigDL, click here.

And, Why BigDL on Qubole?

BigDL runs natively on Apache Spark, and because Qubole offers a greatly enhanced and optimized Spark as a service, it makes for a perfect deployment platform.

Highlights of Apache Spark as a service offered on Qubole

  • Auto-scaling Spark Clusters
    • In the open-source version of auto-scaling in Apache Spark, the required number of executors for completing a task is added in multiples of two. In Qubole, we’ve enhanced the auto-scaling feature to add the required number of executors based on a configurable SLA.
    • With Qubole’s auto-scaling, cluster utilization matches the workload precisely, so no compute resources are wasted and TCO is lowered. Based on our performance and cost-savings benchmark, we estimate that auto-scaling saves Qubole’s customers over $300K per year for just one cluster.
  • Heterogeneous Spark Clusters on AWS
    • Qubole supports heterogeneous Spark clusters for both On-Demand and Spot instances on AWS. This means that the slave nodes in Spark clusters may be of any instance type.
    • For On-Demand nodes, this is beneficial in scenarios when the requested number of primary instance type nodes is not granted by AWS at the time of the request. For Spot nodes, it’s advantageous when either the Spot price of the primary slave type is higher than the Spot price specified in the cluster configuration or the requested number of Spot nodes is not granted by AWS at the time of the request.
  • Optimized Split Computation for Spark SQL
    • We’ve implemented an optimization for AWS S3 listings that enables split computation to run significantly faster for Spark SQL queries. As a result, we’ve recorded up to 6X faster query execution and up to 81X faster AWS S3 listings.

To learn more about Qubole, click here.

Getting started with BigDL on Qubole

Prerequisites

Imp: Copy/upload the BigDL jar, the MNIST data files (train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte, and t10k-labels-idx1-ubyte), and the test images to an S3 bucket that your cluster can access. (The node bootstrap script below will download these files onto the cluster.)
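For example, assuming you have the AWS CLI installed and configured, the uploads might look like the following sketch. YOUR_S3_BUCKET and the jar version are placeholders, and note that the MNIST files are distributed gzipped, so decompress them first:

    # Decompress the MNIST files (they are distributed as .gz) before uploading
    gunzip train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz \
           t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz

    # Upload the BigDL jar, MNIST data files, and test images to your bucket
    aws s3 cp bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar s3://YOUR_S3_BUCKET/
    for f in train-images-idx3-ubyte train-labels-idx1-ubyte \
             t10k-images-idx3-ubyte t10k-labels-idx1-ubyte; do
      aws s3 cp "$f" s3://YOUR_S3_BUCKET/
    done
    aws s3 cp two_28x28.png s3://YOUR_S3_BUCKET/    # repeat for the other test images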

Steps

#1 If you don’t have a Spark cluster configured for this application, click here for instructions on how to configure one.
#2 Then, on the Clusters page, select/scroll down to the Spark cluster of your choice and click on Edit.
#3 On the Edit Cluster Settings page, click on the 4. Advanced Configuration tab.
#4 Scroll down to the SPARK SETTINGS section and copy and paste the following into Override Spark Configuration.

    spark-defaults.conf:
        spark.executorEnv.DL_ENGINE_TYPE mklblas
        spark.executorEnv.MKL_DISABLE_FAST_MM 1
        spark.executorEnv.KMP_BLOCKTIME 0
        spark.executorEnv.OMP_WAIT_POLICY passive
        spark.executorEnv.OMP_NUM_THREADS 1
        spark.yarn.appMasterEnv.DL_ENGINE_TYPE mklblas
        spark.yarn.appMasterEnv.MKL_DISABLE_FAST_MM 1
        spark.yarn.appMasterEnv.KMP_BLOCKTIME 0
        spark.yarn.appMasterEnv.OMP_WAIT_POLICY passive
        spark.yarn.appMasterEnv.OMP_NUM_THREADS 1
        spark.shuffle.reduceLocality.enabled false
        spark.shuffle.blockTransferService nio
        spark.scheduler.minRegisteredResourcesRatio 1.0
        spark.executor.instances 4
        spark.qubole.max.executors 4
Note: These parameters are required by BigDL, and setting them here will make them available to the Spark driver and executors across existing nodes as well as any new nodes that are added during auto-scaling in Qubole.
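These are standard Spark properties: spark.executorEnv.* exports environment variables to the executors, and spark.yarn.appMasterEnv.* exports them to the YARN application master. If you ever need to scope them to a single application rather than the whole cluster, you could pass them to spark-submit directly; a minimal sketch (the application class and jar names are hypothetical placeholders):

    # Pass the BigDL-required settings per application instead of per cluster
    # (include the remaining settings from the override above as additional --conf flags)
    spark-submit \
      --conf spark.executorEnv.DL_ENGINE_TYPE=mklblas \
      --conf spark.executorEnv.OMP_NUM_THREADS=1 \
      --conf spark.yarn.appMasterEnv.DL_ENGINE_TYPE=mklblas \
      --conf spark.yarn.appMasterEnv.OMP_NUM_THREADS=1 \
      --class com.example.MyBigDLApp my-bigdl-app.jar   # hypothetical class and jar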

#5 Save the cluster settings and configuration by clicking on Update. At this point, you should be back on the main Cluster page.
#6 Click on the dotted (…) menu all the way to the right and select Edit Node Bootstrap.
#7 Copy and paste the following script:

    #!/bin/bash
    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    make-python2.7-system-default

    # Temp directories used by our application
    mkdir -p /media/ephemeral0/bigdl
    mkdir -p /media/ephemeral0/bigdl/mnist
    mkdir -p /media/ephemeral0/bigdl/mnist/data
    mkdir -p /media/ephemeral0/bigdl/mnist/model

    is_master=`nodeinfo is_master`
    if [[ "$is_master" == "1" ]]; then
      echo "Setting BigDL env variables in /usr/lib/zeppelin/conf/zeppelin-env.sh"
      echo "export DL_ENGINE_TYPE=mklblas" >> /usr/lib/zeppelin/conf/zeppelin-env.sh
      echo "export KMP_BLOCKTIME=0" >> /usr/lib/zeppelin/conf/zeppelin-env.sh
      echo "export MKL_DISABLE_FAST_MM=1" >> /usr/lib/zeppelin/conf/zeppelin-env.sh
      echo "export OMP_NUM_THREADS=1" >> /usr/lib/zeppelin/conf/zeppelin-env.sh
      echo "export OMP_WAIT_POLICY=passive" >> /usr/lib/zeppelin/conf/zeppelin-env.sh

      echo "Restarting Zeppelin daemon"
      /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart

      echo "Downloading test digit image files from s3://YOUR_S3_BUCKET"
      hadoop dfs -get s3://YOUR_S3_BUCKET/two_28x28.png /media/ephemeral0/bigdl
      hadoop dfs -get s3://YOUR_S3_BUCKET/three.png /media/ephemeral0/bigdl
      hadoop dfs -get s3://YOUR_S3_BUCKET/four_28x28.png /media/ephemeral0/bigdl
      hadoop dfs -get s3://YOUR_S3_BUCKET/seven.png /media/ephemeral0/bigdl
      hadoop dfs -get s3://YOUR_S3_BUCKET/nine_28x28.png /media/ephemeral0/bigdl
      hadoop dfs -get s3://YOUR_S3_BUCKET/zero_28x28.png /media/ephemeral0/bigdl
    fi

    echo "Downloading BigDL jar from s3://YOUR_S3_BUCKET"
    sudo hadoop dfs -get s3://YOUR_S3_BUCKET/bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar /usr/lib/spark/lib

    echo "Downloading mnist data files from s3://YOUR_S3_BUCKET"
    hadoop dfs -get s3://YOUR_S3_BUCKET/t10k-images-idx3-ubyte /media/ephemeral0/bigdl/mnist/data
    hadoop dfs -get s3://YOUR_S3_BUCKET/t10k-labels-idx1-ubyte /media/ephemeral0/bigdl/mnist/data
    hadoop dfs -get s3://YOUR_S3_BUCKET/train-images-idx3-ubyte /media/ephemeral0/bigdl/mnist/data
    hadoop dfs -get s3://YOUR_S3_BUCKET/train-labels-idx1-ubyte /media/ephemeral0/bigdl/mnist/data

Imp: Replace YOUR_S3_BUCKET with the S3 bucket in your AWS account where you uploaded the BigDL jar and MNIST data files, and replace bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar with the name of your BigDL jar. Here is what’s happening in the above bootstrap script:

  • Set Python 2.7 as the system default
  • Create the temp directories that our application will access
  • On the master node, export the same BigDL environment variables we set for Spark in the previous step, this time for the Zeppelin driver, and restart the Zeppelin daemon so they take effect
  • Download the test images we will use in our Spark application
  • Download the BigDL jar so it’s available to import in our Spark application
  • Download the MNIST dataset that we will use to train the model in our application

#8 Click Save to save the bootstrap script.
#9 Click on Start to bring up the cluster.

That’s it!

Once the Spark cluster comes up, the BigDL deep learning library will be readily available for use in your Spark applications running on Qubole.
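If you'd like to sanity-check the bootstrap, and assuming you have SSH access to the cluster's master node, you can verify that everything landed where the script put it:

    # Confirm the BigDL jar landed in Spark's lib directory
    ls /usr/lib/spark/lib/bigdl-*-jar-with-dependencies.jar
    # Confirm the MNIST data files were downloaded
    ls /media/ephemeral0/bigdl/mnist/data
    # Confirm the BigDL environment variables were appended to the Zeppelin config
    grep DL_ENGINE_TYPE /usr/lib/zeppelin/conf/zeppelin-env.sh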

See you in Part 2!
