Blog

×

Deep Learning on Qubole Using BigDL for Apache Spark – Part 1

By July 27, 2017

BigDL runs natively on Apache Spark, and because Qubole offers a greatly enhanced and optimized Spark as a service, it makes for a perfect deployment platform.

In this Part 1 of a two-part series, you will learn how to get started with distributed Deep Learning library BigDL on Qubole. By the end, you will have BigDL installed on a Spark cluster with a distributed Deep Learning library readily available for you to use in your Deep Learning applications running on Qubole.

In Part 2, you will learn how to write a Deep Learning application on Qubole that uses BigDL to identify handwritten digits (0 to 9) using a LeNet-5 (Convolutional Neural Networks) model that you will train and validate using MNIST database.

Before we get started, here’s some introduction and background on the technologies involved.

What is Deep Learning?

Deep learning is a form of Machine Learning that uses a model of computing very much inspired by structure of the brain. It is a kind of Machine Learning that allows computers to improve with data and achieve great flexibility by learning to represent the world as a nested hierarchy of concepts.

In early talks on Deep Learning, Andrew Ng described it in the context of traditional artificial neural networks. In his talk titled “Deep Learning, Self-Taught Learning and Unsupervised Feature Learning”, he described the idea of Deep Learning as:

Using brain simulations, hope to:
– Make learning algorithms much better and easier to use.
– Make revolutionary advances in machine learning and AI.

I believe this is our best shot at progress towards real AI.

 

So, What is BigDL?

BigDL is a distributed deep learning library created and open sourced by Intel. It was designed from the ground up to run natively on Apache Spark, and therefore enables data engineers and scientists to write deep learning applications as standard Spark programs–without having to explicitly manage distributed computations.

  • Rich Deep Learning Support
    • Modeled after Torch, BigDL provides comprehensive support for deep learning including numeric computing via Tensor and high level neural networks.
    • In addition, users can load pre-trained Caffe or Torch models into Spark applications using BigDL.
  • Extremely High Performance
    • BigDL uses Intel MKL (Math Kernel Library) and multi-threaded programming within each Spark task. Consequently, it is orders of magnitude faster than out-of-box open source Caffe, Torch or TensorFlow on a single-node Xeon — which is comparable to mainstream GPU instance.
  • Efficient Scaling
    • BigDL can efficiently scale out to perform data analytics at “Big Data scale” by leveraging Apache Spark, efficient implementations of synchronous SGD as well as all-reduce communications on Spark.

For more details on BigDL, click here.

And, Why BigDL on Qubole?

BigDL runs natively on Apache Spark, and because Qubole offers a greatly enhanced and optimized Spark as a service, it makes for a perfect deployment platform.

Highlights of Apache Spark as a service offered on Qubole

  • Auto-scaling Spark Clusters
    • In the open source version of auto-scaling in Apache Spark, the required number of executors for completing a task are added in multiples of two. In Qubole, we’ve enhanced the auto-scaling feature to add required number of executors based on configurable SLA
    • With Qubole’s auto-scaling, cluster utilization is matched precisely to the workloads, so there are no wasted compute resources and it also leads to lowered TCO. Based on our benchmark on performance and cost savings, we estimate that auto-scaling saves a Qubole’s customer over $300K per year for just one cluster.
  • Heterogeneous Spark Clusters on AWS
    • Qubole supports heterogeneous Spark clusters for both On-Demand and Spot instances on AWS. This means that the slave nodes in Spark clusters may be of any instance type.
    • For On-Demand nodes, this is beneficial in scenarios when the requested number of primary instance type nodes are not granted by AWS at the time of request. For Spot nodes, it’s advantageous when either the Spot price of primary slave type is higher than the Spot price specified in the cluster configuration or the requested number of Spot nodes are not granted by AWS at the time of request.
  • Optimized Split Computation for Spark SQL
    • We’ve implemented optimization with regards to AWS S3 listings which enables split computations to run significantly faster on Spark SQL queries. As a result, we’ve recorded up to 6X and 81X improvements on query execution and AWS S3 listings respectively.

To learn more about Qubole, click here.

Getting started with BigDL on Qubole

Prerequisites

Imp: Copy/upload BigDL jar, MNIST data files (train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte) and test images to S3 bucket that can be accessed from a remote shell script. (These files will need to be downloaded on the cluster via bootstrap script.)

Steps

#1 If you don’t have a Spark cluster configured for this application, click here for instructions on how to configure one.
#2 Then, on Clusters page, select/scroll down to the Spark cluster of your choice and click on Edit.
#3 On Edit Cluster Settings page, click on 4. Advanced Configuration tab.
#4 Scroll down to SPARK SETTINGS section and copy-and-paste the following in Override Spark Configuration.

 spark-defaults.conf:
 spark.executorEnv.DL_ENGINE_TYPE mklblas
 spark.executorEnv.MKL_DISABLE_FAST_MM 1
 spark.executorEnv.KMP_BLOCKTIME 0
 spark.executorEnv.OMP_WAIT_POLICY passive
 spark.executorEnv.OMP_NUM_THREADS 1
 spark.yarn.appMasterEnv.DL_ENGINE_TYPE mklblas
 spark.yarn.appMasterEnv.MKL_DISABLE_FAST_MM 1
 spark.yarn.appMasterEnv.KMP_BLOCKTIME 0
 spark.yarn.appMasterEnv.OMP_WAIT_POLICY passive
 spark.yarn.appMasterEnv.OMP_NUM_THREADS 1
 spark.shuffle.reduceLocality.enabled false
 spark.shuffle.blockTransferService nio
 spark.scheduler.minRegisteredResourcesRatio 1.0
 spark.executor.instances 4
 spark.qubole.max.executors 4
Note: These parameters are required by BigDL and setting them here will make them available to Spark driver and executors across existing nodes as well as any new nodes that are added during auto-scaling in Qubole.

#5 Save the cluster settings and configuration by clicking on Update. At this point, you should be back on the main Cluster page.
#6 Click on the dotted (…) menu all the way to the right and select Edit Node Bootstrap.
#7 Copy-and-paste the following script:

#!/bin/bash
source /usr/lib/hustler/bin/qubole-bash-lib.sh
make-python2.7-system-default

mkdir -p /media/ephemeral0/bigdl
mkdir -p /media/ephemeral0/bigdl/mnist
mkdir -p /media/ephemeral0/bigdl/mnist/data
mkdir -p /media/ephemeral0/bigdl/mnist/model

is_master=`nodeinfo is_master`
if [[ "$is_master" == "1" ]]; then
   echo "Setting BigDL env variables in usr/lib/zeppelin/conf/zeppelin-env.sh"
   echo "export DL_ENGINE_TYPE=mklblas" >> usr/lib/zeppelin/conf/zeppelin-env.sh
   echo "export KMP_BLOCKTIME=0" >> usr/lib/zeppelin/conf/zeppelin-env.sh
   echo "export MKL_DISABLE_FAST_MM=1" >> usr/lib/zeppelin/conf/zeppelin-env.sh
   echo "export OMP_NUM_THREADS=1" >> usr/lib/zeppelin/conf/zeppelin-env.sh
   echo "export OMP_WAIT_POLICY=passive" >> usr/lib/zeppelin/conf/zeppelin-env.sh

   echo "Restarting Zeppelin daemon"
   /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart

   echo "Downloading test digit image files from s3://YOUR_S3_BUCKET"
   hadoop dfs -get s3://YOUR_S3_BUCKET/two_28x28.png /media/ephemeral0/bigdl
   hadoop dfs -get s3://YOUR_S3_BUCKET/three.png /media/ephemeral0/bigdl
   hadoop dfs -get s3://YOUR_S3_BUCKET/four_28x28.png /media/ephemeral0/bigdl
   hadoop dfs -get s3://YOUR_S3_BUCKET/seven.png /media/ephemeral0/bigdl
   hadoop dfs -get s3://YOUR_S3_BUCKET/nine_28x28.png /media/ephemeral0/bigdl
   hadoop dfs -get s3://YOUR_S3_BUCKET/zero_28x28.png /media/ephemeral0/bigdl
fi

echo "Downloading BigDL jar from s3://YOUR_S3_BUCKET"
sudo hadoop dfs -get s3://YOUR_S3_BUCKET/bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar /usr/lib/spark/lib

echo "Downloading mnist data files from s3://YOUR_S3_BUCKET"
hadoop dfs -get s3://YOUR_S3_BUCKET/t10k-images-idx3-ubyte /media/ephemeral0/bigdl/mnist/data
hadoop dfs -get s3://YOUR_S3_BUCKET/t10k-labels-idx1-ubyte /media/ephemeral0/bigdl/mnist/data
hadoop dfs -get s3://YOUR_S3_BUCKET/train-images-idx3-ubyte /media/ephemeral0/bigdl/mnist/data
hadoop dfs -get s3://YOUR_S3_BUCKET/train-labels-idx1-ubyte /media/ephemeral0/bigdl/mnist/data
Imp: Replace YOUR_S3_BUCKET with S3 bucket in your AWS account where you uploaded BigDL jar and MNIST data files and also replace bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar with your BigDL jar. Here is what’s happening in the above bootstrap script:

  • Set Python 2.7 as system default
  • Create temp directories that are accessed by our application
  • Recall setting BigDL environment variables for Spark in previous step. Similarly, we need to make those available to Zeppelin driver running on the master node
  • Download test images we will use in our Spark application
  • Download BigDL jar so it’s available for us to import in our Spark application
  • Download MNIST dataset that we will use to train model in our application

#8 Click Save to save the bootstrap script.
#9 Click on Start to bring up the cluster.

That’s it!

Once the Spark cluster comes up, BigDL deep learning library will be readily available for you to use in your Spark application running on Qubole.

See you in Part 2!

Share our Post

Comments are closed.