Qubole Education Spark 201 will define why Spark is considered “lazy” while establishing the relationship between actions and transformations. The course will explore Spark execution stages, stage boundaries and the stage shuffle and will end with a look at Notebook Interpreters.
Qubole Education Spark 201 includes “Try It” sections with instructions that you can follow in a personal account if you have access to the ‘paid-qubole/default-datasets’ bucket on S3. Please contact your administrator if you have trouble accessing the sample datasets from your personal account.
Qubole Education Spark 201 includes quiz questions after several of the lessons; you must answer them to complete the lessons and the course.
Estimated Time: 60 to 75 minutes
Actions & Transformations
Actions & Transformations (1 of 2)
Lazy Spark: Spark is “lazy” by design so that it can manage resources, workflows and data more efficiently within the cluster and across the executors and their associated tasks on the Slave nodes. When creating Spark workloads, keep in mind that there are two flavors of input: transformations and actions. When a job is submitted to the [...]
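The split between transformations and actions can be illustrated with a rough analogy in plain Python. This is not Spark code and uses no Spark API; the names `traced_upper`, `log` and `pipeline` are hypothetical. Generators stand in for transformations (they only describe work), and `list()` stands in for an action (it forces the work to run).

```python
# Plain-Python analogy for lazy evaluation; this is NOT Spark code.
log = []  # records each time real work is performed


def traced_upper(lines):
    # Generator: the body does not run until a consumer requests values.
    for line in lines:
        log.append(f"upper({line})")
        yield line.upper()


lines = ["spark", "is", "lazy"]

# "Transformations": these two lines only describe the work to be done.
pipeline = traced_upper(lines)
pipeline = (word for word in pipeline if len(word) > 2)

# Snapshot of the log before any "action": nothing has executed yet.
work_before_action = list(log)

# "Action": consuming the pipeline forces every step to run.
result = list(pipeline)
```

In this sketch `work_before_action` stays empty because no value was requested before `list(pipeline)` ran, which mirrors the idea that defining transformations alone does not start any processing.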
Try It - Actions & Transformations (2 of 2)
Overview: You will create an RDD from a dataset stored in S3 using the .textFile() command, which is a transformation, and then cache the RDD using the .cache() command, which is also a transformation. You will then inspect the cache by calling sc.getRDDStorageInfo, which returns an array of cached RDDs. Afterwards you [...]
Stages & Shuffle
Spark Stages (1 of 4)
Spark Stages: Spark breaks jobs down into stages based on their transformations and actions, and these stages determine the processing order and the required data shuffling. With the stages established, Spark designs an execution plan for completing the transformations, distributing the data across the nodes in the cluster and processing the actions. Stage Boundaries: Spark will attempt [...]
Spark Shuffle (2 of 4)
Spark Shuffle: If the data required for a transformation or action does not reside in the same partition in the cluster, Spark will move data around the cluster. This is known as a shuffle and is a key part of Spark execution; it allows Spark to place all tuples with the same key into [...]
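The idea of bringing tuples with the same key together can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: the function `shuffle_by_key`, the sample data and the choice of hash-based placement are all assumptions made for the example.

```python
# Simplified, single-process illustration of a key-based shuffle.
# NOT Spark's implementation; hash-based placement is an assumption here.


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) tuples so equal keys share one partition."""
    out = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            # Equal keys hash to the same value, so they land together.
            out[hash(key) % num_partitions].append((key, value))
    return out


# Before: tuples for key "a" and key "b" are spread over several partitions.
before = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
after = shuffle_by_key(before, 2)

# For each key, collect the indexes of the partitions that now hold it.
locations = {}
for index, partition in enumerate(after):
    for key, _ in partition:
        locations.setdefault(key, set()).add(index)
```

After the call, every key in `locations` maps to exactly one partition index. In a real cluster this regrouping means moving records between nodes, which is why the course treats shuffle as costly.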
Managing Shuffle (3 of 4)
Triggering Shuffle: Stage boundaries are expensive and may consume a large amount of system resources, so avoid them where possible. When composing workloads, consider both the selected operations and their order so that you can manage the evaluation plan and avoid shuffle. Spark will also shuffle when the number of partitions between stages is [...]
Try It - Stages & Shuffle (4 of 4)
Overview: You will create an RDD from a dataset stored in S3 using the .textFile() command, which is a transformation, and perform several actions and transformations on the data: Break the text into individual records based on the space between words. Remove all instances of supporting text - commas, periods, underscores, etc. Count the number [...]
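The first two steps listed above can be sketched in plain Python on a small in-memory sample. This is not the lesson's notebook code and uses no Spark API. The sample text is made up, and because the counting step is truncated in this outline, counting occurrences of each word (ignoring case) is an assumption.

```python
import re
from collections import Counter

# Made-up sample standing in for lines read from the S3 dataset.
text_lines = ["Spark is lazy, by design.", "lazy evaluation defers work_."]

# Step 1: break the text into individual records on the space between words.
words = [word for line in text_lines for word in line.split(" ")]

# Step 2: remove supporting characters such as commas, periods and underscores.
cleaned = [re.sub(r"[,._]", "", word) for word in words]

# Step 3 (ASSUMED): count how often each non-empty word occurs, ignoring case.
counts = Counter(word.lower() for word in cleaned if word)
```

In Spark terms, the first two steps would be transformations, while retrieving the final counts would require an action.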
Notebook Interpreters (1 of 2)
Notebook Interpreters: Qubole Spark Notebooks launch individual Spark applications on the Master Node in response to user queries and commands. By default, all notebooks share the underlying applications; however, developers and administrators may configure their own for a single notebook or a group of notebooks. This flexibility not only gives customers the ability to manage [...]
Try It - Notebook Interpreter (2 of 2)
Overview: You will create a new Interpreter, assign it to a Notebook and review the associated resource allocation through the Spark Application UI. You will then answer the associated quiz questions based on the behavior observed. Code: Navigate to the Notebook interface, select the Plus Sign + to create a new Notebook and set spark as the [...]