Qubole Education Spark 301 will discuss the structure of Qubole Yarn and the implications for managing resources and the Master Node of a cluster. The course will also provide an overview of Executor Optimization and Best Practices for implementation.
Qubole Education Spark 301 contains “Try It” sections with instructions that can be followed inside of a personal account if you have access to the ‘paid-qubole/default-datasets’ bucket on S3. Please contact your administrator if you have issues accessing the sample datasets from your personal account.
Qubole Education Spark 301 contains quiz questions after several of the lessons – it will be necessary to answer the quiz questions in order to complete the lessons and the course.
Estimated Time: 30 to 45 minutes
Spark Yarn Tuning
Yarn Cluster Management (1 of 2)
Yarn Cluster Structure
Yarn requires memory to be set aside as overhead on both the Master and Slave nodes in order to manage the resources in the environment and the data processing. As a result, it is important to consider overhead when troubleshooting processing issues with Spark. It is also important to understand [...]
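The overhead set aside for each executor follows a simple rule in Spark on Yarn: by default it is the larger of 384 MB or 10% of the executor memory, unless `spark.yarn.executor.memoryOverhead` is set explicitly. A minimal sketch of that container-sizing arithmetic, assuming those documented defaults (exact behavior can vary by Spark version):

```python
# Sketch of Yarn container sizing for a Spark executor, assuming the
# documented default: overhead = max(384 MB, 10% of spark.executor.memory)
# when spark.yarn.executor.memoryOverhead is not set.

MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10

def yarn_container_size_mb(executor_memory_mb, overhead_mb=None):
    """Return (overhead, total container request) in MB."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return overhead_mb, executor_memory_mb + overhead_mb

print(yarn_container_size_mb(4096))  # 4 GB executor -> (409, 4505)
print(yarn_container_size_mb(2048))  # 2 GB executor -> (384, 2432)
```

Note that the container Yarn actually grants is the executor memory plus this overhead, which is why an executor can be killed for exceeding memory limits even when the heap itself looks fine.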
Yarn Master Node (2 of 2)
Master Node Responsibilities
The Master Node of a Spark Cluster carries many responsibilities, so it is important to ensure that ample resources are available to it. Notebook Interpreters also run as applications on the Master Node, which can cause resource conflicts when there is heavy Notebook usage in [...]
Spark Executor Optimization
Executor Tuning (1 of 7)
Executor Tuning
When tuning Spark, keep in mind that memory usage is primarily split between execution, storage, and overhead. Execution memory is used for tasks such as shuffles, joins, sorts, and aggregations, while storage memory is used for caching and moving data across the cluster. Alternatively, memory can be split between memory used by objects, [...]
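The execution/storage split above can be made concrete. Under Spark's unified memory manager, the documented defaults are roughly: 300 MB of the heap is reserved, `spark.memory.fraction` (default 0.6) of the remainder forms a shared execution-plus-storage pool, and `spark.memory.storageFraction` (default 0.5) marks the soft boundary within it. A rough calculator assuming those defaults (values differ across Spark versions):

```python
# Rough breakdown of an executor heap under Spark's unified memory manager.
# Assumes documented defaults: 300 MB reserved, spark.memory.fraction=0.6,
# spark.memory.storageFraction=0.5. Actual behavior depends on Spark version.

RESERVED_MB = 300
MEMORY_FRACTION = 0.6
STORAGE_FRACTION = 0.5

def heap_breakdown_mb(heap_mb):
    usable = heap_mb - RESERVED_MB
    unified = usable * MEMORY_FRACTION    # shared execution + storage pool
    storage = unified * STORAGE_FRACTION  # soft boundary; evictable for execution
    execution = unified - storage
    user = usable - unified               # user data structures, internal metadata
    return {"storage": storage, "execution": execution, "user": user}

print(heap_breakdown_mb(4096))  # breakdown of a 4 GB executor heap
```

Because the storage boundary is soft, execution can borrow from storage (evicting cached blocks) when it needs more room, which is why cache-heavy jobs and shuffle-heavy jobs compete for the same pool.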
Executor Tuning: Heap Management (2 of 7)
Heap Management
Executors are allocated a specific amount of memory in the Spark Configuration, and this memory is distributed amongst various components. While it is important to reserve ample resources for managing cache and shuffle, processing may become very slow if there is not enough memory to support the Heap. If [...]
Executor Tuning: Cache Management (3 of 7)
Cache Management
When caching RDD data inside of a Spark Cluster it is important to consider both how the data is cached and the resulting cache size. Cached RDD data can quickly grow in size compared to the size of the source file. Serializing data can provide improved performance since less memory will be required [...]
Try It - RDD Serialization (4 of 7)
Overview
You will create an RDD from a dataset stored in S3 using the .textFile() command, cache the RDD using the .cache() command, and inspect the cache by calling the .getRDDStorageInfo command. You will then create a second cache using the .persist() command so that you can serialize the RDD. It will be necessary to import the StorageLevel [...]
Executor Tuning: Data Management (5 of 7)
Data Format
Understanding and properly managing the data types and file formats used in Spark can lead to significant performance improvements and potential cost savings. Java objects are fast to access; however, they can easily grow to between 200% and 500% of the original file size. Additionally, computation can be impacted when data formats are complicated to [...]
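The object-overhead claim is easy to see outside of Spark as well. The following is an illustrative Python sketch, not Spark code (JVM object headers differ from Python's, but the effect is the same): it compares the in-memory footprint of a list of boxed integers against the same values packed as raw 8-byte primitives.

```python
import struct
import sys

# 10,000 integer values stored two ways.
values = list(range(10_000))

# Boxed objects: the list's pointer array plus one full object per element.
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Primitive representation: 8 bytes per value, no per-element object headers.
packed_bytes = len(struct.pack(f"{len(values)}q", *values))

print(f"boxed:  {boxed_bytes} bytes")
print(f"packed: {packed_bytes} bytes")
print(f"ratio:  {boxed_bytes / packed_bytes:.1f}x")
```

The boxed form is several times larger than the raw data, which is the same dynamic that makes deserialized Java objects in a Spark cache balloon well past the source file size.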
Executor Tuning: Garbage Collection (6 of 7)
Garbage Collection
The resources required to complete garbage collection are proportional to the number of Java objects, so using primitive types will reduce the eventual cost of garbage collection. Persisting serialized RDDs in the cache also leads to simpler garbage collection since there is only a single byte array to clean up. Smaller cache sizes [...]
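Spark's own tuning documentation points at Kryo serialization and, for large heaps, the G1 collector plus GC logging as common starting points. A hedged example of what that might look like in spark-defaults.conf (the flag choices here are illustrative starting points, not prescriptive values):

```
# Use Kryo instead of Java serialization for smaller, faster serialized data.
spark.serializer                 org.apache.spark.serializer.KryoSerializer

# G1 collector with GC logging so collection pauses can be measured and tuned.
spark.executor.extraJavaOptions  -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
```

With GC logging enabled, the executor logs show how much time is spent in collection, which is the first thing to check before adjusting cache sizes or storage levels.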