Admin 201 explores the business use cases that drive the selection of a cloud cluster and the use cases best suited to each cluster type available in Qubole. The course also covers the documented recommended starting instances for each cluster type, based on the use cases described.
Qubole Education Admin 201 includes quiz questions after several of the lessons. You must answer the quiz questions to complete the lessons and the course.
Estimated Time: 45 to 60 minutes
Hadoop & Hive Clusters (1 of 4)
Fundamentally, Hadoop is a parallel data processing platform that uses open source software, a distributed file system (HDFS), and the MapReduce execution engine to store, manage, and process very large data sets in parallel across distributed clusters of commodity servers. Because the MapReduce framework and HDFS both run on the same [...]
Spark Clusters (2 of 4)
Spark is a scalable, open source processing engine designed for fast and flexible analysis of big data. Spark can process large volumes of data quickly because results can be persisted in memory on Spark’s own distributed framework. Spark’s ability to handle both batch and streaming workloads as well as [...]
Presto Clusters (3 of 4)
Presto is an open source distributed SQL query engine, designed and written from the ground up to deliver query speeds comparable to those of commercial data warehouses. Essentially, Presto is a processing engine suited to interactive analytics and less complex queries performed mostly in memory. Presto provides an interesting middle ground between Map [...]
Customizing Clusters (4 of 4)
Advanced applications used by our customers often require custom software or packages to be installed in clusters as a prerequisite. Qubole lets administrators bootstrap the nodes in a cluster with custom scripts during startup by providing the location of a bash script used for installing custom software packages on cluster [...]
ETL Reliability (Cluster Comparison) (1 of 10)
ETL workloads typically involve massive volumes of unstructured data, large clusters and extended processing times. As a result, enterprises tend to choose Hive with Hadoop as the engine and cluster pair for processing ETL batch workloads. Hive with Hadoop is a stable solution: in the event of memory depletion during processing [...]
Interactive Analytics (Cluster Comparison) (2 of 10)
Spark is fast enough to perform exploratory queries without sampling, and it interfaces with several development languages including SQL, R, and Python. Big data consists of structured and unstructured data, each of which is queried differently. Spark SQL provides an SQL interface to Spark that allows developers to commingle SQL queries of [...]
Advanced Analytics (Cluster Comparison) (3 of 10)
Spark Core features a component designed to integrate with and support streaming data, so Spark is often the preferred cluster type for this use case. Qubole also features Spark Notebooks, so in addition to leveraging the streaming component, developers may use the Notebook interpreters to manipulate, process and analyze [...]
Temporary Data (Cluster Comparison) (4 of 10)
Both Hadoop and Spark can persist temporary data; however, each cluster takes a different technical approach. With Qubole Hive, temporary tables can be created and associated with data stored locally in the Hadoop Distributed File System. Either Hive or Spark can be used to take advantage of temporary data.
Caching Data (Cluster Comparison) (5 of 10)
Users may explicitly persist data in memory using features available in Spark and the languages it supports.
SQL Queries (Cluster Comparison) (6 of 10)
There may be existing SQL workloads that need to access cloud data, or individuals and teams who would prefer to write SQL to drive data processing. Because it processes data in memory, Presto can provide performance benefits for simple queries that do not feature multiple large joins and complex aggregations. However [...]
ODBC / JDBC / BI Tools (Cluster Comparison) (7 of 10)
If ODBC or JDBC connectivity is required, the specific connectivity needs must be considered. All three clusters provide options for ODBC / JDBC connectivity to BI tools; however, the exact methods and tools may differ across the solutions. Selecting the right [...]
Ease of Use (Cluster Comparison) (8 of 10)
When selecting a cluster, it is important to keep in mind how much user input is required to manage memory efficiently. Spark users need to know whether the memory they have access to is sufficient for a dataset, and adding more users further complicates this since the users [...]
Language Support (Cluster Comparison) (9 of 10)
Both Hadoop and Spark offer an array of options for query and command creation or workload and job submission. Hadoop supports a larger set of languages and query types than the other cluster options. However, Spark also accepts SQL, R, Python, Scala and command-line submissions, making it a very appealing option for [...]
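The use-case pairings described in the cluster comparison lessons above can be collected into a small lookup table. The following is a minimal Python sketch for study purposes only, not a Qubole API: the dictionary keys, the name `PREFERRED_CLUSTERS`, and the function `suggest_clusters` are all hypothetical, and the entries reflect only what the lesson excerpts state (several excerpts are truncated, so the full lessons may add caveats).

```python
from typing import Dict, List

# Hypothetical summary of the lesson excerpts; not an official Qubole mapping.
PREFERRED_CLUSTERS: Dict[str, List[str]] = {
    "etl": ["Hadoop (Hive)"],                    # stable for large batch workloads
    "interactive_analytics": ["Spark"],          # exploratory queries without sampling
    "streaming": ["Spark"],                      # streaming component
    "temporary_data": ["Hadoop (Hive)", "Spark"],
    "caching": ["Spark"],                        # explicit in-memory persistence
    "simple_sql": ["Presto"],                    # in-memory, few large joins
}


def suggest_clusters(use_case: str) -> List[str]:
    """Return the cluster types the lesson excerpts associate with a use case."""
    # Normalize input such as "Simple SQL" to the key form "simple_sql".
    key = use_case.strip().lower().replace(" ", "_")
    if key not in PREFERRED_CLUSTERS:
        raise KeyError(f"No pairing recorded for use case: {use_case!r}")
    # Return a copy so callers cannot modify the shared table.
    return list(PREFERRED_CLUSTERS[key])


if __name__ == "__main__":
    for case in ("ETL", "Simple SQL", "Temporary Data"):
        print(f"{case}: {', '.join(suggest_clusters(case))}")
```

A table like this is only a starting point for review; connectivity, ease of use, and language support are compared separately in the lessons and are not captured here.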
Cluster Breakdown Review (10 of 10)
The following questions will test your comfort with the breakdown of use cases and the associated clusters.
Azure Instance Variations (1 of 2)
Cloud storage and compute vendors offer several instance types for customers to select for the nodes of a cluster. This flexibility exists because not all jobs and workloads require the same resources, so a more intentional approach may be taken to infrastructure management and selection to accomplish the required work. Of course [...]
Cluster Instance Starting Point (2 of 2)
When creating a cluster, the first step is selecting the cluster type, which narrows the scope of the recommended instance types and therefore the slave node selection. Typically, the engine required to process the queries or commands determines the cluster type. Upon selection of the cluster type, the appropriate cloud [...]