TDWI Checklist - The Automation and Optimization of Advanced Analytics Based on Machine Learning

…pick and choose among them to get the most appropriate storage, read characteristics, in-place processing, and economics for each analytics use case, including ML for predictive analytics. Finally, note that this large data management infrastructure does not exist for ML alone; it also supports other forms of analytics, reporting, data preparation, data exploration, self-service data practices, and even operational applications.

Analytics users are trending toward data lakes and clouds. This report has already discussed the importance of training data—terabyte-scale data sets of diverse data consumed during the design phase of a machine learning solution. There are many ways to provision training data for machine learning. This data can come from multiple platforms in the extended data infrastructure—typically enterprise applications, data warehouses, IoT sources, clouds, and lakes. However, the trend is toward consolidating as much data as possible into a data lake. Furthermore, data lakes are trending toward elastic clouds for reasons of automation, optimization, and economics.

• The data lake manages the massive volume of detailed source data that ML needs. In fact, that's what a data lake is—a large repository of raw data of which most is captured and persisted in its original state as it comes from a source system or stream. Such a raw data repository is nirvana for analytics users. They can return to the original data over and over to repurpose it as new business questions or analytics projects come along (see the brief sketch after this list). This isn't possible with data warehouses, which contain mostly aggregated and calculated values for reporting. Another benefit of the data lake is that there's no need to move data. The data needed for ML is already in the lake and data sets generated by ML processing can be stored there as well. For these reasons, the data lake—though only a few years old—has quickly become the preferred big data store for discovery-oriented analytics ranging from self-service data preparation and visualization to data mining and machine learning.

• The cloud is ideal for the demanding and unpredictable workloads of ML and other analytics. For example, the iterative reads of large data sets that are common with ML development would wreak havoc with traditional data warehouses and other relational environments. Clouds cope with these easily using elasticity, in the form of workload-aware auto-scaling, which marshals resources as workloads ramp up, then reallocates resources as loads subside. As another example, building a large data lake with traditional platforms is cost prohibitive, but a data lake on modern cloud storage is comparatively cheap. Furthermore, an on-premises data lake demands time-consuming and risky system integration, whereas cloud providers handle the system integration for you. For these reasons, the cloud has become the preferred medium for data lakes and all forms of analytics, including machine learning.
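To make the repurposing idea concrete, here is a minimal sketch, not taken from the report, of how raw data already landed in a cloud data lake can be re-read and reshaped into a training set for a new business question. The bucket, paths, and column names are hypothetical, and at terabyte scale this work would normally run on a distributed engine such as Spark, covered next.

```python
# Minimal sketch (hypothetical bucket, paths, and columns): re-reading raw data
# that was persisted in its original state in a cloud data lake and repurposing
# it as a training set for a new question. Reading s3:// URIs with pandas
# assumes the s3fs and pyarrow packages are installed.
import pandas as pd

RAW_EVENTS = "s3://example-data-lake/raw/clickstream/"                    # raw, original-state data
TRAINING_SET = "s3://example-data-lake/curated/churn_training.parquet"    # derived data stays in the lake

# Read the raw events exactly as they were landed from the source system.
events = pd.read_parquet(RAW_EVENTS)

# Repurpose the same raw data for a new business question (here, simple churn features).
features = (
    events.groupby("customer_id")
          .agg(sessions=("session_id", "nunique"),
               last_seen=("event_time", "max"),
               spend=("purchase_amount", "sum"))
          .reset_index()
)

# Persist the generated training set back into the lake, next to the raw data.
features.to_parquet(TRAINING_SET, index=False)
```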
NUMBER THREE: CONSIDER MAKING APACHE SPARK YOUR PREFERRED ENGINE FOR MACHINE LEARNING

Apache Spark is an open source clustered computing framework for fast and flexible large-scale data analytics. TDWI sees Apache Spark as an important new technology, especially for analytics, big data environments, and Web-based operations.

Spark offers many benefits for analytics and machine learning (illustrated in the sketches that follow this list):

• Spark integrates with many data platforms and related systems. These include platforms important to ML and other analytics such as Hadoop, cloud-based object storage, and popular cloud-based data warehouse platforms.

• Spark integrates with the rich ecosystem of Apache tools. Users can take advantage of multiple engines, tools, and languages, then select the best tool for a given use case, data set, or workload.

• Spark primitives operate in-memory. For example, Spark has a parallel data processing framework that places data in Resilient Distributed Datasets (RDDs), a distributed data abstraction that scales to complex calculations with fault tolerance. This reduction of disk input and output (I/O) provides speed for iterative analytics (as with ML development) and modern data pipelining. Furthermore, once data is in Spark memory, many tools and applications can access it easily at high performance.

• Spark's architecture decouples compute and storage resources. This contributes to greater speed and scale for Spark clusters, especially compared to Hadoop clusters, where compute and storage are not decoupled. Decoupling also gives developers greater flexibility with deployment designs. Furthermore, Spark clusters can be heterogeneous as well as homogeneous.

• Spark libraries are highly useful to data management and analytics specialists. This is especially true of Spark's libraries for standard SQL, GraphX, and machine learning (called MLlib).
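As a first sketch of the integration and in-memory points above, the following PySpark fragment reads raw data from cloud object storage and caches it so that the repeated scans typical of iterative ML development avoid rereading storage. The storage path, schema, and cluster configuration are assumptions for illustration, not details from the report.

```python
# Minimal PySpark sketch (hypothetical path and columns): reading from cloud
# object storage and caching a derived data set in cluster memory.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ml-training-data").getOrCreate()

# Spark reads directly from object storage, assuming the S3 connector is configured.
events = spark.read.parquet("s3a://example-data-lake/raw/clickstream/")

# cache() keeps this data set in memory, so each iterative pass below reuses it
# instead of rereading the lake.
features = (
    events.groupBy("customer_id")
          .agg(F.countDistinct("session_id").alias("sessions"),
               F.sum("purchase_amount").alias("spend"))
          .cache()
)

# Two passes over the cached data; neither goes back to storage.
print(features.count())
print(features.approxQuantile("spend", [0.5], 0.01))
```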

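And a short sketch of the MLlib point, continuing the hypothetical example above: a labeled version of the feature set is split, a pipeline is fit, and predictions can be written back to the lake. The labeled_features data set and the churned label column are assumptions for illustration.

```python
# Minimal sketch of Spark's machine learning library (pyspark.ml / MLlib).
# `labeled_features` is assumed to be a Spark DataFrame with the feature
# columns from the previous sketch plus a 0/1 "churned" label.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["sessions", "spend"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="churned")
pipeline = Pipeline(stages=[assembler, lr])

# Train on one split, score the held-out split; results stay in Spark and can
# be persisted back to the data lake like any other data set.
train, test = labeled_features.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
predictions.select("customer_id", "probability", "prediction").show(5)
```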