
TDWI Checklist - The Automation and Optimization of Advanced Analytics Based on Machine Learning

ML users should look for tools with deep support for Spark.

Look for tools that can automatically spawn Spark clusters. This automation simplifies and accelerates incorporating new sources and putting their data assets to use.

A tool should help Spark auto-scale based on workload recognition. This optimization simplifies resource allocation and management so that scalability is reached sooner and more easily. It also ensures ample resources for analytics workloads such as ML.

A tool should provide extra security for the Spark cluster. For example, look for encrypted credentials.

Many Spark users prefer open source notebooks. Look for tools that are compatible with Zeppelin and Jupyter.

NUMBER FOUR
ACHIEVE SPEED AND SCALE—KEY REQUIREMENTS FOR ML DESIGN AND OPERATION—BY ADOPTING NEW ARCHITECTURES

Hadoop has served a useful purpose as a big data platform that users could afford and learn on. However, dissatisfaction is mounting because of Hadoop's limitations in speed, security, metadata, and SQL compatibility. Furthermore, Hadoop's architecture tightly couples compute and storage resources, which means you cannot scale up resources in one area without also doing so in the other.

Compute and Storage Decoupled

For better resource management and therefore better scaling, the current trend is toward big data platforms that decouple compute and storage. The trend is especially apparent in the growing adoption of other open source engines, such as Apache Spark, and of cloud-based object storage. When compute and storage are decoupled, they can be managed as separate resources. This allows compute and storage to scale and perform independently instead of being handcuffed together. In turn, this independence reduces limitations, resulting in greater speed and scale.

Decoupling has positive ramifications for ML. The performance and scale improvements of decoupling empower ML algorithms and models to read and score larger data volumes within smaller timeframes. In other words, data teams can run more jobs with the same budget, thereby keeping total cost of ownership low. Furthermore, decoupling provides more flexibility in how system resources are managed, so developers can be more innovative in designing solutions for analytics and ML.

Apache Spark Architecture

Spark's data platform architecture offers scalability, low cost, and compatibility with the Apache tool ecosystem similar to Hadoop's, but without Hadoop's limitations. For example, Spark has the linear scalability of MapReduce but with high performance and low latency. Spark enables iterative development (an ML requirement) and ad hoc queries against big data that Hadoop users can only dream of. Spark's library architecture means it will grow into ever-broader functionality, including a library for ML. Spark can integrate with Hadoop to improve it, but Spark can also work with many other data platforms, including cloud storage.

Cloud-Based Object Storage

Cloud storage has matured recently to support strategies based on objects, blocks, or files and folders. Among these, object storage is preferred for content-driven interfacing, as is typical with machine learning and some other analytics methods.
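To make the decoupled pattern concrete, here is a minimal PySpark sketch of training and scoring a model on data read directly from cloud object storage; the bucket path and column names are hypothetical placeholders, and the required object store connector is assumed to be configured on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Attach to a Spark session; in practice this cluster can be spawned
    # and auto-scaled by the tooling described above.
    spark = SparkSession.builder.appName("ml-on-object-storage").getOrCreate()

    # Read training data straight from an object store; no copy into
    # cluster-local HDFS is needed because storage is decoupled from compute.
    train = spark.read.parquet("s3a://example-bucket/training-data/")

    # Assemble hypothetical feature columns and fit a model with Spark's ML library.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    model = LogisticRegression(labelCol="label").fit(assembler.transform(train))

    # Score new data from the same store and write the results back to it.
    new_data = assembler.transform(spark.read.parquet("s3a://example-bucket/new-data/"))
    model.transform(new_data).write.mode("overwrite").parquet("s3a://example-bucket/scores/")

Because nothing in the sketch ties the cluster to where the data lives, the same job can run on a larger or smaller cluster as workloads change.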
The value of object storage lies in its simplicity (it is easy to work with), its economy (object storage tends to be cheaper than block storage), and the fact that it is not coupled with compute (thereby removing limits on speed and scale).

Spark and Object Store as an Integrated Architecture

Data and analytics professionals have started integrating Spark and object storage. TDWI expects this to become a common architecture because both Spark and object storage have compelling functionality for analytics and data management (as discussed throughout this report) and both decouple compute and storage for maximum speed, scale, and flexibility.

NUMBER FIVE
EMBRACE DATAOPS AND OTHER MODERN TEAM STRUCTURES THAT AFFECT MACHINE LEARNING SUCCESS

Until recently, data-driven development was slow, siloed, and misaligned. To get an ML model from inception to production, someone must collect data of interest for the project, explore the data looking for insight, firm up a hypothesis based on what they discover, collect more data based on the hypothesis, create an initial prototype of a model, get feedback about the prototype, iteratively evolve the model, collect even more data, get more feedback, iterate the prototype further, try out the model in a test system, revise the model accordingly, deploy the model in a production system, test again, revise again, and finally release the model (or its parent solution) to users.
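The core of that lifecycle is an iterate-evaluate-revise loop. The following rough PySpark sketch shows one way such a loop can look; the data path, feature columns, and the small grid of candidate settings are hypothetical stand-ins for whatever feedback and revisions a real team applies.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("model-lifecycle-sketch").getOrCreate()

    # Hypothetical labeled data, split so each iteration gets honest feedback.
    df = spark.read.parquet("s3a://example-bucket/labeled-data/")
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    evaluator = BinaryClassificationEvaluator(labelCol="label")  # area under ROC

    best_model, best_score = None, 0.0
    for reg in [0.0, 0.01, 0.1]:  # iterate: revise the prototype and re-evaluate
        model = LogisticRegression(labelCol="label", regParam=reg).fit(
            assembler.transform(train))
        score = evaluator.evaluate(model.transform(assembler.transform(test)))
        if score > best_score:
            best_model, best_score = model, score

    # Persist the chosen model so it can be promoted to test and production systems.
    best_model.save("s3a://example-bucket/models/candidate")

The human steps in the lifecycle (gathering more data, collecting stakeholder feedback, releasing to users) sit around this loop rather than inside it, which is exactly where DataOps-style team structures come into play.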
