TDWI Checklist - The Automation and Optimization of Advanced Analytics Based on Machine Learning

Issue link: https://www.qubole.com/resources/i/1161239


Page 3 of 8

TDWI RESEARCH tdwi.org
TDWI CHECKLIST REPORT: THE AUTOMATION AND OPTIMIZATION OF ADVANCED ANALYTICS BASED ON ML

NUMBER ONE: GIVE MACHINE LEARNING A DATA ENVIRONMENT THAT IS AS DIVERSE AS IT IS BIG

Machine learning has a voracious appetite for data during both development and production, which makes unique demands of an organization's data management infrastructure.

• Machine learning is small but powerful. ML algorithms and models are tiny compared to the vastness of big data and the multiplatform infrastructure required to capture and leverage it. The broader the data (in terms of its sources and the entities represented), the more comprehensive the model is. Hence investments in data management are worthwhile because ML provides insights that raise the ROI of programs for big data, analytics, data warehousing, data lakes, and so on. Furthermore, when organizations already have a big data infrastructure, adding ML extends the life cycle and business value of that infrastructure.

• Data management infrastructure can be vast. It can, for example, include platforms and tools for data warehousing, data lakes, data integration, data preparation, multiple forms of analytics, and big data. New data platforms are emerging as well, dominated by clouds; open source engines, libraries, and languages (for example, Apache Spark); and self-service tools. That is a long list of platforms, technologies, and processing engines. Yet, it is all required for modern organizations that want to operate and compete on analytics and intelligence.

• Each form of analytics (including ML) has its own data requirements. First, savvy organizations are deploying tools for multiple types of analytics (not just machine learning) because each type tells them something unique and valuable. Second, each analytics approach needs data that is prepared and presented in a certain way so that an analytics tool has data in a schema, a quality condition, and on a data platform optimal for the analytics tool or the user practice involved. For example, machine learning algorithms are almost always optimized for raw detailed source data. Thus, the data environment must provision large quantities of raw data because that is required for discovery-oriented analytics practices such as data exploration, data mining, statistics, and machine learning.

• ML needs data from diverse sources, in diverse formats, about diverse business processes. For the most comprehensive learning experience, a data management infrastructure should provide diverse training data (integrated from multiple, diverse sources and concerning various business entities) to make algorithmic assessments more real-world, accurate, and successful in production.

To support machine learning successfully, the data management infrastructure needs speed and scale.

• Scale to large volumes of training and test data: A common reason for model failure in production is the lack of proper training data, which needs to be massive, diverse, and from real-world processes.

• Speed and scale for analytics processing: Machine learning is very good at finding patterns in large amounts of data. For instance, models built with machine learning can establish baseline profiles for various entities (e.g., authorized versus unauthorized transactions) and then predict which users or accounts are likely to present unauthorized transactions. This approach must scale and perform with both baseline creation and production comparisons, even in multiterabyte data environments.

• Speed for agile, iterative development: Machine learning design usually requires an iterative process, where a developer tweaks an algorithm and reruns it immediately. The data environment must perform with low latency, even with large data sets and distributed queries that hit multiple systems.

• Real-time data capture for real-world processes: Some machine learning solutions compare the latest data to a predictive model as the data streams in real time. This requires that the data management infrastructure include special tools and platforms for streams or event processing.

NUMBER TWO: EMBRACE DATA TECHNOLOGIES AND APPROACHES THAT ARE KEY TO MACHINE LEARNING SUCCESS, NAMELY DATA LAKES AND CLOUDS

Multitechnology infrastructure has become the norm for data environments, especially in analytics-driven organizations. This is because it takes diverse platforms, processing engines, tools, and technologies to satisfy the requirements of diverse data and diverse analytics. The result is today's multiplatform data management infrastructure, which, in modern firms, includes a mix of big data platforms, clouds, data lakes, and data warehouses. This portfolio of diverse data engines and tools is increasingly hybrid in the sense that some systems and data are on clouds and others are on-premises. All these data platforms work together in today's hybrid, multitechnology data environments and users can
