The Importance of a Modern Cloud Data Lake Platform in Today’s Uncertain Market

November 23, 2020 (Updated March 21, 2024)

In the last few months, the COVID-19 crisis has been a wake-up call for organizations and public administrations: a new set of digital capabilities is needed to underpin new plans, policies, and models for addressing emerging challenges and opportunities. COVID-19 has exposed enterprise intelligence deficiencies, and opportunities, at the executive level that a new generation of chief data officers must address.

In an IDC study surveying 157 US-based CXOs, 87 percent of respondents said that enterprise intelligence is a key priority for the next five years. In effect, these executives are asking to move from using data to analyze past performance to using insights to influence performance in these turbulent times.

Gone are the days of building a data warehouse or a data lake simply because ‘we need to invest in Big Data.’ Today’s market environment dictates a use-case- and ROI-driven approach to business analytics, including investment in and deployment of an analytic data management platform. A use case for business analytics exists whenever any decision is made by anyone in an organization.

The data management team, like other users in the organization, also needs usage-monitoring functionality: the ability to analyze user-interaction and system performance data to uncover past trends, report on that analysis, predict future outcomes, and optimize system administration and management decisions, whether those decisions are made by humans or by machines.
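To make this concrete, a usage-monitoring analysis often starts with plain aggregation of query logs. The sketch below is a minimal illustration in Python; the log file and its columns (cluster, started_at, runtime_seconds) are invented for the example and do not describe any particular product’s schema.

```python
# Minimal sketch of usage-monitoring analysis. The log file and its columns
# (cluster, started_at, runtime_seconds) are invented for illustration.
import pandas as pd

logs = pd.read_csv("query_logs.csv", parse_dates=["started_at"])

# Past trends: weekly query counts and average runtimes per cluster.
weekly = (
    logs.set_index("started_at")
        .groupby("cluster")["runtime_seconds"]
        .resample("W")
        .agg(["count", "mean"])
)
print(weekly.tail())

# A crude predictive signal: flag clusters whose average runtime has risen
# for four consecutive weeks, suggesting future capacity or tuning work.
rising = weekly["mean"].groupby("cluster").apply(
    lambda s: s.tail(4).is_monotonic_increasing
)
print(rising[rising])
```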

While data warehousing remains an important component of a comprehensive analytic data management platform, its capabilities must be extended with the latest cloud-native capabilities that support analysts and data scientists tasked with cross-functional analysis of data from multiple internal and external sources, data arriving in batches and streams, and data residing in the cloud and on premises.

The reality for many data engineering teams is that they not only must maintain data warehouses, data marts, and data lakes, but also must develop and maintain connections among all these various data management solutions, both in the cloud and on-premises.

The Need for a Modern Architecture

Modern analytic data management requirements are driven by heightened awareness of the value of analytics in today’s uncertain market; the growing volume, variety, and velocity of data; and the availability of new data analysis, Artificial Intelligence (AI), and Machine Learning (ML) tools and techniques, many of them based on open source technology.

A modern analytic data management platform should not only support the advanced analytics techniques that data scientists require but also incorporate some of those techniques into its own operations. It should also be open: to multiple storage formats, to multiple data science and analytics tools and languages, and to various AI frameworks. Finally, it should support the development of multiple upstream and downstream data pipelines via connectors and Application Programming Interfaces (APIs), and allow internal IT staff to extend the solution using industry-standard development tools and techniques.
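As a concrete illustration of that openness, the sketch below shows a single engine reading two different storage formats and joining them into one downstream dataset. It uses PySpark only as an example; the bucket paths and column names (user_id, region) are assumptions made for the sketch.

```python
# Illustrative sketch (not a specific vendor API): one Spark session reading
# two storage formats and joining them into a single downstream dataset.
# Paths and column names (user_id, region) are assumed for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-formats-demo").getOrCreate()

clicks = spark.read.parquet("s3://example-bucket/clicks/")  # columnar batch data
users = spark.read.json("s3://example-bucket/users/")       # semi-structured data

# A small cross-source pipeline: enrich clickstream events with user attributes.
enriched = clicks.join(users, on="user_id", how="left")
enriched.groupBy("region").count().show()
```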

Today’s analytic data platforms must be able to scale to handle dozens of terabytes of data daily. For variable workloads, the platform must scale up and down as needed to deliver optimal price/performance. And given today’s constraints on IT budgets and staffing, the platform should include automation that frees data engineers and system administrators from mundane daily tasks, allowing technical staff to focus on higher-value projects.

A modern analytic data management platform should include capabilities such as:

  • Workload- and service-level-agreement (SLA)-aware auto-scaling, both up and down (a simplified policy sketch follows this list)
  • Intelligent spot management
  • Dynamic workload packing
  • Automated cluster life-cycle management
  • Multitenant and single-tenant cluster management for applications
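To make the first item above concrete, here is a toy sketch of a workload- and SLA-aware scaling decision. The QueueStats fields, thresholds, and scaling rule are invented for illustration; production schedulers weigh far richer signals.

```python
# Toy sketch of workload- and SLA-aware auto-scaling. The QueueStats fields
# and the scaling rule are invented for illustration only.
from dataclasses import dataclass

@dataclass
class QueueStats:
    pending_tasks: int        # tasks waiting to be scheduled
    avg_wait_seconds: float   # how long tasks have been waiting
    sla_wait_seconds: float   # wait time the SLA allows
    nodes: int                # current cluster size

def target_nodes(q: QueueStats, min_nodes: int = 2, max_nodes: int = 50) -> int:
    """Scale up when queued work threatens the SLA; scale down when idle."""
    if q.avg_wait_seconds > q.sla_wait_seconds and q.pending_tasks > 0:
        # Upscale proportionally to how far we are over the SLA budget.
        factor = q.avg_wait_seconds / q.sla_wait_seconds
        return min(max_nodes, int(q.nodes * factor) + 1)
    if q.pending_tasks == 0:
        # Downscale gradually instead of releasing all capacity at once.
        return max(min_nodes, q.nodes - 1)
    return q.nodes

print(target_nodes(QueueStats(pending_tasks=40, avg_wait_seconds=120,
                              sla_wait_seconds=60, nodes=8)))  # -> 17
```

The point of the sketch is the shape of the decision: scale up aggressively when SLAs are at risk, and scale down gradually when capacity sits idle.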

These platforms need to strike a balance between proven automation techniques and the human experience and expertise that can add contextual intelligence still not grasped by today’s ‘intelligent machines.’

Considering Qubole

As a data lake software provider, Qubole gives customers a modern data lake platform. From our origins, we have embraced open source components, and today we incorporate Spark, Presto, Hive, and Airflow. Qubole itself does not provide the underlying storage and compute infrastructure; that is supplied by a cloud platform provider such as Google Cloud, AWS, or Azure. Customers access Qubole’s technology via APIs or native user interfaces.
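As a rough illustration of what API-driven access looks like, the snippet below submits a SQL command over a generic REST call. The endpoint URL, header name, and payload fields are assumptions for the sketch and should be checked against the platform’s actual API documentation.

```python
# Hedged sketch of API-driven access: submit a query over REST.
# The endpoint URL, auth header, and payload fields below are assumptions
# for illustration, not documented details of any vendor's API.
import requests

API_URL = "https://api.example-platform.com/api/v1/commands"  # assumed endpoint
headers = {
    "X-AUTH-TOKEN": "<your-api-token>",  # assumed auth header
    "Content-Type": "application/json",
}
payload = {
    "command_type": "SqlCommand",  # assumed field names
    "query": "SELECT region, count(*) FROM events GROUP BY region",
}

resp = requests.post(API_URL, json=payload, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())
```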

Qubole provides a module called Cost Explorer, an internal system- and usage-performance analysis and reporting tool that data teams can use to track spending at fine granularity, monitor showback, and budget and allocate costs in support of ongoing decision making. Together, these monitoring and analysis features underpin Qubole’s ability to automate myriad data engineering and administration tasks. This automation gives organizations valuable leverage in deploying their overextended data engineering teams, increasing the productivity of both IT teams and data science and analytics teams.
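Showback reporting of the kind Cost Explorer supports ultimately comes down to attributing spend to teams and comparing it against budgets. The sketch below is a generic illustration with an invented cost-record schema and budget figures, not Qubole’s actual data model.

```python
# Generic showback sketch: attribute compute spend to teams and compare it
# against budgets. The record schema and budget numbers are invented.
import pandas as pd

costs = pd.DataFrame([
    {"team": "analytics", "cluster": "etl-1", "usd": 412.50},
    {"team": "analytics", "cluster": "adhoc", "usd": 130.00},
    {"team": "ml",        "cluster": "train", "usd": 980.25},
])
budgets = {"analytics": 600.0, "ml": 800.0}

showback = costs.groupby("team")["usd"].sum().to_frame("spend")
showback["budget"] = showback.index.map(budgets)
showback["over_budget"] = showback["spend"] > showback["budget"]
print(showback)
```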

As the Associate Vice President (AVP) of data engineering at a digital commerce company put it, “Qubole running on Google Cloud gives a much better ability to automatically start and terminate jobs.”
