Practical Guide to Financial Governance of Data Lake Initiatives

November 12, 2019 · Updated March 27, 2024

Introduction

Enterprises today are becoming more data-driven: data fuels their innovation engine as they build new products, outmaneuver the competition, and provide customers with better experiences. As a result, big data management and processing for stakeholders such as Data Analysts, Data Engineers, and Data Operations teams must be fast, automated, and scalable.

Four common reasons for unexpected bills

Enterprises running multiple big data projects use public cloud services for compute and storage. The public cloud provides the agility and scale to launch these projects within a few clicks and to broadly match supply with demand. It also lets enterprises build and run best-of-breed big data processing systems. Because the services are on-demand and pay-as-you-go, ad hoc analysis and big data POCs can start easily and without huge upfront bills. Over time, as projects mature or ad hoc queries grow longer, the seemingly endless supply of underlying resources leads to wasteful spending on compute and storage, with very little accountability beyond guesstimate showback. The result is cost unpredictability and a lack of financial governance, typically associated with the following:

Long-Running Servers

Most application requests are driven by external clients and cannot be predicted in advance. As a result, servers are kept running in anticipation of these requests.

Performance Optimization

Typical web applications serving external requests are optimized to reduce latency rather than cost. As a result, servers are provisioned well in advance rather than on demand, and are not tuned to adapt to changing workloads.


Figure 1: Qubole Credit Units Consumption Trends on Monthly, Hourly, and Daily Basis

Uniform Load Increase vs Burstiness

Most web applications are built on the assumption that workload increases and decreases gradually over time, rather than spiking acutely or dropping off at the tail end. Bursty big data workloads break this assumption: a job spawning thousands of parallel tasks can require thousands of servers for a short, unplanned duration, and such usage cannot practically be micromanaged to keep financial checks and balances in place.

Idle Period

Unlike web applications, which see a steady flow of traffic 24/7, big data workloads can be scattered across a given day, leading to several blocks of idle time when no workloads are running.

Best practices for ongoing financial governance

Data-driven enterprises now face financial governance challenges on a regular basis, as the number of internal big data projects on the public cloud has risen exponentially. While traceability and predictability are important elements of financial governance policies, cost control and expense reduction are usually the starting focus of any financial governance exercise. Enterprises take the following steps to establish financial governance:

Optimize for Performance

Enterprises optimize for performance in a way that accounts not only for the speed of query execution but also for the timeliness of that execution.

Prioritize Capacity Management as an ongoing exercise

Capacity management in the cloud is now about optimizing infrastructure utilization within financial governance guardrails, so teams can move fast on their projects without worrying about unexpected bills. During this optimization, the organization's goal is to build systems that consistently provide capacity slightly above what is needed, while maintaining traceability and predictability of cost metrics at the user, cluster, and job level. Organizations are doing the following:

  • Remove orphaned or unused infrastructure
    Remove infrastructure that was left behind when the resource it belonged to was terminated (e.g., disk volumes, ideally combined with an auto-snapshot before deletion), or infrastructure that has sat idle for a specified amount of time (see the sketch after this list).
  • Resize under-utilized infrastructure
    Adjust the size of infrastructure that has spare resources to an appropriate level. This requires careful policy creation, as capacity has to account for expected spikes in usage.
  • Manage infrastructure lifecycles on schedules
    Automate the creation and destruction of systems to fit around usage patterns. For example, create development environments for use during office hours, or extend production platforms during peak trading hours.
  • Cost-optimize with a heterogeneous environment
    Apply tooling, generally DIY scripts, to automate system management so that the best-value infrastructure is used while meeting the levels of resilience and availability the system requires.
  • Enforce traceability through tagging
    Apply rules at the cluster, cluster-instance, user, and job level so that policies automatically destroy any created elements that do not meet the tagging policy in place.
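
By way of illustration, the orphaned-infrastructure cleanup above might be sketched on AWS with boto3 as follows. The 'available'-volume filter, the dry-run default, and the snapshot-before-delete behavior are assumptions for the sketch, not a prescribed policy:

```python
import boto3

ec2 = boto3.client("ec2")

def clean_orphaned_volumes(dry_run=True):
    """Snapshot, then delete, EBS volumes left behind after instance termination."""
    # Volumes in the 'available' state are attached to nothing: cleanup candidates.
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        for vol in page["Volumes"]:
            vol_id = vol["VolumeId"]
            # Auto-snapshot before deletion so the data stays recoverable.
            snap = ec2.create_snapshot(
                VolumeId=vol_id,
                Description=f"pre-delete backup of orphaned volume {vol_id}",
            )
            print(f"snapshotted {vol_id} -> {snap['SnapshotId']}")
            if not dry_run:
                ec2.get_waiter("snapshot_completed").wait(
                    SnapshotIds=[snap["SnapshotId"]]
                )
                ec2.delete_volume(VolumeId=vol_id)
```

Running such a script on a schedule, with dry_run=True until the policy is agreed, is one common way to keep the first two practices continuous rather than one-off.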
Figure 2: User-based Breakdown of Resource Consumption

Adopting a Data Platform with Built-in Financial Governance Metrics

Enterprises now leverage data platforms to drive financial governance within the organization: to reduce operating costs, perform ROI analysis, hold showback discussions, and identify where spend is going. In addition to cluster lifecycle management, Qubole data services offer Workload-Aware Autoscaling to strengthen financial governance as multiple teams run big data in a shared cloud environment or in separate ones. Its capabilities can be combined to deliver more savings without compromising performance:

Cost Explorer

Cost Explorer provides user-, job-, and cluster-level cost metrics in a shared environment, giving data-driven showback information for fact-based discussions in the enterprise.


Figure 3: Cost Reporting with Qubole
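
The roll-up behind such a report can be illustrated with a short sketch. The usage records, field names, and flat $0.35 node-hour rate below are hypothetical, not Qubole's schema:

```python
import pandas as pd

# Hypothetical usage export: one row per job run in the shared environment.
usage = pd.DataFrame([
    {"user": "alice", "cluster": "etl-prod",  "job_id": "j1", "node_hours": 40.0},
    {"user": "bob",   "cluster": "adhoc-dev", "job_id": "j2", "node_hours": 12.5},
    {"user": "alice", "cluster": "adhoc-dev", "job_id": "j3", "node_hours": 3.0},
])
RATE_USD_PER_NODE_HOUR = 0.35  # illustrative blended rate
usage["cost_usd"] = usage["node_hours"] * RATE_USD_PER_NODE_HOUR

# Showback roll-ups at the user and cluster level for fact-based discussions.
print(usage.groupby("user")["cost_usd"].sum())
print(usage.groupby("cluster")["cost_usd"].sum())
```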

Optimized Upscaling

Optimized Upscaling reclaims unused resources from the running containers in the cluster and allocates them to the pending containers. This improves cluster throughput and reduces delays due to provisioning machines, while lowering TCO through cost avoidance.
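
Qubole performs this inside the cluster manager; purely as a conceptual sketch (the Container fields and the 1.25 safety margin are invented), reclaiming over-reserved headroom for pending containers could look like:

```python
from dataclasses import dataclass

@dataclass
class Container:
    requested_mb: int   # memory the container reserved
    used_mb: int        # observed peak memory actually in use

def reclaimable_mb(running, margin=1.25):
    """Headroom reserved by running containers but not actually used."""
    return sum(
        max(0, c.requested_mb - int(c.used_mb * margin)) for c in running
    )

def admit_pending(running, pending):
    """Admit pending containers into reclaimed headroom before adding nodes."""
    budget = reclaimable_mb(running)
    admitted = []
    for c in pending:
        if c.requested_mb <= budget:
            budget -= c.requested_mb
            admitted.append(c)
    # Pending containers not admitted here are what drive actual upscaling.
    return admitted
```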

Aggressive Downscaling

Aggressive Downscaling prevents cost overruns after a job completes by rebalancing workloads across active nodes and shutting down idle nodes, decommissioning them without the risk of data loss.
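
A rough sketch of the idea, with an illustrative idle threshold and hypothetical offload_local_data/terminate_instance helpers standing in for the real decommissioning machinery:

```python
import time

IDLE_GRACE_SECONDS = 300  # illustrative idle threshold, not a Qubole default

def idle_nodes(nodes, now=None):
    """Nodes that have run nothing for the whole grace period."""
    now = now if now is not None else time.time()
    return [
        n for n in nodes
        if n["running_containers"] == 0
        and now - n["last_task_finished_at"] >= IDLE_GRACE_SECONDS
    ]

def decommission(node):
    """Graceful removal: stop scheduling, offload local data, then terminate."""
    node["state"] = "DECOMMISSIONING"        # no new containers are placed here
    offload_local_data(node)                 # hypothetical helper: copy HDFS/shuffle
                                             # blocks to surviving nodes first
    terminate_instance(node["instance_id"])  # hypothetical cloud-API helper
```

Decommissioning before termination is what removes the data-loss risk: local blocks are replicated away before the instance disappears.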

Container Packing

Container Packing, a resource allocation strategy, makes more nodes available for downscaling in an elastic computing environment, while preventing hot spots in the cluster and honoring data locality preferences.
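
At its core this is a bin-packing decision. The toy sketch below packs containers onto the busiest nodes first so lightly loaded nodes drain empty; it deliberately omits the hot-spot caps and locality preferences that the real strategy honors:

```python
def pack(containers_mb, nodes):
    """Place containers on the busiest nodes first so empty nodes stay empty.

    'nodes' is a list of dicts with 'free_mb' and 'allocated' keys, where
    'allocated' starts empty for idle nodes. A spread-style scheduler would
    balance load across all nodes; packing concentrates it instead, freeing
    whole nodes for downscaling.
    """
    nodes.sort(key=lambda n: n["free_mb"])            # least free space first
    for size in sorted(containers_mb, reverse=True):  # biggest containers first
        for node in nodes:
            if node["free_mb"] >= size:
                node["free_mb"] -= size
                node["allocated"].append(size)
                break
    # Nodes that received nothing are immediate downscaling candidates.
    return [n for n in nodes if not n["allocated"]]
```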

Diversified Spot

Qubole automatically provisions nodes from different EC2 instance types to maximize the number of instances fulfilled by the cloud provider. Diversifying the instance types reduces the chance of bulk interruption of Spot nodes by the cloud provider. When a Spot node of one particular instance type is not available, Qubole automatically tries other instance types rather than falling back to on-demand.
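
Qubole manages this automatically; the underlying EC2 capability can be sketched with boto3's Spot Fleet API and its 'diversified' allocation strategy (the role ARN, AMI, and instance types are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Spread the Spot request across several instance-type pools; the
# 'diversified' allocation strategy asks EC2 to balance capacity across
# pools, so one pool being reclaimed cannot take out the whole cluster.
response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",  # placeholder
        "TargetCapacity": 20,
        "AllocationStrategy": "diversified",
        "LaunchSpecifications": [
            {"ImageId": "ami-0abcdef1234567890", "InstanceType": t}  # placeholder AMI
            for t in ("r5.2xlarge", "r5a.2xlarge", "r4.2xlarge", "i3.2xlarge")
        ],
    }
)
print("fleet:", response["SpotFleetRequestId"])
```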

Managed Spot Block

While Spot Block instances provide reliability for a finite duration (1-6 hours), AWS reclaims these nodes after that duration, leading to node loss (failure). To prevent this failure from impacting cluster operations or workloads, Qubole has built-in Intelligent Spot Block Management features that provide risk mitigation, impact mitigation, and fault tolerance, ensuring cluster operations can continue, without failure, beyond the finite duration of the Spot Block.
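
At the EC2 API level, a Spot Block was requested with the BlockDurationMinutes parameter (AWS has since stopped accepting new Spot Block requests). A minimal sketch, with placeholder AMI and instance type, and with Qubole's pre-expiry replacement logic omitted:

```python
import boto3

ec2 = boto3.client("ec2")

# A Spot Block request: Spot capacity that EC2 guarantees for a fixed
# 1-6 hour window instead of being interruptible at any time. A managed
# layer would provision replacement capacity shortly before the block
# expires so the cluster never loses nodes mid-job.
resp = ec2.request_spot_instances(
    InstanceCount=4,
    BlockDurationMinutes=360,  # the 6-hour maximum
    LaunchSpecification={
        "ImageId": "ami-0abcdef1234567890",  # placeholder AMI
        "InstanceType": "r5.2xlarge",
    },
)
for req in resp["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"])
```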

Summary

Enterprises can now leverage the Qubole cloud-native data platform's cost avoidance and TCO optimization features to hold data-driven cost discussions at the user, job, cluster, or cluster-instance level and measure the ROI of their big data projects.
