Last week, we welcomed two of our customers, Justin Wainwright, Systems Analyst at Oracle Data Cloud, and Rajit Saha, Director of Data Platform at LendingClub, to discuss how their organizations are optimizing the cloud costs of data analytics and machine learning in the current business climate.
During this panel discussion, Justin and Rajit shared their best practices for the short- and long-term, including:
- Top tips for optimizing data processing costs in the cloud
- Tactics for identifying and eliminating wasteful spend
- Best practices for implementing financial governance with an open data lake
If you implement even a quarter of Rajit and Justin’s best practices, you should start to see cost savings. Read on for a condensed, curated version of our conversation.
Total cost of ownership (TCO) has always been important for analytics workloads. Over the last year, and in particular with the recent changes in the macroeconomic environment, how critical has TCO become for your organization?
Rajit: In a public financial enterprise, data infrastructure falls under the IT cost center. So every dollar we spend should be monitored thoroughly and analyzed against its return on investment, whether it is OPEX or CAPEX. With any change in the macroeconomic environment, the pressure to manage TCO becomes even greater.
Justin: What we’ve noticed, especially with the COVID-19 pandemic…a lot of people are logging on first thing in the morning and queuing up a dozen or so jobs to run, and they’ll check on them later. The problem is that everybody is doing that at the same time, first thing in the morning, where we used to have a more sustained period of job submissions throughout the day.
Now we’re seeing more bursty periods: first thing in the morning, around noon, and again right before the end of the day. In some cases, people aren’t monitoring those jobs; they’re just letting them run. Normal workloads are still processing through our automated pipelines, and clusters that weren’t overcommitted before are now hitting their ceilings. So you make do with what you can, but the emphasis on monitoring, ad hoc reporting, and cost analysis, which wasn’t as strong before, is much more apparent now.
TIP 1: Visibility is required at both the business ROI level and the level of everyday usage behaviour.
How has the pandemic altered the way you approach TCO?
Justin: For us, it’s always about on-demand vs spot allocation. Our standard practice is to launch clusters with only 10% min nodes as on-demand for stability, and use spot for all other nodes if possible. If clusters are starting to use 25, 50, possibly even 80% on-demand instead, we need to figure out why.
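As a rough illustration of the check Justin describes, the sketch below flags clusters whose on-demand share has drifted well above a spot-heavy baseline. The `ClusterNodes` type, the cluster names, the node counts, and the 25% threshold (the lowest figure he mentions) are all hypothetical; in practice the counts would come from whatever usage reporting your platform exposes.

```python
from dataclasses import dataclass


@dataclass
class ClusterNodes:
    """Hypothetical snapshot of a cluster's node mix."""
    name: str
    on_demand_nodes: int
    spot_nodes: int


def on_demand_share(cluster: ClusterNodes) -> float:
    """Fraction of the cluster's nodes that are on-demand (0.0 if empty)."""
    total = cluster.on_demand_nodes + cluster.spot_nodes
    return cluster.on_demand_nodes / total if total else 0.0


def flag_on_demand_heavy(clusters, threshold=0.25):
    """Return (name, share) for clusters at or above the on-demand threshold."""
    return [
        (c.name, round(on_demand_share(c), 2))
        for c in clusters
        if on_demand_share(c) >= threshold
    ]


# Made-up sample data for illustration only.
clusters = [
    ClusterNodes("etl", on_demand_nodes=2, spot_nodes=18),
    ClusterNodes("adhoc", on_demand_nodes=10, spot_nodes=10),
    ClusterNodes("ml", on_demand_nodes=16, spot_nodes=4),
]
flagged = flag_on_demand_heavy(clusters)
print(flagged)
```

A flagged cluster is only a prompt to investigate: the cause might be spot capacity loss, a configuration change, or a workload that genuinely needs stable nodes.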
We’ve worked a lot with AWS over the past few months to better utilize Spot Fleet with capacity-optimized spot allocation, so that we get and keep spot nodes with greater success. From a usage standpoint, executor tuning is probably our biggest challenge. When a small analytics dev team copies code from one of our top 3 resource consumers into their cluster and that results in a 300% usage spike, we need to identify it sooner rather than later. We use the Usage UI to track cluster spikes over time and look for clusters that either stay at the max for the majority of their uptime or go through radical idle/busy spikes when a more sustained workload could be used instead.
Lastly, we track Qubole Compute Units (QCU) and compute costs over time and make sure that their rises and falls correlate. If your compute costs are going up while QCU is going down, odds are you are overprovisioned or should use different instances for those workloads.
TIP 2: Double down on your best practices for financial governance. Here: Increase spot savings percentage without compromising SLAs. Look for data lake platforms which can do that for you automatically.
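The QCU-versus-cost comparison can be sketched as a simple period-over-period check. This is a minimal example under stated assumptions: the two lists are hypothetical monthly totals, and `diverging_periods` is an illustrative helper, not part of any Qubole or AWS tool.

```python
def diverging_periods(qcu, cost):
    """Return indices of periods where cost rose while QCU fell,
    each compared with the immediately preceding period."""
    flagged = []
    for i in range(1, min(len(qcu), len(cost))):
        cost_rose = cost[i] > cost[i - 1]
        qcu_fell = qcu[i] < qcu[i - 1]
        if cost_rose and qcu_fell:
            flagged.append(i)
    return flagged


# Made-up monthly totals for illustration only.
monthly_qcu = [1000, 1100, 1050, 900]
monthly_cost = [5000.0, 5400.0, 5600.0, 5900.0]

print(diverging_periods(monthly_qcu, monthly_cost))  # [2, 3]
```

Months 2 and 3 are flagged because cost keeps climbing while QCU drops, which is the pattern Justin associates with overprovisioning or a poor instance choice.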
Rajit: The key factor is monitoring your usage. You have different clusters, Apache Hive, Apache Spark, Presto…to monitor all these clusters, you need a proper cost explorer, a cloud cost analyzer. So we actually created some models to find out where we are spending more money and which clusters are the money-hungry ones. Regular monitoring and building models on data gathered from AWS Cost Explorer and Qubole Cost Explorer is the top priority for us now.
We are also constantly asking ourselves the following:
- Do we need all the data we have in AWS S3? Can we change the storage class of certain sections of data?
- Do all clusters really need to function at this time?
- Can we scale down cluster sizing?
- Can we stop some resource-hungry applications?
- Can we off-board some users?
TIP 3: Have a running checklist and tools to get the right answers.
How are your teams taking advantage of cloud data platform facilities to reduce costs and improve efficiency?
Rajit: I’d say the top 5 ways we’re leveraging cloud data platform facilities are to:
- Isolate workloads in different environments, different clusters
- Isolate user workloads in different clusters
- Automatically stop clusters when they are not needed
- Leverage a high percentage of Spot nodes with heterogeneous nodes in clusters
- Use auto-elasticity of clusters
Justin: Working collaboratively in two, or more, cloud environments makes it really hard to be efficient. Even with careful planning, we still have a degree of redundant processes or data sets that live in two places, both with their share of compute costs. We’re trying to identify which data sets benefit multiple teams and get those projects prioritized for migration so other teams can plug into them.
We’re also a big company with many platforms and tools. Qubole is our main big data processing platform, but it’s constantly being integrated with other solutions across our multiple cloud environments, and we’re often re-evaluating how those integrations are done.
If we define financial governance as providing transparency into where costs are being incurred, along with safeguards against overspending, what are the different ways in which you are using financial governance to create value for your organization?
Justin: We’re using Qubole’s Cost Explorer Notebooks for a lot of ad hoc reporting as many of our legacy reporting tools have been discontinued. Dev teams have also found Cost Explorer useful to justify query costs vs project revenue. Personally, I track QCU/compute costs on a team-by-team basis relative to the overall org totals to see who is using the most and whether they are increasing/decreasing month-to-month.
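The team-by-team tracking Justin mentions can be approximated by computing each team's share of the org total and comparing shares across months. The team names and QCU figures below are invented, and the two helper functions are a hypothetical sketch rather than anything Cost Explorer provides.

```python
def team_shares(usage):
    """Map each team to its fraction of the org-wide QCU total."""
    total = sum(usage.values())
    if not total:
        return {}
    return {team: qcu / total for team, qcu in usage.items()}


def share_changes(previous, current):
    """Month-over-month change in each team's share of the org total.
    Teams absent from the previous month are treated as a 0.0 share."""
    prev_shares = team_shares(previous)
    curr_shares = team_shares(current)
    return {
        team: curr_shares[team] - prev_shares.get(team, 0.0)
        for team in curr_shares
    }


# Made-up QCU totals per team for two consecutive months.
march = {"analytics": 400, "ml": 400, "reporting": 200}
april = {"analytics": 600, "ml": 300, "reporting": 100}

changes = share_changes(march, april)
for team, delta in sorted(changes.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {delta:+.0%}")
```

Looking at shares rather than raw totals separates a team that is growing faster than the rest of the org from one that is simply riding an org-wide increase.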
Rajit: We’re using financial governance to create value for the organization through cost monitoring and optimization, ROI analysis of all our workloads, creating internal chargeback models, and reusing data in multiple environments. Qubole came up with the Cost Explorer feature, which shows how much QCU you are using. QCU is the unit that measures how much Qubole compute we are consuming.
Which cluster is using how much QCU, at what time, and the historical pattern: that data is already part of the Qubole platform. We can then use that data to build visualizations that answer questions like: How much QCU is each cluster using? How much QCU is each environment using? What is the distribution between spot nodes and on-demand nodes?
TIP 4: Leverage Cost Explorer, which can provide data with the correct attributes to drive financial conversations.
View the recording of the panel here.