Why Companies Stumble at Delivering Access to Big Data

October 30, 2018 | Updated April 6th, 2024
This blog is the second installment in a two-part series about self-service access to data. Read the first post here.

We all know that delivering self-service access to data is a problem with no end in sight: businesses will remain perpetually behind the speed at which data volumes, tools, and use cases multiply. Yet I see many companies fall into the same patterns when they try to address these challenges. As the demand for data skyrockets, enterprises continue to rapidly expand their investment in personnel and technology. While this approach to big data complexity seems sufficient at first glance, it lacks the agility and flexibility companies need to run their businesses successfully and prepare for future data demands.

An Insatiable Demand for Personnel

In an effort to get ahead of big data challenges, businesses are frequently resorting to hiring more people. Data teams are bringing on new personnel (or hiring third parties) to conduct penetration testing, manage the infrastructure, operate specialized technologies, or meet data demands from the broader organization. Unfortunately, this approach has two critical flaws — it’s become incredibly difficult to find the right talent, and hiring additional people acts as a band-aid to the predicament rather than a long-term solution.

Although the talent gap is affecting many areas of the business, it has become particularly acute for IT and data roles. By 2022, the demand for data science skills is expected to leave 2.7 million positions open, while the shortage of security experts is expected to reach nearly 2 million. Even if IT leaders want to build out their data, security, open-source software, and infrastructure teams, they will have trouble finding the right candidates to fill those roles.

The bigger issue with hiring your way out of data challenges is whether the cycle can be sustained in the long run. Adding headcount addresses a symptom (resource limitations) rather than the full scope of the problem: the complexity inherent to a big data infrastructure. What's more, the talent gaps that many data teams are experiencing make it impossible to match the speed at which big data is evolving. And the problem is only becoming more prevalent: 65 percent of CIOs say a lack of IT talent prevents their organization from keeping up with the pace of change.

A Deluge of Technology

Companies are eager to reap the rewards that big data tools offer. Yet with so many technologies built for a specific purpose and so few designed to tackle an entire system of processes, one size no longer fits all. In the realm of big data, each open-source engine offers distinct advantages for specific types of workloads. Your data analysts may rely on Presto, your data scientists on Apache Spark and notebooks, and your data engineers on Airflow. Unfortunately, running multiple engines can create further chaos for those managing the infrastructure. With three-quarters of businesses actively using multiple big data engines for their workloads, data teams need both the software expertise and the manpower to maintain and operate those tools within the broader technology stack.
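To make the multi-engine sprawl concrete, here is a minimal sketch of what a single scheduled pipeline can look like, assuming Airflow 2.x; the DAG name, scripts, and Presto query below are hypothetical, not a reference to any particular deployment. A Spark batch job prepares the day's tables, and a Presto statement then refreshes an aggregate that analysts query from their BI tool.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical daily pipeline that touches two engines: Spark for batch
# preparation and Presto for refreshing an analyst-facing aggregate.
with DAG(
    dag_id="multi_engine_pipeline",      # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Data engineers: Spark batch job prepares the day's tables.
    prepare_tables = BashOperator(
        task_id="spark_prepare_tables",
        bash_command="spark-submit /jobs/prepare_tables.py",  # hypothetical script
    )

    # Data analysts: Presto refreshes the aggregate behind a BI dashboard.
    refresh_aggregate = BashOperator(
        task_id="presto_refresh_aggregate",
        bash_command=(
            "presto --execute "
            '"INSERT INTO analytics.daily_summary '
            'SELECT dt, count(*) FROM events GROUP BY dt"'  # hypothetical query
        ),
    )

    prepare_tables >> refresh_aggregate
```

Each task on its own is simple, but every engine in the chain brings its own configuration, versioning, and access controls that someone on the data team has to own.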

Often, enterprises will rely on technology to fill their various talent and resource gaps — and end up with a messier, more confusing infrastructure to manage. For instance, your team may elect to deploy data governance or security tools if you lack the headcount or are unable to find the right talent. Or, perhaps you’ve invested in a Business Intelligence (BI) tool to provide easier data access for non-technical users. Having a range of tools theoretically enables you to optimize processes and explore new project areas, though these technologies come with a caveat — you must have a plan in place to derive business value from these additions.

In reality, many enterprises lack the ability to determine where and how a new technology investment will fit in before it has been deployed. The result is a clunky, ad hoc infrastructure that requires a sizeable team to manage the technology stack. Only 40 percent of big data administrators are able to support more than 25 users, which means the majority of businesses are devoting extensive manpower to provisioning licenses, securing sensitive data, and regulating access controls. Your infrastructure can quickly become a black hole for resources, impairing project success and delaying business-critical data initiatives.

Out-of-Control Costs

Businesses around the world are jump-starting or increasing their investment in big data technology: IDC forecasts worldwide revenue for big data and analytics solutions will reach $260 billion by 2022. Coupled with an expanding roster of data professionals, the focus on big data resources is inevitably leading to rising costs. Not only will you spend significant funds on your team’s big data resources, but you’ll also accumulate unexpected costs — particularly around cloud computing.

Unlike traditional hardware and software, modern tools and technologies use a more complex pricing model — and, as a result, costs can be exceedingly difficult to contain. Every cluster you spin up and every model you run uses compute power, but that’s just the beginning. Your costs can easily grow out of control when you factor in some of the likely scenarios you’ll encounter with a cloud infrastructure:

  • Costs can creep up on those who are new to the cloud, especially if you don’t track the extent of jobs being run or how many clusters are active
  • If you exceed your provisioned capacity, you must pay an additional fee
  • Without access to autoscaling capabilities, you may end up paying for clusters running with no active jobs
  • The large, bursty workloads of big data can cause drastic, unpredictable changes in compute power (which pushes costs higher and higher)

The larger and more complex your infrastructure, the more difficult tracking every line item becomes. Compute costs can balloon rapidly — such as from clusters that a team spun up but forgot to downscale after completing the required jobs — and can catch even seasoned companies by surprise.
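To make the "forgotten cluster" scenario concrete, here is a minimal sketch in Python that flags clusters with no active jobs and estimates the spend they quietly accumulate. The cluster records, hourly rates, and one-hour idle threshold are all hypothetical; in practice the metadata would come from your cloud provider's or data platform's API.

```python
from datetime import datetime, timedelta

# Hypothetical cluster metadata; in practice, pull this from your cloud
# provider's or data platform's API.
clusters = [
    {"id": "etl-prod", "active_jobs": 3,
     "last_job_finished": None, "hourly_cost": 12.40},
    {"id": "adhoc-analytics", "active_jobs": 0,
     "last_job_finished": datetime.now() - timedelta(hours=6), "hourly_cost": 8.75},
]

IDLE_THRESHOLD = timedelta(hours=1)  # assumption: an hour of idleness means "forgotten"

def idle_clusters(clusters):
    """Yield (cluster id, hours idle, estimated wasted spend) for idle clusters."""
    now = datetime.now()
    for cluster in clusters:
        if cluster["active_jobs"] == 0 and cluster["last_job_finished"] is not None:
            idle_for = now - cluster["last_job_finished"]
            if idle_for > IDLE_THRESHOLD:
                hours_idle = idle_for.total_seconds() / 3600
                yield cluster["id"], hours_idle, hours_idle * cluster["hourly_cost"]

for cluster_id, hours_idle, wasted in idle_clusters(clusters):
    print(f"{cluster_id}: idle for {hours_idle:.1f}h, roughly ${wasted:.2f} spent on no jobs")
```

Even a check this simple, run on a schedule, surfaces idle clusters before they turn into a line-item surprise at the end of the month.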

The Solution: Self-Service Access

I’ve seen firsthand just how ineffective it is to continuously throw tools and people at the dilemma of how to provide access to data. This trend hijacks your budget and resources, leaving your business far short of its desired data-driven state. Such a solution creates the perfect storm of an over-complicated infrastructure and manual, time-consuming data operations processes.

Nearly half of all companies (44 percent) now store 100 terabytes of data or more in their data lake, further increasing the complexity of connecting that data to the users who need it. To succeed in this data-driven world, businesses must provide data users with self-service access to stored data. In my next and final post of this series, I’ll discuss how an enterprise can make self-service a reality without sending costs skyrocketing or placing strain on the infrastructure.

Find out how Qubole can help you deliver self-service access to data users in this Introduction to Qubole.
