The Data On Big Data
Introducing the Qubole 2018 Big Data Activation Report
In our first-ever Big Data Activation Report, we analyze anonymous data from over 200 customers to provide insights into how companies are activating their big data. More specifically, we analyzed how companies are putting their data to use.
The amount of data in the world is growing and, according to IDC, will reach 44 zettabytes, or 44 trillion gigabytes, by 2020. Companies are increasingly looking to apply AI, machine learning and other advanced analytics to increase revenue, subscribers, engagement and more. The cloud is gaining traction as more and more companies realize the benefits of elasticity.
To better understand how companies are putting their data to use, we looked at our own customers’ data and asked questions like:
- What technology and open source engines are they using most and why?
- How fast are open source big data engines like Apache Hadoop/Hive, Apache Spark and Presto growing?
- How many users now have access to data?
- What impact are the cloud and automation having on efficiency and cost savings?
Our Big Data Activation Report answers these questions and more. The report goes in depth into our findings, such as:
76% of organizations are leveraging multiple big data engines
To handle multiple data types (structured, unstructured, etc.) and multiple use cases (BI analytics, machine learning, data preparation, etc.), companies are matching the strengths of each engine with their specific needs to become more efficient, work faster, and keep costs down.
Total compute hours across the three major engines grew 162% in 2017
With all types of data being collected, data-driven companies are doubling down on their efforts to build data-driven applications that generate revenue streams. Compute usage has grown across all three engines, in particular for Apache Spark, which grew 298%, and Presto, which grew 420%.
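To put those growth percentages in perspective, here is a minimal Python sketch that converts a percentage increase into a multiple of the starting level. It assumes each figure is an increase measured against the prior year's baseline (the report excerpt does not spell out the baseline), and the function name `growth_multiple` is ours, purely for illustration.

```python
def growth_multiple(percent_growth: float) -> float:
    """Convert a percentage increase into an end-to-start multiple.

    Assumption: the percentage is an increase over the prior period,
    so +100% means the value doubled (a 2x multiple).
    """
    return 1 + percent_growth / 100


# Growth figures quoted in the report summary above.
reported_growth = {
    "All three engines": 162,
    "Apache Spark": 298,
    "Presto": 420,
}

for engine, pct in reported_growth.items():
    multiple = growth_multiple(pct)
    print(f"{engine}: +{pct}% is roughly {multiple:.2f}x the prior level")
```

Read this way, 162% growth means compute hours ended the year at a bit more than two and a half times where they started, not 1.62 times.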
The number of users that ran commands doubled in the past year
Not only are we seeing better user-to-admin ratios, but the year-over-year increase in the number of users that ran commands is 255% for Presto, 171% for Apache Spark, and 136% for Apache Hadoop/Hive.
Companies are becoming more efficient by picking the right engine for the right job
For example, companies are increasingly turning to Presto to handle interactive analysis such as BI or data discovery. Customers in aggregate are running 24x more commands per hour in Presto than in Apache Spark and 6x more than in Apache Hadoop/Hive.
In 2017, 54% of all Amazon EC2 compute hours were spot instances
Not only is the cloud showing tangible benefits in elasticity and scalability, it’s also driving significant reductions in costs. In 2017, spot instance usage across all engines resulted in an estimated $230 million in savings.
Those are just a few of the findings from our inaugural report. Each, in its own way, illustrates why having the right technology will help you put your data to use and keep your business growing.
If you have any questions about the report, do not hesitate to email us at