What is Big Data Analytics?

Overview of Big Data Analytics

Big Data Analytics offers a nearly endless source of business and informational insight that can lead to operational improvements and new revenue opportunities for companies across almost every industry. With use cases ranging from customer personalization, risk mitigation, and fraud detection to internal operations analysis, and new ones arising almost daily, the value hidden in company data has companies looking to build cutting-edge analytics operations.

Discovering value within raw data poses many challenges for IT teams. Every company has different needs and different data assets. Business initiatives change quickly in an ever-accelerating marketplace, and keeping up with new directives can require agility and scalability. On top of that, a successful Big Data Analytics operation requires enormous computing resources, technological infrastructure, and highly skilled personnel.

All of these challenges can cause many operations to fail before they deliver value. In the past, a lack of computing power and limited access to automation put a true production-scale analytics operation beyond the reach of most companies: Big Data was too expensive and too much hassle, with no clear ROI. With the rise of cloud computing and new technologies in compute resource management, Big Data tools are more accessible than ever before.

Where did Big Data originate?

Big Data emerged from the early-2000s data boom, driven forward by many of the early internet and technology companies. Software and hardware capabilities could, for the first time in history, keep up with the massive amounts of unstructured information produced by consumers. New technologies like search engines, mobile devices, and industrial machines provided as much data as companies could handle—and the scale continues to grow.

In a study, the market intelligence firm IDC estimated that the global production of data would grow 10x between 2015 and 2020.

With the astronomical growth in collectible data, it soon became evident that traditional data technologies such as data warehouses and relational databases were not well-suited to operate with the influx of unstructured data. The early Big Data innovation projects were open-sourced under the Apache Software Foundation, with most significant contributions coming from the likes of Google, Yahoo, Facebook, IBM, academia, and others. Some of the most widely used engines are:

  • Apache Hive/Hadoop (developed at Yahoo!, Google, and Facebook) is the workhorse for complex ETL and data preparation that services information to many analytics environments or data stores for further analysis.
  • Apache Spark (developed at University of California, Berkeley) tends to be used with heavy compute jobs that are typically batch ETL and ML workloads, but is also used in conjunction with technologies such as Apache Kafka.
  • Presto (developed by Facebook) is a SQL engine that is lightning fast and reliable for reporting and ad-hoc analytics.

Typical Deployment

What is different with Big Data today?

As data grows exponentially, enterprises need to continuously scale their infrastructure to maximize the economic value of the data. In the early years of Big Data (roughly 2008), when Hadoop was first gaining recognition among larger enterprises, it was extremely expensive and inefficient to stand up a useful production system. Using Big Data also required the right people and software, as well as hardware that could handle the volume of data and the velocity of incoming queries. Aligning everything to operate in sync was an extremely daunting task, and it caused many Big Data projects to fail.

By 2013, the notion of the enterprise cloud for analytics was being popularized by Amazon Web Services (AWS), and a handful of other technology companies (VMware, Microsoft, and IBM) began to emerge with their own take on enterprise solutions for leveraging cloud computing. It wasn’t until AWS announced in 2015 that it had earned nearly $5 billion in revenue for the year that the world truly started to take notice.

The cloud has become a market-changer today, as businesses large and small can get near-instant access to infrastructure and advanced technologies with a few clicks. This allows the Data Admin and DevOps teams to be the enablers of the entire platform operation rather than a bottleneck. For each of the 4 V’s of big data, the cloud provides infrastructure that enables companies to grow beyond their existing systems:

  • Volume – information is growing, and the value of data has an expiration date. Cheap cloud storage enables companies to take on massive amounts of data without worrying about what is and isn’t valuable.
  • Variety – demand for analyzing unstructured data is growing, which is driving the need for different frameworks, such as Deep Learning, to process it. Ephemeral cloud computing servers allow companies to test different big data engines against the same data iteratively.
  • Velocity – complex analytics problems require several big data steps (e.g. Machine Learning is estimated to be ~80% ETL in compute resources), which companies using cloud computing can scale up or down according to demand.
  • Value – demand for AI-driven applications is pushing demand for modern big data architectures, which allow applications, storage, and compute resources each to be scaled out individually.

Big Data Analytics vs. Business Intelligence

Business Intelligence is often described as the first two stages, descriptive and diagnostic, of the four stages of big data analytics. BI is often hosted in a data warehouse, where data is highly structured, and it only explains “what, where, and how” something happened (for example: 10 of the same shoes were purchased from 3 different stores that ran the same promotion, while the other 2 stores sold no shoes). This data is often used in reporting and in gathering insights into popularity trends and interactions based on recent events.

Big Data Analytics takes this a step further, as the technology can access a variety of both structured and unstructured datasets (such as user behavior or images). Big data technologies can bring this data together with historical information to determine the probability that an event will happen based on past experience.
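To make the contrast concrete, here is a minimal Python sketch built on the shoe promotion example above. The per-store unit counts, the store names, and both function names are hypothetical, chosen only so that the totals match the example (10 shoes across 3 of 5 stores). The "prediction" is a naive frequency estimate standing in for a real model, not a recommended method.

```python
# Hypothetical records: (store, units_sold) for five stores running the same promotion.
PROMOTION_SALES = [
    ("store_a", 4),
    ("store_b", 3),
    ("store_c", 3),
    ("store_d", 0),
    ("store_e", 0),
]


def describe(sales):
    """BI-style descriptive summary: what happened, and where."""
    return {
        "stores": len(sales),
        "stores_with_sales": sum(1 for _, units in sales if units > 0),
        "total_units": sum(units for _, units in sales),
    }


def estimate_sale_probability(sales):
    """Naive predictive step: share of past promotion runs that sold anything.

    This frequency estimate is an illustrative stand-in for a trained model.
    """
    if not sales:
        raise ValueError("need at least one historical record")
    return sum(1 for _, units in sales if units > 0) / len(sales)


if __name__ == "__main__":
    print(describe(PROMOTION_SALES))
    print(estimate_sale_probability(PROMOTION_SALES))
```

The first function only reports on past events, which is the BI view. The second uses the same history to say something about a future promotion, which is the step Big Data Analytics adds, usually with far richer data and models.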

Why Do You Need Big Data in the Cloud Today?

The 4 V’s have been a well-known catalyst for the growth of Big Data analysis in the last decade. Moreover, we have entered a new era with new challenges, such as the “variety” of open source technologies, Machine Learning use cases, and rapid development across the big data ecosystem. These raise the question of how to keep up with ever-growing information while ensuring the effectiveness of advanced analytics in such a noisy environment.

Predictive and prescriptive analytics are still evolving, and they require modern infrastructure that traditional data warehouses can’t provide. A big data platform that gives teams appropriate self-service access to unstructured data enables companies to run more innovative data operations. The four stages of analytics are:

  1. Descriptive analytics (What Happened and When) – This is common in traditional Business Intelligence and reporting analytics.
  2. Diagnostic analytics (Where and How it Happened) – This takes Business Intelligence a step further, where the end user could be given a report or have a set of actions sent to them based on the results of the data.
  3. Predictive analytics (What Will Happen and How) – A model is applied to the data, and a decision or probability score is given based on historical events. This data can also be fed back into Business Intelligence systems to help with future decision making.
  4. Prescriptive analytics (What Should We Do) – Takes the predicted output of the data and places it into a practical application that makes recommendations or alerts end-users (such as with fraud detection or ecommerce shopping). This data usually needs to be put into a data mart that can feed out to an application in near-real time.
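The four stages can be sketched as a small Python pipeline over a fraud detection scenario. Everything here is an assumption made for illustration: the transaction amounts, the function names, the z-score used as a stand-in for a predictive model, and the alert threshold of 3.0.

```python
from statistics import mean, pstdev

# Hypothetical history of transaction amounts for one customer.
HISTORY = [20.0, 25.0, 22.0, 30.0, 23.0]


def descriptive(history):
    """Stage 1: summarize what happened."""
    return {"count": len(history), "mean": mean(history), "max": max(history)}


def diagnostic(history):
    """Stage 2: locate the record that deviates most from the average."""
    average = mean(history)
    return max(range(len(history)), key=lambda i: abs(history[i] - average))


def predictive(history, amount):
    """Stage 3: score a new transaction against history (z-score stand-in)."""
    spread = pstdev(history)
    if spread == 0:
        return 0.0
    return abs(amount - mean(history)) / spread


def prescriptive(score, threshold=3.0):
    """Stage 4: turn the score into an action for the application."""
    return "alert" if score > threshold else "approve"


if __name__ == "__main__":
    print(descriptive(HISTORY))
    print(diagnostic(HISTORY))
    print(prescriptive(predictive(HISTORY, 500.0)))
```

In a production setting each stage would typically run on a different engine and data store. The sketch only shows how the output of one stage feeds the next.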

Big data has become an essential requirement for enterprises looking to harness their business potential. Today both large and small businesses enjoy greater profitability and a competitive edge through the capture, management, and analysis of vast volumes of unstructured data. However, many organizations have realized they require a modern data architecture to get to the next level. This need has led to the emergence of data lakes.

According to Dr. Kirk Borne, Principal Data Scientist & Data Science Fellow, Booz Allen Hamilton:
“The biggest challenge of Big Data is not volume, but data complexity or data variety. Volume is not the problem because the storage is manageable. Big Data is bringing together all the diverse and distributed data sources that organizations have across many different sources of data. Data silos inhibit data teams from integrating multiple data sets that (when combined) could yield deep, actionable insights to create business value. That’s what a Big Data Lake can do.”


Use Cases (Data Science, ETL, interactive analytics, BI)

Qubole, a cloud-native Big Data activation platform, is useful at any scale because it separates storage from compute and provides managed autoscaling for Apache Hadoop, Apache Spark, Presto, and TensorFlow. The software automates cloud infrastructure provisioning, which saves data teams a great deal of time otherwise spent on administrative tasks such as cluster configuration and workload monitoring.

  • ETL – build and schedule pipelines for recurring data transformations
  • Data Science and Machine Learning – explore, develop and test models at scale before putting them into production
  • Interactive Analytics – enable data teams and less technical users to analyze less structured or raw data that otherwise can’t fit in a data warehouse
  • Data Visualization – format and present analytics for business insight and intuitive dashboards

[Figure: Cloud-Native Data Platform for AI, Machine Learning, and Analytics. Labels: Self-service Access, Multiple Use Cases, Financial Governance, Elastic Scale, Security, Data Prep and Ingestion, AI and Machine Learning, Data Cloud]

Qubole Big Data Analytics Capabilities

Right Tool, Right Job

  • Choose from a multitude of engines (Hive, Spark, Presto, TensorFlow).

Development Benches

  • Instantly start writing your code in a variety of interfaces and workbenches – Dashboards, Notebooks, command line, API, and more.

Performance and scale per workload

  • Managed autoscaling for Spark, Hadoop, and Presto workloads. Automated cluster management eases administration and enables self-service access for users through various interfaces.
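As a rough illustration of what an autoscaling decision involves, the Python sketch below sizes a cluster from the number of pending tasks. This is not Qubole's algorithm: the function name, its parameters, and the one-rule policy are assumptions for illustration, and a real autoscaler would also weigh factors such as node startup time and cost.

```python
import math


def desired_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return a node count that covers the pending work, clamped to cluster limits.

    Hypothetical policy: one node handles `tasks_per_node` tasks at a time.
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes cannot exceed max_nodes")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # Scale up under load, scale down to the floor when idle.
    print(desired_nodes(pending_tasks=95, tasks_per_node=10, min_nodes=2, max_nodes=20))
    print(desired_nodes(pending_tasks=0, tasks_per_node=10, min_nodes=2, max_nodes=20))
```

The minimum keeps a cluster responsive for interactive queries, while the maximum caps spend, which is where autoscaling meets financial governance.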

Big Data Ecosystem Integrations

  • Scheduler/ETL tools – Talend, Informatica, Oozie, Azkaban, Apache Airflow
  • BI software – Looker, Tableau, Apache Superset, Periscope, Qlik
  • Security and Governance – Apache Atlas, Apache Ranger, SSO integrations, Encryption in Motion and at Rest
