This post was originally published August 2014 and has since been updated.
Once the subject of speculation, big data analytics has emerged as a powerful tool that businesses can use to manage, mine, and monetize vast stores of unstructured data for competitive advantage. As a result, the rate of adoption of Hadoop big data analytics platforms by companies has increased dramatically.
In this rush to leverage big data, there has been a misconception that Hadoop is meant to replace the data warehouse, when in fact Hadoop was designed to complement traditional Relational Database Management Systems (RDBMS).
To clear up any confusion for those considering Hadoop, here’s a look at how adding Qubole’s cloud-based Hadoop service to your existing data architecture can be a winning combination for your business.
Processing structured data is something that your traditional database is already very good at. After all, structured data, by definition, is easy to enter, store, query, and analyze. It conforms nicely to a fixed schema model of neat columns and rows that can be manipulated by Structured Query Language (SQL) to establish relationships. As such, using Hadoop to process structured data would be comparable to running simple errands with a Formula One racecar. However, with the rise of big data, many of those simple errands have become quite complex, calling for a more powerful and streamlined solution than the data warehouse can offer.
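To make the "neat columns and rows" point concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a relational database. The `customers` and `orders` tables and their rows are hypothetical, made up purely for illustration.

```python
import sqlite3

# Hypothetical schema and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0)])

# A join relates the two tables through the fixed schema.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(totals)
```

Because every row fits the schema declared up front, a single SQL statement is enough to relate customers to their orders. That is the kind of errand a relational database already runs well.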
Storing, managing and analyzing massive volumes of semi-structured and unstructured data is what Hadoop was purpose-built to do. Unlike structured data, found within the tidy confines of records, spreadsheets and files, semi-structured and unstructured data is raw, complex, and pours in from multiple sources such as emails, text documents, videos, photos, social media posts, Twitter feeds, sensors and clickstreams.
If your organization is dealing with growing volumes of raw, complex data (and what company isn’t these days?), you’ve no doubt discovered that your traditional database just isn’t cut out for this workload. Traditional SQL databases are typically scaled vertically, that is, by adding horsepower to the hardware they run on. Because most data warehouses are built on specialized infrastructure, processing large batches of data can be very costly. The other problem with using your RDBMS for more complex big data workloads is that the raw data must first be put through a cleaning and structuring process called ETL (Extract, Transform, and Load) before it goes into the warehouse. This pre-processing can introduce errors. It also discards raw data that might later prove valuable, and it limits how the resulting “clean and simple” data can be queried.
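The cost of ETL is easier to see in a small sketch. The example below is not any particular ETL tool. It is a simplified illustration in plain Python, and the event fields and the `WAREHOUSE_COLUMNS` schema are assumptions chosen for the example.

```python
import json

# Hypothetical raw clickstream events; field names are made up for illustration.
raw_events = [
    '{"user": "u1", "page": "/pricing", "ms_on_page": 5400, "referrer": "search"}',
    '{"user": "u2", "page": "/pricing", "ms_on_page": 900, "user_agent": "mobile"}',
    'not valid json',
]

# The fixed schema chosen up front (an assumption for this sketch).
WAREHOUSE_COLUMNS = ("user", "page")

def etl(lines):
    """Extract, transform, and load: keep only the columns the schema allows."""
    loaded, rejected = [], []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)  # malformed input never reaches the warehouse
            continue
        # Fields outside the schema (ms_on_page, referrer, ...) are dropped here.
        loaded.append(tuple(event.get(col) for col in WAREHOUSE_COLUMNS))
    return loaded, rejected

loaded, rejected = etl(raw_events)
print(loaded)
print(rejected)
```

Once only `loaded` is kept, questions such as "how long did mobile users stay on the pricing page?" can no longer be answered, because the fields needed to answer them were discarded during the transform step.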
Hadoop as a Service provides a scalable solution to meet ever-increasing data storage and processing demands that the data warehouse can no longer handle. With elastic scale and on-demand access to compute and storage capacity, Hadoop as a Service is a strong match for big data processing. Using tools found within the Hadoop ecosystem, such as Pig, Spark, Presto and others, Hadoop as a Service will help you to obtain the deeper insights often hidden in unstructured data that can propel your business forward.
Running constant and predictable workloads is what your existing data warehouse has been all about. For structured data, which can be entered, stored, queried, and analyzed in a simple and straightforward manner, the data warehouse will continue to be a viable solution. But when it comes to handling massive volumes of unstructured data, the warehouse falls short.
Running fluctuating workloads to meet growing big data demands requires a scalable infrastructure that allows servers to be provisioned as needed. That’s where Qubole’s cloud-based Hadoop service comes in handy. With the ability to spin virtual servers up or down on demand within minutes, Hadoop in the cloud provides the flexible scalability you’ll need to handle fluctuating workloads.
Keeping costs down is a concern for every business in today’s ultra-competitive arena. And traditional relational databases are certainly cost effective for the structured workloads they were built for. If you are considering adding Hadoop to your data warehouse, it’s important to make sure that your company’s big data demands are genuine and that the potential benefits to be realized from implementing Hadoop will outweigh the costs. While on-premises Hadoop implementations save money by combining open source software with commodity servers, a cloud-based Hadoop platform can save you even more by eliminating the expense of physical servers and the space to house them. Hybrid systems, which integrate Qubole’s cloud-based Hadoop with traditional relational databases, are fast gaining popularity as cost-effective ways for companies to leverage the benefits of both platforms.
Running large distributed workloads that address every file in the database is something that Hadoop handles very well, but not very fast. And so the tradeoff with this type of processing is slower time-to-insight.
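Hadoop’s classic batch model is MapReduce, and a toy version helps show why this style of job is thorough but not quick. The sketch below runs in a single Python process. The `files` dictionary and its contents are hypothetical stand-ins for data blocks that would be spread across the nodes of a real cluster.

```python
from collections import defaultdict

# Hypothetical "files"; on a real cluster these would be blocks on many nodes.
files = {
    "part-0": "error timeout error",
    "part-1": "ok error ok",
}

def map_phase(text):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}

# Every file is read in full before any result is available.
mapped = [pair for text in files.values() for pair in map_phase(text)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

No answer is available until every input split has been mapped, shuffled, and reduced, so the time to a result grows with the volume of data scanned.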
Shorter time-to-insight calls for interactive querying of smaller data sets in real time or near real time. That’s a task the data warehouse has been well equipped to handle. However, thanks to a powerful processing engine called Spark, Hadoop (and in particular Qubole’s Hadoop as a Service) can handle both batch and streaming workloads at lightning-fast speeds. Spark is designed for advanced, real-time analytics and has the framework and tools to deliver when shorter time-to-insight is critical.
Spark on Hadoop supports operations such as SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. In addition, Spark enables these multiple capabilities to be brought together seamlessly into a single workflow. And because Spark is one hundred percent compatible with the Hadoop Distributed File System (HDFS), HBase, and any Hadoop storage system, virtually all of your organization’s existing data is instantly usable in Spark.
The combination of a traditional database with Qubole’s cloud-based Hadoop platform can be a powerful, cost-effective analytical tool for your business. Properly implemented, this hybridized data infrastructure allows companies to reap the benefits of both platforms by running small, highly interactive workloads in the data warehouse while using Hadoop to process very large and complex data sets to obtain deeper insights and drive competitive advantage.