A 5-Minute Guide to Big Data Tools
- By Jonathan Buckley
- September 17, 2015
Hadoop, Hive, Spark, Presto, Pig, NoSQL—these are words you’d expect to find in a whimsical Dr. Seuss tale. In fact, they are the names of powerful tools found in a world once thought to be just as nonsensical as any story Dr. Seuss could dream up—the world of Big Data.
For those who would like to get better acquainted with Big Data—in particular the devices that can help organizations extract value and competitive advantage from massive sets of structured and unstructured information—here’s a 5-minute guide to some prominent big data tools.
Once dismissed as having no practical application, Big Data is a force to be reckoned with today, all thanks to an open source software project called Hadoop. Named in true Dr. Seuss fashion after a toddler’s stuffed toy elephant, Hadoop has evolved into a foundational big data analytics tool. Running open source software on commodity hardware, Hadoop and its distributed file system (HDFS) facilitate the storage, management and rapid analysis of vast datasets across distributed clusters of servers. Several features make Hadoop an attractive big data processing powerhouse for organizations: the flexibility to handle multiple data formats, the scalability to accommodate very large workloads, and an affordability that lets organizations with modest budgets reap big data benefits.
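To make the idea of storing one dataset across a cluster of servers more concrete, here is a minimal Python sketch. It is a toy model, not Hadoop code. It relies on the fact that HDFS splits files into fixed-size blocks and keeps copies of each block on several machines, but the function names, the tiny block size, the node names, and the round-robin placement policy are all assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy for illustration only; real HDFS
    placement also takes racks and node state into account.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        # Start each block one node further along so copies spread evenly.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"0123456789" * 10  # 100 bytes standing in for a very large file
blocks = split_into_blocks(data, block_size=32)
layout = place_blocks(len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3)

for block_id, holders in layout.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Because every block lives on more than one node, losing a single commodity server in this model does not lose any data, which is part of what makes inexpensive hardware viable.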
Financial institutions are using Hadoop as a vital part of their security architecture. Armed with the ability to analyze massive data sets in real time, Hadoop helps banks detect phishing behaviors and fraudulent payments and take quick action to minimize their impact.
Traditional relational database management systems (RDBMS) have long been very effective at handling large sets of structured data. That’s because structured data conforms nicely to a fixed schema of neat columns and rows, which Structured Query Language (SQL) can query and relate to one another. Then big data came along, and the RDBMS alone could no longer meet the new data management needs. A new way to store, manage and query massive sets of messy unstructured data was needed. That solution is Not Only SQL (NoSQL), a family of database technologies, some of which, such as HBase, run on the Hadoop analytics platform. No longer bound by the confines of a fixed schema, NoSQL database solutions allow businesses to be more flexible and agile at storing, retrieving and analyzing very large volumes of disparate and complex data, and to do so at lightning-fast speeds.
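The difference between a fixed schema and a schema-free store can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite (from the standard library) stands in for a relational database, a plain list of JSON documents stands in for a NoSQL document store, and the `customers` table, its fields, and the `find` helper are hypothetical.

```python
import json
import sqlite3

# Relational side: the table's columns are fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers (id, name, city) VALUES (1, 'Ada', 'London')")

# A record carrying an unexpected field does not fit the fixed schema.
try:
    conn.execute(
        "INSERT INTO customers (id, name, loyalty_tier) VALUES (2, 'Grace', 'gold')"
    )
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# Document side: each record is a self-describing JSON document,
# so records in the same collection may carry different fields.
collection = [
    json.dumps({"id": 1, "name": "Ada", "city": "London"}),
    json.dumps({"id": 2, "name": "Grace", "loyalty_tier": "gold", "tags": ["vip"]}),
]


def find(docs, field, value):
    """Return decoded documents that contain `field` with exactly `value`."""
    decoded = (json.loads(d) for d in docs)
    return [d for d in decoded if field in d and d[field] == value]


gold_customers = find(collection, "loyalty_tier", "gold")
print(schema_rejected, gold_customers)
```

The relational insert fails until someone alters the table, while the document collection accepts the new shape immediately. The price is that the application, not the database, must cope with fields that may be missing.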
Companies such as Amazon, which must execute millions of financial transactions with their customers each and every day, rely on the speed and scalability that NoSQL database solutions on Hadoop can provide.
In its early days, Hadoop was an effective solution for organizations looking to store and manage mountainous volumes of data. However, analyzing that data for insights proved to be a problem best left to skilled data scientists, leaving business analysts in the dark. In answer to that problem, two Facebook data scientists created Apache “Hive” in 2008. Because SQL is widely used and commonly understood among data engineers, Hive was designed to automatically translate SQL-like queries, written in a language called HiveQL, into MapReduce jobs on Hadoop. As a result, Hive transformed Hadoop by placing serious analytics power in the hands of key decision makers within organizations who didn’t have PhDs in data science. In cases where summarizing, querying, and analyzing large sets of structured data is not time-sensitive, Hive is more than up to the task.
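The kind of translation Hive performs can be sketched in plain Python. This is a conceptual illustration, not Hive’s actual query planner: the query in the comment, the `page_views` data, and the three phase functions are hypothetical, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict

# Hypothetical query, written in the style of HiveQL:
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;

page_views = [
    {"user": "u1", "country": "US"},
    {"user": "u2", "country": "DE"},
    {"user": "u3", "country": "US"},
    {"user": "u4", "country": "IN"},
    {"user": "u5", "country": "DE"},
    {"user": "u6", "country": "US"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into one aggregate; here, COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle_phase(map_phase(page_views)))
print(counts)
```

The analyst writes only the one-line query; the map, shuffle, and reduce steps are what a MapReduce job has to do on their behalf.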
The folks at Facebook, which helped to develop Hive, utilize this simple and scalable big data tool to help meet the company’s formidable reporting needs.
Like many tools, Hive comes with a tradeoff, in that its ease of use and scalability come at the expense of speed. And that’s where “Spark” enters the picture. The brainchild of UC Berkeley’s AMPLab, open source Apache Spark is a powerful Hadoop processing engine that can handle both batch and streaming workloads at lightning-fast speeds.
Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. Plus, Spark enables these multiple capabilities to be combined seamlessly into a single workflow. And since Spark is fully compatible with Hadoop’s Distributed File System (HDFS), HBase, and any Hadoop storage system, virtually all of an organization’s existing data is instantly usable in Spark.
Iterative machine learning algorithms and interactive analytics are the primary use cases for Spark.
Created by Facebook engineers, “Presto” is a massively scalable, open source, distributed SQL query engine that enables fast interactive analysis of very large data sets. Processing queries entirely in memory, Presto can run simple queries on Hadoop in a few milliseconds, while more complex queries take only a few minutes. Shown to be more than seven times more efficient on the CPU than Hive, Presto can combine data from multiple sources in a single query, thus providing analytics across an entire organization.
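The idea of one query spanning several data sources can be illustrated with a small Python sketch. This is an analogy only and does not use Presto: two separate in-memory SQLite databases stand in for two independent sources, and the `orders` and `customers` tables are hypothetical.

```python
import sqlite3

# Open the first source and attach a second, separate database under
# the name "crm". Both are in-memory stand-ins for independent systems.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

# Source 1: an order store.
conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (2, 35.5), (1, 4.5)],
)

# Source 2: a customer directory living in the attached database.
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# A single SQL statement joins data across both sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

A distributed engine applies the same principle at far larger scale, with the sources being whole storage systems rather than two small databases in one process.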
Running interactive analytic queries on data sources ranging from gigabytes to hundreds of petabytes is a main use case for Presto—a tool that has transformed the Hadoop ecosystem.
As big data gets bigger and technology continues to advance, more tools with Dr. Seuss-sounding names will no doubt be developed to meet future big data demands.