Cassandra vs. Hadoop: A Comparative Look
- By Ari Amster
- January 28, 2016
Technology is reshaping our world. The proliferation of mobile devices, the explosion of social media, and the rapid growth of cloud computing have given rise to a perfect storm that is flooding the world with data. The challenge for enterprises is that, according to Gartner estimates, 80 percent of this “big data” is unstructured, and it’s growing at twice the rate of structured data.
In light of this exponential growth of chaotic data, there has never been a greater need for data solutions that go beyond what traditional relational databases can offer. That’s where the open source big data analytics platform Apache Hadoop and the NoSQL database Apache Cassandra enter the picture.
What follows is a brief comparison of Hadoop and Cassandra, along with a look at how these two solutions can complement each other to deliver powerful big data insights.
What is Hadoop?
A project of the Apache Software Foundation, Hadoop is an open source big data processing platform that combines a distributed file system (HDFS) with a programming framework known as MapReduce to store, manage and analyze very large sets of unstructured data in parallel across distributed clusters of commodity servers. With Hadoop, both HDFS and the MapReduce framework run on the same set of nodes. This allows the Hadoop framework to schedule compute tasks on the nodes where the data is already stored, rather than moving the data to the computation. As a result, Hadoop is best suited for running near-time and batch-oriented analytics on vast lakes of “cold” (that is, historical) data in multiple formats, in a reliable and fault-tolerant manner.
While MapReduce is a strong and reliable data processing tool, its main drawback is a lack of speed. That’s to be expected, as most MapReduce jobs are long-running batch jobs that can take minutes, hours or even longer to complete. Clearly, the growing demands and aspirations of big data call for faster time to insight, which MapReduce’s batch workloads aren’t designed to deliver.
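To make the MapReduce model more concrete, here is a minimal single-process sketch in Python of its three stages: map, shuffle and reduce. It is an illustration only, not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are made up for this example. In a real cluster, the map and reduce steps would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical "cold" log lines, used only for illustration.
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(run_job(logs))
```

Because each record is mapped independently and each key is reduced independently, the work can be split across many machines. That is what lets the approach scale, and it is also why a job has to wait for whole batches to finish.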
What is Cassandra?
Fundamentally, Cassandra is a distributed NoSQL database designed to manage vast amounts of structured data across an array of commodity servers. Its architecture delivers high distribution and linear-scale performance, and it can handle large amounts of data while providing continuous availability and uptime to thousands of concurrent users. Unlike Hadoop, which is typically deployed in a single location, Cassandra can be deployed across countries and continents. In addition, Cassandra is designed to be always on, and it delivers very consistent performance in a fault-tolerant environment. This makes Cassandra ideal for online workloads of a transactional nature, which involve large numbers of interactions and concurrent traffic, with each interaction yielding a small amount of data.
In contrast to Hadoop, which can accept and store data in any format (structured, unstructured, semi-structured, images and so on), Cassandra requires a defined structure. As a result, a Cassandra data model demands considerably more up-front design than a Hadoop deployment before it can be successfully implemented at scale.
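One way to picture how a database like Cassandra spreads data over many servers is a token ring: each row's partition key is hashed to a token, and the row is stored on the node that owns that token plus the next few nodes around the ring. The Python sketch below is a simplified illustration under several assumptions. The `Ring` class and the node names are hypothetical, each node owns a single token, and MD5 stands in for Cassandra's own partitioner hash.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: one token per node, replicas placed clockwise."""

    def __init__(self, nodes, replication_factor=3):
        # Never ask for more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by token so the list can be walked like a ring.
        self.tokens = sorted((self._token(node), node) for node in nodes)

    @staticmethod
    def _token(value):
        # Assumption: MD5 is used here only as a convenient stable hash.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, partition_key):
        """Return the nodes that would hold a copy of this partition."""
        keys = [token for token, _ in self.tokens]
        # First node clockwise from the key's token, wrapping around the ring.
        start = bisect.bisect_right(keys, self._token(partition_key)) % len(keys)
        return [
            self.tokens[(start + i) % len(keys)][1]
            for i in range(self.replication_factor)
        ]


if __name__ == "__main__":
    # Hypothetical node names spread over several regions.
    ring = Ring(["us-east-1", "us-west-1", "eu-west-1", "ap-south-1"])
    print(ring.replicas("user:42"))
```

Because every key maps to several nodes, a read or write can still succeed when one of them is down, which is the basis for the continuous availability described above.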
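The up-front thinking mentioned above usually means designing each table around the query it must answer. The Python sketch below mimics that idea with a toy table in which a partition key (`sensor_id`) selects a group of rows, and a clustering column (`timestamp`) keeps them sorted inside that group. The `ReadingsBySensor` class and its field names are hypothetical. The sketch shows the modeling idea only and says nothing about how Cassandra stores data internally.

```python
import bisect
from collections import defaultdict


class ReadingsBySensor:
    """Toy table: partition key = sensor_id, clustering column = timestamp."""

    def __init__(self):
        # One sorted list of (timestamp, value) rows per partition key.
        self._partitions = defaultdict(list)

    def insert(self, sensor_id, timestamp, value):
        # Keep rows ordered by timestamp within the partition.
        bisect.insort(self._partitions[sensor_id], (timestamp, value))

    def read(self, sensor_id, start, end):
        """Return rows for one sensor with start <= timestamp < end."""
        rows = self._partitions.get(sensor_id, [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi]


if __name__ == "__main__":
    table = ReadingsBySensor()
    table.insert("s1", 30, 21.5)
    table.insert("s1", 10, 20.0)
    table.insert("s2", 20, 18.2)
    print(table.read("s1", 0, 100))
```

The table answers "readings for one sensor in a time range" efficiently, but it offers no way to ask for "all readings above a given value" without scanning every partition. Supporting that second query would typically mean designing a second table, and this is the kind of planning the data model requires.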
How Does Cassandra Compare to HBase?
HBase is a distributed NoSQL database that is part of the Apache Hadoop ecosystem. It runs on top of the Hadoop Distributed File System (HDFS). HBase is designed for data lake use cases and is not typically used for web and mobile applications. Cassandra, by contrast, offers the availability and performance necessary for developing always-on applications.
Combining Cassandra and Hadoop
Today’s organizations have two data needs: a database devoted to online operations and the analysis of “hot” data generated by web, mobile and IoT applications, and a batch-oriented big data platform that supports the processing of vast amounts of “cold”, unstructured historical data. By tightly integrating Cassandra and Hadoop, organizations can serve both needs.
While Cassandra works very well as a highly fault-tolerant backend for online systems, it is not as analytics-friendly as Hadoop. Deploying Hadoop on top of Cassandra makes it possible to analyze data in Cassandra without first moving that data into Hadoop and HDFS, which is a complicated and time-consuming process. Hadoop on Cassandra thus gives organizations a convenient way to get specific operational analytics and reporting, in a real-time fashion, from relatively large amounts of data residing in Cassandra. Armed with faster and deeper big data insights, organizations that leverage both Hadoop and Cassandra can better meet the needs of their customers and gain a stronger edge over their competitors.