Is Your Big Data Initiative Scalable?
- By Ari Amster
- April 14, 2016
The benefits of big data in the enterprise are no longer in question. Thanks to Hadoop, organizations both large and small are finding real value in capturing, storing, and analyzing large volumes of unstructured data. However, as data volumes continue to rise at exponential rates, organizations looking to stay profitable and competitive must be able to meet ever-increasing data storage and processing demands, which means their big data initiatives must be easily and fully scalable.
If you are an IT leader of an organization looking to implement a big data strategy, here is why every element of your big data initiative must be able to scale to meet your company's current and future data needs.
Scaling Your Infrastructure
Unlike a traditional monolithic RDBMS, which can only scale vertically, Hadoop’s horizontal scalability is of real benefit to organizations with large data storage, management, and analytics needs. However, Hadoop’s ability to scale in a physical environment is limited by the number of commodity servers at hand. And adding more physical servers can be time-consuming and costly.
Hadoop in the cloud offers vastly superior scalability to on-premises Hadoop. But the promise of elastic, unlimited scalability that many cloud-based Hadoop vendors make comes at a price: scaling Hadoop workloads in the cloud can be difficult, often placing extra burdens on the user. Many of these workloads are bursty by nature, requiring users to manually spin clusters up and down to meet changing load demand, and to size them correctly in the process. Having to constantly monitor node utilization cuts into the benefits of cloud scalability and compromises efficiency.
Qubole’s auto-scaling Hadoop technology allows users to run Hadoop workloads without having to worry about cluster management and scaling. And with the ability to easily and rapidly scale resources up or down according to data demands, you and your IT department will have all of the resources you need, while your organization pays only for the resources that are used.
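Qubole's actual auto-scaling policies are proprietary, but the general idea of sizing a cluster to queued work, clamped between configured minimum and maximum node counts, can be sketched roughly like this (the function name, parameters, and limits below are illustrative assumptions, not Qubole's API):

```python
# Illustrative sketch only: not Qubole's actual auto-scaling implementation.
# The idea is to compute the node count the queued work calls for, then clamp
# it to the cluster's configured size limits so costs stay bounded.

def target_node_count(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return the node count needed for the queued work, clamped to limits."""
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

# A burst of 500 queued tasks at 8 tasks per node would call for 63 nodes,
# but the policy caps the cluster at its configured maximum of 20.
print(target_node_count(500, 8, min_nodes=2, max_nodes=20))  # prints 20

# When the queue drains, the cluster shrinks back to its minimum size,
# so the organization pays only for resources that are actually in use.
print(target_node_count(0, 8, min_nodes=2, max_nodes=20))  # prints 2
```

A policy along these lines is what removes the need for users to watch node utilization themselves: the burst is absorbed automatically, and the downscale happens without manual intervention.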
Scaling Your Analytics Team
Beyond being a potentially costly and time-consuming process, scaling Hadoop in a physical environment also increases the need for a structure and culture within the IT department that lets the data science team work together effectively. And while that is a good thing for any organization, things become challenging when highly specialized talent must be added to maintain the expanded system. As a result, scaling the analytics team to meet new data demands can be a slow and costly process in and of itself, and one that does not necessarily lead to greater collaboration among team members.
True collaboration among IT teams and business leaders across the organization—the type of collaboration that results in actionable insights—can only happen when all stakeholders within the organization have access to the data. Fortunately, a number of cloud-based Hadoop vendors offer tools that can help to improve data access and collaboration. However, many of these solutions still don’t scale very well, as they require a fair amount of administrative support in order for users to gain access to the data they need. That’s a problem because the more users who can access the platform without administrative support, the more efficiently an organization will be able to meet its true business objectives.
A better answer to the problem is a vendor with a self-service model driven by policy-based automation. Qubole offers a viable solution for scaling your analytics team: for every administrator, 21 users directly access data through Qubole's self-service platform.
Scaling Your Costs
The ability to effectively manage costs is a critical element of any big data initiative. And that's where scalability can make all the difference.
For on-premises Hadoop deployments that are hit with bigger data storage and processing demands—forcing them to add additional commodity or proprietary hardware and IT professionals to meet those demands—costs can be very difficult to manage. And cost overruns have doomed many a big data project.
Cloud-based Hadoop solutions can help organizations better manage costs by completely eliminating the expense of adding more hardware, and by reducing the number of additional data engineers that need to be hired. But as mentioned previously, many of these implementations require user input to spin clusters up or down depending on demand. This can prove costly during bursty workloads, when users may inadvertently spin up more clusters than the job actually needs. Additional compute costs can also be incurred during the more challenging downscaling process, or when clusters that are no longer needed aren't decommissioned quickly and efficiently.
Qubole's auto-scaling and cluster lifecycle management technologies are very effective at keeping compute costs manageable. That's because auto-scaling ensures that just the right amount of infrastructure, no more and no less, is always on hand to handle the workload. And cluster lifecycle management persists results and logs before terminating clusters automatically, so users do not need to constantly monitor cluster utilization. Many organizations that use auto-scaling see dramatically faster query execution, accompanied by an equally impressive decrease in costs.
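The shutdown sequence that cluster lifecycle management performs, persisting results and logs first and only then terminating the idle cluster, can be sketched as follows. This is a toy illustration of the pattern the article describes, not Qubole's implementation; the class, function, and timeout value are all assumed names for the sake of the example:

```python
# Illustrative sketch only: not Qubole's actual lifecycle management code.
# It shows the ordering the article describes: archive state, then terminate.

import time

IDLE_TIMEOUT_SECS = 30 * 60  # assumed idle window before automatic shutdown

class Cluster:
    def __init__(self, name):
        self.name = name
        self.last_active = time.time()  # updated whenever a job runs
        self.running = True

    def idle_for(self):
        return time.time() - self.last_active

def retire_if_idle(cluster, archive):
    """Archive results and logs, then shut down a cluster idle past the timeout."""
    if cluster.running and cluster.idle_for() >= IDLE_TIMEOUT_SECS:
        archive.append((cluster.name, "results"))  # persist query results first
        archive.append((cluster.name, "logs"))     # then persist job logs
        cluster.running = False                    # only then terminate nodes
    return cluster.running
```

Because the archiving happens before termination, nothing is lost when the cluster disappears, which is what frees users from babysitting cluster utilization.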
Full scalability in a big data initiative also delivers another important benefit: flexibility. Fully elastic, unlimited scalability, coupled with the auto-scaling technology of Qubole's cloud-based Hadoop solution, allows you and your IT team to choose from a wide variety of instance types to match your stack to your workload.
Big data storage and processing demands are rapidly rising, with no end in sight. Going forward, IT leaders must make certain that their big data initiatives are fully scalable in order to successfully meet their organization’s current and future data needs.