In 2009 I first started playing around with Hive and EC2/S3. I was blown away by the potential of the cloud. But it bothered me that the burden of sizing the cluster was put on the user. How would an analyst know how many machines were required for a given query or job? To make matters worse – one even had to decide whether to add map-reduce or HDFS nodes. Within a single user session – different queries required different numbers of machines. What if I ran multiple queries at once? And finally – I was paying by the CPU hour – so I had to remember to de-provision these machines.
This wasn’t a problem for human beings to solve – computers should take care of managing computers. Three years have passed since then. When we started Qubole we found that nothing had really changed. The state of the art still involved manually spinning clusters up and down and sizing them right. And if you wanted to be cost-conscious – remember to spin them down at exactly the right time. Simplicity is one of the core values at Qubole. On-Demand Hadoop clusters are a core offering and we had to solve this problem for our customers.
Consolidating the thoughts in the previous sections, we came up with a specification roughly as follows:
- Hadoop clusters should come up automatically when applications that require them are launched. Many Hadoop applications – for example, the creation of a table by Hive – do not require a Hadoop cluster running at all times.
- If a Cluster is already running on behalf of a customer – then, new Hadoop applications should automatically discover the running cluster and run jobs against it.
- If the load on the cluster is high – then the cluster should automatically expand. However – if the cluster is not doing anything useful – then, by the same token, nodes should be removed.
- Nodes should not be removed unless they approach the hour boundary:
  - We pay for CPU by the hour. It makes no sense to release CPU resources that are already paid for.
  - If the user were to return after a short coffee break – they would be better off using a running cluster rather than waiting for a new one to spin up. So not releasing nodes early makes for a good user experience.
- It is quite likely that we want different auto-scaling policies based on the job. An ad-hoc query from an analyst may require quickly expanding the cluster – unlike a batch job that runs late in the night.
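Since billing was per instance-hour, the hour-boundary rule above reduces to a simple clock check. Here is a minimal sketch of such a policy; the five-minute grace window and the function name are illustrative assumptions, not Qubole's actual values:

```python
from datetime import datetime, timedelta

BILLING_PERIOD = timedelta(hours=1)  # EC2 billed per instance-hour at the time
GRACE_WINDOW = timedelta(minutes=5)  # assumed window just before the boundary

def is_removal_candidate(launch_time: datetime, now: datetime) -> bool:
    """A node is only worth releasing shortly before it crosses into
    the next (not yet paid for) billing hour."""
    uptime = now - launch_time
    into_hour = uptime.total_seconds() % BILLING_PERIOD.total_seconds()
    remaining = BILLING_PERIOD.total_seconds() - into_hour
    return remaining <= GRACE_WINDOW.total_seconds()
```

A node launched at 12:00 would only become a candidate around 12:55, 1:55, and so on; releasing it any earlier throws away capacity that is already paid for.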
Now the only question was – how to orchestrate all this? Auto-scaling for Hadoop is a good bit more complicated than auto-scaling for webserver-type workloads:
- CPU utilization is not necessarily a good indicator of the utilization of a Hadoop node. A fully utilized cluster may not be CPU-bound. Conversely, a cluster doing a lot of network IO may be fully utilized without showing high CPU utilization. Coming up with fixed utilization criteria is hard because resource usage depends on the workload.
- The current cluster load is a poor predictor of the future load. Unlike web workloads – which move up and down relatively smoothly – Hadoop workloads are very bursty. A cluster may be CPU maxed out for a couple of minutes and then immediately become idle. Any Hadoop auto-scaling technology has to take into account anticipated future load – not just the current one.
- Hadoop nodes cannot be removed from the cluster even if idle. An idle MapReduce slave node may hold data that are required by reducers. Similarly – removing HDFS data nodes can be risky without decommissioning them first. We can’t afford to lose all copies of some datasets that may be required. Working with these constraints required access to Hadoop’s internal data structures.
Fortunately, we have had extensive experience with Hadoop internals during our days at Facebook. The Facebook installation was the most advanced in terms of pulling in new functionality available in the latest open-source Hadoop repositories and testing, fixing, and deploying those features at a ridiculous scale.
One of the features we had pulled in was a rewrite of Hadoop’s speculative execution, and we had made extensive fixes and improvements to get this feature to work. This gave us a deep understanding of the statistics collected by the JobTracker (JT) on the status of the jobs and tasks running inside a cluster. It turns out that this information is exactly what is needed to make auto-scaling work for Hadoop. By looking at the jobs and tasks pending in the cluster and by analyzing their progress rates so far – we can anticipate the future load on the cluster.
We can also track future load for each job type (ad-hoc vs. batch query, for example). Based on all this information – we can automatically add or delete nodes in the cluster. We also needed cluster management software to start the cluster and to add and delete nodes. After investigating various alternatives – we decided on StarCluster as our starting point.
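As a rough illustration of the idea – not the actual JT code – remaining work can be estimated from pending task counts plus the observed progress rates of running tasks. The function name, parameters, and the linear-progress assumption below are all ours:

```python
def estimated_remaining_seconds(pending_tasks, running_tasks,
                                avg_task_seconds, slots):
    """Estimate how long the current workload needs to finish.

    pending_tasks:    number of tasks not yet started
    running_tasks:    list of (progress, elapsed_seconds) pairs
    avg_task_seconds: historical mean task duration
    slots:            task slots currently available in the cluster
    """
    # Extrapolate each running task linearly from its progress so far.
    running_left = sum(
        elapsed * (1.0 - progress) / progress if progress > 0 else avg_task_seconds
        for progress, elapsed in running_tasks
    )
    # Assume pending tasks take roughly the historical average.
    total_work = running_left + pending_tasks * avg_task_seconds
    return total_work / max(slots, 1)
```

If an estimate like this stays above a threshold despite all slots being busy, the cluster expands; if it collapses toward zero, nodes become removal candidates at their hour boundaries.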
StarCluster is open-source cluster management software that builds on the excellent Python Boto library. StarCluster is fairly simple to understand and extend – and yet very powerful. Nodes and other entities are modeled as objects within this software, nodes can be added and deleted on the fly, and – thanks to Boto – different types of instances can be provisioned, spot or regular. We customized this software heavily to fit our requirements.
Based on these observations, we enhanced the Hadoop JobTracker in the following way:
- Each node in the cluster reports its launch time, so the JT can keep track of how long each node has been running.
- The JT continuously monitors the pending work in the system and computes the amount of time required to finish the remaining workload. If the remaining time required exceeds some pre-configured threshold (say 2 minutes) and there is sufficient parallelism in the workload – then the JT will add more nodes to the cluster using the StarCluster interfaces.
- Similarly – whenever a node approaches its hour boundary – the JT checks whether there is enough work in the system to justify keeping the node running. If not, the node is deleted. However:
  - Nodes containing task outputs that are required by currently running jobs are not deleted (even if they are otherwise not needed).
  - Prior to removing a node that is running the DataNode daemon, it is decommissioned from HDFS. See below.
  - The cluster will never shrink below its minimum size.
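The two checks above can be sketched roughly as follows; the threshold, minimum size, and function names are illustrative stand-ins, not the actual JT patch:

```python
REMAINING_TIME_THRESHOLD = 120  # the "2 minutes" mentioned above, in seconds
MIN_CLUSTER_SIZE = 4            # assumed configured minimum

def should_add_nodes(remaining_seconds, runnable_tasks, free_slots):
    # Expand only if the backlog is long enough AND there is more
    # runnable work than free slots (extra nodes would actually be used).
    return (remaining_seconds > REMAINING_TIME_THRESHOLD
            and runnable_tasks > free_slots)

def may_remove_node(holds_live_task_output, cluster_size, has_enough_work):
    # Evaluated as each node approaches its hour boundary.
    if cluster_size <= MIN_CLUSTER_SIZE:
        return False  # never shrink below the minimum size
    if holds_live_task_output:
        return False  # its outputs may still be needed by running jobs
    return not has_enough_work
```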
Discovering and attaching to clusters was relatively easy. Using Hive as a prototypical map-reduce application – we identified a few key points in the code where a cluster (either map-reduce or HDFS) was required.
Hive starts off with sentinel values for the cluster endpoints and, at these key points in the code, queries StarCluster to either discover a running cluster or start a new one. Finally, we run background processes to continuously monitor customer workloads. If no queries/sessions are active – then the cluster is removed altogether.
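The discover-or-start pattern can be sketched as follows. `discover`, `start`, and `ClusterHandle` are illustrative names of our own, not real StarCluster or Hive APIs:

```python
SENTINEL = None  # Hive begins with no cluster endpoint

class ClusterHandle:
    def __init__(self, discover, start):
        self._endpoint = SENTINEL
        self._discover = discover  # returns a running cluster's endpoint, or None
        self._start = start        # boots a new cluster and returns its endpoint

    def endpoint(self):
        """Resolve the endpoint lazily, only when Hadoop is actually needed."""
        if self._endpoint is SENTINEL:
            self._endpoint = self._discover() or self._start()
        return self._endpoint
```

Operations that never reach these code points – such as creating a Hive table – therefore never pay for a cluster launch.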
One of the trickiest parts of the setup was coming up with a scheme to have all nodes run as DataNodes.
To begin with – we only ran DataNode daemons on the minimum/core set of nodes. Our primary use of HDFS is as a cache for data stored in S3 (we will talk about our caching technology in a future post). We quickly found out that even a moderately large cluster would overwhelm a small number of DataNodes. So ideally we should run all nodes as DataNodes. But deleting DataNodes is tricky – it may take a long time to decommission nodes from HDFS – a process that primarily consists of replicating the data resident on a DataNode elsewhere.
Given that most of the data in our HDFS instances was cached data – paying this penalty didn’t make sense. Realizing that cached data should simply be discarded during the decommissioning process – we modified the Hadoop NameNode to delete all files that were part of the cache and had blocks belonging to the nodes being decommissioned. (A big thanks to Dhruba for walking us through the decommissioning and inode deletion code!) This has made it possible for us to use all cluster nodes as DataNodes and yet be able to remove them quickly. Our cache performance improved manifold once we made this change.
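Conceptually, the NameNode change amounts to splitting the files resident on a departing node into two groups. This is a simplified sketch in Python rather than the actual Java NameNode code, and the data layout is an assumption of ours:

```python
def prepare_decommission(node, files):
    """files: dict of path -> {'cached': bool, 'nodes': set of nodes with blocks}.

    Returns (paths_to_delete, paths_to_replicate): cache files touching the
    node are simply dropped (they can be refetched from S3), while real data
    must be re-replicated before the node can leave.
    """
    to_delete, to_replicate = [], []
    for path, info in files.items():
        if node not in info['nodes']:
            continue  # node holds no blocks of this file
        if info['cached']:
            to_delete.append(path)
        else:
            to_replicate.append(path)
    return to_delete, to_replicate
```

Dropping the cached group is what turns a slow, replication-bound decommission into a fast one.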
Fast Cluster Startup
Waiting for a cluster to launch is painful. Since it’s something we do frequently as part of our day job – we wanted to make this as fast as possible – for our own sake (never mind the users). Some points are worth mentioning here:
- Our clusters are launched using fully configured AMIs. Avoiding software installation during startup minimizes boot time.
- We started off using instance-store AMIs – but quickly switched to EBS images, which boot substantially faster.
- The first versions of our cluster management software booted the master and then the slave nodes (because the slaves depended on the master IP address). Subsequently, we figured out a way to boot them in parallel – and this also led to significant savings.
- Finally, we looked closely at Linux boot latency, eliminated some unnecessary services, and parallelized the initialization of some daemons.
All these things resulted in fast (~90s) and predictable cluster launch times.
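The parallel boot mentioned above can be sketched like this; `boot_master` and `boot_slave` are placeholders for the actual provisioning calls, and wiring the master address into the slaves after boot is the change that makes the parallelism possible:

```python
from concurrent.futures import ThreadPoolExecutor

def launch_cluster(boot_master, boot_slave, num_slaves):
    # Boot the master and all slaves concurrently instead of sequentially.
    with ThreadPoolExecutor(max_workers=num_slaves + 1) as pool:
        master_future = pool.submit(boot_master)
        slave_futures = [pool.submit(boot_slave, i) for i in range(num_slaves)]
        master = master_future.result()
        slaves = [f.result() for f in slave_futures]
    # Slaves learn the master address after boot, rather than needing it
    # at launch time, so no boot has to wait for another.
    for slave in slaves:
        slave["master"] = master["ip"]
    return master, slaves
```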
Many areas of work remain. One aspect is allowing greater inter-mixing of spot and regular instances. Optimizing for cost in AWS is a fascinating area – and we have just scratched the surface. While we have started off by providing Hadoop clusters exclusive to each customer – we may at some point look at a shared Hadoop cluster for jobs/queries that run trusted/secure code.
Our auto-scaling strategies require a lot of fine-tuning. In particular – while we have tackled the problem of sizing the cluster to respond to the requirements of the queries – we have not tackled the inverse problem – of configuring the queries to work optimally within the potential hardware resources available.
Available for Early Access!
Our auto-scaling Hadoop technology is now available for early access. Users can sign up for a free account – and log in through a browser-based application to run Hadoop jobs and Hive queries without worrying about cluster management and scaling. While much more remains to be done – we are happy to have solved a fundamental problem within a few months of our existence.