Co-authored by Hariharan Iyer, Member of the Technical Staff at Qubole.
Amazon Elastic Block Storage (EBS) provides highly available block-level storage volumes for use with EC2 instances. Previous generations of EC2 instances had sufficient ephemeral storage (e.g. m1.xlarge has ~1.6TB) in each instance to accommodate for most big data workloads. However, with the newer instance types, the available ephemeral capacity has been reduced (e.g. m3.xlarge only has 80GB/instance) or removed entirely (m4 and c4 instances). Since these instances offer newer hardware and better performance at a lower price, most of our customers have started to use them as a workaround in the absence of ephemeral storage with EBS volumes.
Qubole Data Service (QDS) intelligently and automatically responds to both storage and compute needs of bursty and unpredictable big data workloads. It evaluates the current load on the cluster in terms of compute power and intelligently predicts how many additional workers are needed to complete the workload within a reasonable time. However, we’ve observed situations when the cluster is chuffing along smoothly from a compute perspective, but is running out of storage capacity. This could be because of job or mapper outputs being dumped in HDFS, for example.
In this blog post, we have described an approach we’ve implemented which dynamically scales the storage (independent of compute) to match the workload.
In Hadoop MapReduce 1.0, we implemented a solution for HDFS-based autoscaling which added new instances when the cluster was running out of HDFS capacity. However, in such cases you’d be paying for compute (which can be expensive) when all your workload really need is storage–which is generally inexpensive.
Our new implementation automatically upscales the storage capacity of individual nodes that are nearing full capacity. This is achieved by dynamically adding new EBS volumes to the nodes. This approach has the following advantages compared to adding a new instance.
- Adding an additional EBS volume is a lot cheaper than adding an instance. A 100GB SSD-backed EBS volume, for instance, costs ~1 cent per hour. In comparison, an r3.2xlarge – one of the commonly used instances for QDS clusters costs $0.665 per hour. Therefore, adding a new disk instead of a new node is 66x cheaper in this case and the price difference would be even higher with more expensive instance types.
- EBS volumes can be added much faster than an additional node. On a lightly loaded instance, a new EBS volume can be attached and mounted in under 10 seconds. Addition of a new node takes ~4-6 minutes. This allows QDS to delay the upscaling until later, thus potentially saving costs.
- Adding a new instance adds to only the HDFS capacity of a cluster. However, some workloads use a lot of local disk capacity. For example, when there is mapper output spill, addition of a new node does not help. Instead, expanding the storage of an instance using EBS volumes does the trick.
- With upscaling, you can start each instance with a small amount of EBS storage, adding additional storage capacity when needed. This avoids the need to over-provision storage for a worst-case scenario. Thereby, reducing the cost of running the cluster.
- Upscaling only adds disks to cluster nodes that are nearing full capacity. This allows running larger clusters because you avoid hitting the volume limit with larger number of nodes.
EBS-based storage upscaling is independent of, and complementary to, the compute-based upscaling. Storage may be added to existing instances at the same time when instances are being added to accommodate additional compute workload. This reduces the risk of running out of space while upscaling.
How it works
The bedrock of this solution is the Logical Volume Manager (LVM) feature of Linux. LVM lets us do many interesting things with disks. For example, dynamic expansion/reduction of storage, caching, striping, etc. Upscaling relies on dynamic expansion. Here’s how:
- A volume group is created from the initial set of EBS volume(s)
- A logical volume is created spanning the entire volume group
- The Linux file system and HDFS metadata are created in this logical volume
And here are the two triggers for capacity upscaling:
- Based on a fixed threshold – Additional disk is attached when free capacity of the logical volume falls below a threshold
- Based on progress rate of capacity – Based on historical increase in used capacity over time, if current progress rate indicates that the volume is nearing full capacity, an additional disk is attached to it.
Once a new volume is attached and the file system resized, the new capacity is immediately available for use in HDFS.
LVM reliability and performance
Traditionally, LVM has not been used extensively with HDFS. In fact, on bare metal clusters, it is recommended otherwise. One of the primary reasons for this is that in an LVM volume, even losing one of the disks can make the entire volume unusable. This, on the other hand, is not the case with EBS volumes because they are network drives and therefore do not have the same level of risk. (Note: AWS claims they’re 20x as reliable as physical drives.)
There is also a perception that LVM adds unacceptable performance overhead. However, in our tests we’ve found the difference to be insignificant. Here are the numbers with bonnie++, for example:
|Write (KB/s)||Rewrite (KB/s)||Read (KB/s)|
|EBS volume without LVM||91,391||56,512||117,283|
|EBS volume with LVM||91,420||56,470||116,434|
(Note: The tests above were performed on an m4.xlarge machine with a single 100GB SSD EBS volume)
We also ran a 100GB Teragen on a 5-node cluster with and without LVM, and the difference was minimal — 258secs without LVM vs 259secs with it. This indicates that the traditional objections to LVM do not apply to HDFS clusters using EBS volumes.
We’ve taken this approach compared to volume management functionality of HDFS because:
- Hadoop on QDS is based on Apache Hadoop 2.6.0 (+ fixes & features from newer versions) which does not have this feature.
- HDFS volume management does not help if there is a lot of non-HDFS data causing an instance to run out of storage.
EBS-volume based storage autoscaling in QDS offers simple and very cost-effective way to dynamically increase the capacity of instances in a cluster.
In the near future, we will explore LVM support for striping disks and tiered caching to reap performance benefits. For example, using ephemeral disks as a cache for data on EBS volumes might be a great option for instances with very little ephemeral storage such as m3.xlarge or c3.xlarge. This may lead to performance benefits on some workloads, especially with our disk-based caching solution, Rubix.
To enable this feature in your existing QDS account, please read detailed documentation.
 Amazon does offer Storage Optimized instance types with a lot of dedicated ephemeral storage, namely the d2 and i2 types. However these tend to be rather expensive and for this reason have not found much traction among our customers.