Best Practices for Hadoop in the Cloud

Are you considering Hadoop in the cloud? Are you currently facing big data challenges? Are you looking for tools to make your big data operation more efficient?

In this webinar, Matt Aslett of 451 Research and Ashish Thusoo of Qubole (founder of Apache Hive) will outline the benefits of Hadoop implementation, examine common use cases for big data projects of various size and scope, and explain how Hadoop-as-a-Service removes complexity while adding much-needed functionality. This webinar will help you understand the value of Hadoop data management using a cloud-native platform like Qubole.

Webinar Transcription

Ali: Hello, everyone. Welcome. My name is Ali Haeri from Qubole. First of all, thank you for setting aside the time to tune into today’s webinar. We’re really excited about the content that we have ready for you today.

Today we have a webinar in partnership with our friends over at 451 Research. The webinar is entitled “Hadoop in the clouds – how to make an elephant fly.” It’s going to be quite informative.

I’ll introduce the speakers in a moment. First I’d just like to say that the materials will be made available to everyone tuning in today after the webinar. We’ll provide the slides, and a full on-demand video of the whole presentation will be distributed to everyone attending, so please stay tuned for that.

To introduce our panelists, I’m very excited to have both of these guys here with us today to speak on Hadoop. They’re both experts on the topic.

First we have Matt Aslett. He’s a research director of data management and analytics within 451 Research’s information management practice. His overall responsibility is for the coverage of operational and analytical databases, data integration, data quality and business intelligence. His own primary area of focus specifically is on relational/non-relational databases, data warehousing, data caching, and specifically, the reason why we’re all here today, Hadoop.

We also have Ashish Thusoo on the line, who’s the co-founder and CEO of Qubole. Most notably, before Qubole, Ashish ran the data infrastructure team at Facebook where under his leadership the team created one of the largest data-processing analytics platforms in the world. In the process, he helped drive the creation of tools, technology and templates for successful big data deployments. Most notably, he is the co-creator of Apache Hive.

Without further ado, I’m going to pass this along to Matt Aslett first to give us an overview of Hadoop, specifically from the perspective of research. Matt, please take it away.

Matt: Thanks, Ali, and again, thank you from us at 451 Research for joining us today. I’m just going to give a general introduction, 451 Research’s perspective on Hadoop and in particular the drivers behind the adoption of Hadoop, including, of course, Hadoop in the cloud. I’m also looking at some of the cost implications of running Hadoop in the cloud versus on-premises deployments, looking at also some of the barriers to adoption of Hadoop both on-premises and in the cloud, and in particular how we believe they’re being overcome by the emergence of new managed service offerings and providers.

Before we get into that, just very briefly, I wanted to give you an introduction to 451 Group and 451 Research, in case not everybody’s come across the company before.

The 451 Group is the parent company. 451 Research is part of that along with two sister companies, the Uptime Institute, which is focused on the data center industry in particular, and also Yankee Group, one of a number of acquisitions we’ve made over the years, you can see along the bottom here.

Overall, we’ve got about 270 people and about 1,500 client organizations. In 451 Research, in particular, so that’s kind of the original part of the business, industry analysts focus particularly on the emerging technology market segments.

As you can see here, we’ve got about 800 company subscribers, about 8,000 individual subscribers, and one of the most important numbers in relation to the firm, 2,000-plus what we call “enterprise IT network members.” Those are people within enterprises and other organizations that are working with these technologies day to day and who engage with us as part of our ongoing research.

Ali’s already given an introduction to myself, but here are the details again, and in particular, the details with which you can contact me if you have any further questions beyond the Q&A or anything about 451 Research.

To get into the presentation itself, I’m starting at a high level. Of course, I’m sure a lot of people on the call have a fairly good understanding of Hadoop, but there may be some who just heard the phrase and are interested to learn more.

What is Hadoop?

It’s obviously a distributed data storage and processing platform. There’s three core components. There’s the HDFS distributed file system, the MapReduce distributed processing engine that runs on top of that, and then more recently, we’ve seen the emergence of YARN, which actually enables mixed workloads to run on the HDFS file system at the same time, and in doing so means that MapReduce becomes one of multiple engines that might run on top of HDFS rather than the primary processing engine.
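
To make the MapReduce idea concrete, here is a minimal in-process sketch of the map, shuffle and reduce steps for a word count. It is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster, the map and reduce steps would run in parallel on the nodes holding the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big clusters"]))
```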

In terms of the key benefits, and I’ll go into these in more detail, we certainly see that low cost is a significant one. Hadoop itself is open source and it also runs on commodity scale-out infrastructure, as we’ll talk about, definitely a low-cost option for running new and emerging datasets. Also what we see as important is its flexibility, particularly in terms of the schema-on-read approach to data processing and analytics. As I said, I’ll go into more detail on that as well.

To start with cost. Here’s a quote. This is from a presentation from the global head of architecture at a global bank, one of the major financial services organizations, definitely a company you would know well. As you can see, the key reason I picked out this quote was the description of the cost that Hadoop comes in at as being transformational.

This financial services firm is still actually in the process of assessing where Hadoop fits within the organization, but what they are convinced by is that the cost enables them to do new things and to bring through new projects and to analyze their data in a new way that is really potentially transformational not just for the IT organizations but for the business as a whole in terms of, as you see here, driving down operational cost and also improving resource efficiency.

To get an example of the kind of figures that we’re talking about, this comes not from that global bank but from another organization, in particular, a company that provides real-time information and analysis to the media and communications industry.

They’ve disclosed during this year that they have moved from storing 1% of their data for 60 days in a traditional enterprise data warehouse environment at a cost of $100,000 per terabyte to now storing 100% of their data for a year in Hadoop at a cost of $900 per terabyte. By migrating to Hadoop and open source databases, this company identified over $4 million in cost savings over two years.
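
As a rough illustration of the per-terabyte figures quoted here, the arithmetic can be sketched as follows. Only the two prices come from the talk; the function names and the 10-terabyte example volume are assumptions, since the company’s actual data volume was not stated.

```python
# Figures quoted in the talk; everything else here is illustrative.
WAREHOUSE_COST_PER_TB = 100_000  # USD per terabyte, traditional data warehouse
HADOOP_COST_PER_TB = 900         # USD per terabyte, Hadoop


def cost_ratio(old_cost_per_tb, new_cost_per_tb):
    """Return how many times cheaper the new platform is per terabyte."""
    return old_cost_per_tb / new_cost_per_tb


def storage_cost(terabytes, cost_per_tb):
    """Return the total cost of storing the given volume."""
    return terabytes * cost_per_tb


ratio = cost_ratio(WAREHOUSE_COST_PER_TB, HADOOP_COST_PER_TB)
print(f"Hadoop is roughly {ratio:.0f}x cheaper per terabyte")

# Hypothetical volume: the talk does not say how much data the company holds.
example_tb = 10
print(storage_cost(example_tb, WAREHOUSE_COST_PER_TB))
print(storage_cost(example_tb, HADOOP_COST_PER_TB))
```

At that ratio, the budget that once kept a small sample of data can keep roughly a hundred times as much, which is the shift from 1% to 100% described above.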

Of course, not everybody can necessarily achieve those same cost savings, and I think one of the most important factors in these figures is actually the 1% and the 100%. That level of cost savings enables an organization to really think about saving data that they previously discarded because there wasn’t necessarily any immediately identified value in storing and processing it.

Hadoop gives organizations a platform to be able to store that data, assess it, and analyze it, and that generates potentially new business opportunities, new business intelligence based on data that was previously discarded.

The other key point here is that both these organizations, one already adopted Hadoop, one in the process of evaluating Hadoop, both either retained or looking to retain the use of traditional database/data warehousing technology. Hadoop doesn’t necessarily, and indeed in our view, in most cases, it doesn’t replace those technologies, but as I said, Hadoop and other big data technologies do add cost-effectiveness and flexibility, particularly for taking advantage of data that wasn’t necessarily suitable for storing and processing and analyzing in the data warehouse.

The second point, as I said, is flexibility. The key to this is the idea about applying structural patterns when the data is analyzed rather than when it’s loaded into the database.

In a traditional schema-on-write approach that we see data warehousing environments make use of, the application produces data, and then you store that data based on a schema that you define based on queries that you know you want to ask of that data. What that means is that those environments can be very, very efficient at providing responses to those queries, but only those queries you wanted to ask when you defined the schema, so they’re pretty inflexible in terms of both asking new questions and enabling new data sources to come into that environment.

In Hadoop, and indeed there are other technologies that have adopted the schema-on-read approach, you just store the data and you don’t worry about the schema until you want to ask the question. It’s maybe not as efficient for certain queries, but what it does give you definitely is a much more flexible approach where you can bring in new data, you can explore your data in different ways, you can ask new questions and you can combine data from multiple sources as well, and you don’t have to be restricted by whether it fits the schema and the structure of the data when it was first stored.
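
The schema-on-read idea can be sketched in a few lines of Python. This is an illustration only: the sample records and the read_with_schema helper are invented here and do not correspond to any Hadoop or Hive API. The raw records are stored untouched, and each question applies its own projection at read time.

```python
import json

# Raw events stored as-is, with no schema enforced at write time.
# These records are invented for illustration.
RAW_LINES = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/search", "ms": 340, "query": "hadoop"}',
    '{"user": "a", "page": "/search", "ms": 200, "query": "hive"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: project each raw record onto `fields`.

    Fields missing from a record come back as None rather than failing the load.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Question 1: page latency. Only "page" and "ms" matter.
latency_view = list(read_with_schema(RAW_LINES, ["page", "ms"]))

# Question 2, asked later: what are people searching for?
# No reload or migration is needed; only the read-time schema changes.
query_view = [
    row for row in read_with_schema(RAW_LINES, ["user", "query"])
    if row["query"] is not None
]

print(latency_view)
print(query_view)
```

The cost of this flexibility is that every read re-parses the raw data, which is the efficiency tradeoff mentioned above.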

The other side of flexibility is we see organizations looking to store and process and analyze data in its entirety, in multiple formats, and Hadoop enables the storage and processing not just of relational structured data from traditional mainstream applications but also non-relational, semi or unstructured data from sources like weblogs, server logs, maybe even social networking data, documents, video, audio. It depends on the nature of the business and obviously the business case which you bring in, but the key point is this flexibility to not be restricted to structured relational data.

Those are definitely the positive aspects to Hadoop and some of the reasons we see people showing a lot of interest in it. Clearly, as with anything, there are certain barriers to adoption, and one of those is the issue that Hadoop is relatively complex to configure, deploy and manage.

It certainly is an environment where we see that organizations need to have those skills in-house and those are relatively hard to come by at this point, and certainly they do come at a premium. We see organizations that are wanting to dip their toe into the water with Hadoop and try it out, but it’s difficult to do so if you don’t have the skills in-house and you can’t necessarily justify paying a high wage for somebody just to come in and prove that concept. It’s definitely a barrier that we see some companies are struggling to deal with.

Also, we see that a lot of enterprises obviously have made significant investments in existing SQL analysis tools and in particular in staff, analysts themselves and database administrators, etcetera that have skills that run relational databases and SQL analysis tools and databases, and they want to make use of those and they want to bring them to those environments.

That’s something that I think the Hadoop ecosystem is responding to, but at this point still we see organizations trying to understand where Hadoop fits in their data management landscape, what is it for, how does it complement the data warehouse, what kind of applications make sense.

Looking specifically at Hadoop in the cloud, it’s important, I think, to recognize that there’s multiple ways in which you can run Hadoop in a cloud environment. They’re not all the same. Certainly you can spin up Hadoop resources yourself on an infrastructure-as-a-service environment. Also, there are packaged Hadoop-as-a-service offerings available and they’ve been available for multiple years.

Both of those we see, at this point at least, mainly used for development and test environments. As I say, a lot of people are dipping a toe in the water, wanting to figure out where Hadoop fits. Also, of course, with a lot of emerging new businesses, everything they do is on the cloud, so if you’re running Hadoop, it will just be natural to run it in the cloud.

In terms of more mainstream adoption, it is definitely early stages, and I think there is still the issue of the complexity of large-scale deployment and configuration. Just because you’re running in a cloud environment doesn’t necessarily make it any easier to configure and deploy. There is also the issue of cost-effectiveness with large-scale Hadoop adoption. I’ll touch on those in more detail in a moment.

The third option, though, is what we’ve seen emerging in the last year to 18 months, which is managed Hadoop-as-a-service providers, organizations that actually offer not just the Hadoop environment but also the configuration and deployment and management of that environment and really potentially take the complexity away from the end user. We’ll talk about how we see those fitting in as well.

The first thing I want to zero in on here is the cost considerations and Hadoop in the cloud. There’s a question we’ve seen coming up a lot recently: when does it make sense to run Hadoop in a cloud environment versus an on-premises environment?

One of the things we have seen over the years, well, not often, but we’ve certainly seen a number of examples of this, is that companies start off in the cloud environment. That’s very good for proofs of concept, as I say, spinning up your own resources on an infrastructure-as-a-service offering, but actually, if you’re using that service to a significant degree, arguably, it begins to make more sense to invest in on-premises infrastructure.

What we see is that if Hadoop workloads are sporadic, on-premises implementations are likely to be underutilized. That’s when the cloud makes sense. On the other hand, if you’re looking at a more regular usage with the cloud implementation, you never actually gain from that theoretical advantage of being able to scale back resources and so potentially your costs go up.
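
The utilization argument above can be made concrete with a small break-even calculation. This is a deliberately simplified sketch: it assumes a fixed monthly cost per on-premises node and a flat hourly rate per cloud node, and the prices used in the example are hypothetical, not figures from the talk or from any provider.

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def monthly_cloud_cost(utilization, hourly_rate):
    """Pay-per-use: cost scales with the fraction of the month the node runs."""
    return utilization * HOURS_PER_MONTH * hourly_rate


def break_even_utilization(fixed_monthly_cost, hourly_rate):
    """Utilization above which the fixed-cost node becomes the cheaper option."""
    return fixed_monthly_cost / (HOURS_PER_MONTH * hourly_rate)


def cheaper_option(utilization, fixed_monthly_cost, hourly_rate):
    """Compare one cloud node against one fixed-cost node at a given utilization."""
    cloud = monthly_cloud_cost(utilization, hourly_rate)
    return "cloud" if cloud < fixed_monthly_cost else "on-premises"


# Hypothetical prices: 219 USD per month fixed versus 0.50 USD per cloud hour.
print(break_even_utilization(219, 0.50))
print(cheaper_option(0.2, 219, 0.50))  # sporadic workload
print(cheaper_option(0.9, 219, 0.50))  # near-constant workload
```

The model ignores staffing, power and hardware refresh, so it only shows the shape of the tradeoff: sporadic use favors pay-per-use, steady use favors fixed capacity.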

This whole issue of cloud versus on-premises deployment is actually something that we as an organization are looking at in great detail at the moment, beyond Hadoop, just cloud in general. We’ve recently published a report, I’ve provided the details here but I’m happy to provide those afterwards if anyone’s looking for that, a report looking at cost and risk in assessing cloud value.

Certainly, what we see from our research is that the key to getting the best value here really lies in forecasted demand, and that in itself is something we could actually put a whole different webinar on, but just at a high level, these are some of the things we’ve been looking at in terms of the report.

This is a comparison here looking specifically at private and public virtual machine costs, but it translates equally if you’re running Hadoop on those environments. The target in this example was a theoretical user who’s wanting to justify private cloud investment. When we look at best-case scenarios versus worst-case here in the charts, bear in mind that from the perspective of the individual, the best case was to justify private cloud investment.

What we see, as we said, if usage demand is low, so the worst case in this scenario, the private cloud is over-provisioned and the cost will be higher than public cloud. If usage demand is high, in this case the best-case scenario, the public cloud cost will be higher than private. That probably is what you would expect.

What we’ve seen here from the research is the numbers actually do bear this out and not only do they bear it out in terms of the cost but what we found is that that forecast can potentially have a direct impact also on potential revenue and profit generation as well. These are significant choices being made here. It isn’t just an act of finding out which is the most cost-effective. It does have a significant impact on the revenue of the company and the profit generation as well, or at least has the potential to do so.

As I mentioned, managed Hadoop providers are acting as almost … protecting the end user in terms of the complexity of configuration and management. What we also see is that managed Hadoop providers can take on that risk of under- or over-utilization, so protecting the consumer from the cost impact. If you engage with a managed Hadoop service provider and use their service, they have to worry about forecasting how much cloud resource they are going to use. That’s definitely something we see that organizations are potentially going to be very interested in.

I’ve mentioned deployment and management complexity a couple of times. What we do see, and I mentioned it earlier and will say it again, is that Hadoop is complex to configure, deploy and manage, at least at this stage. That will certainly improve over time, but at this point, it is.

The Hadoop distributions clearly reduce that complexity in terms of dealing with the multiple moving parts, and there are also Hadoop appliances available that help with the deployment complexity by masking some of that configuration complexity.

Where do we see Hadoop in the cloud fitting in here? We certainly see it’s easier to spin up Hadoop resources in the cloud. You could just spin up anything in the cloud, that’s definitely true, but it’s not necessarily easy to configure and manage those resources in the cloud. You still need that skill set to be able to do that.

Also, what we see that’s potentially a challenge is that if you’re on a cloud environment, you as an end user don’t necessarily have control over the resources that are also in that environment and the network traffic. It’s something that people have to be cognizant of and attempt to plan for.

A good example, I think, when we start talking about Hadoop in cloud environments and some of the challenges, one of the things that often happens is people say, “Well, what about X company?” Netflix being a prime example of a company that does definitely run Hadoop in the cloud, and pretty much everything they do is in the cloud, and so it is worth looking at how they do that and seeing what lessons can be learned. They’ve got this 500-node query cluster, a 500-node production cluster, and they’re able to dynamically resize as required and access the same data without replication. It’s a very sophisticated environment.

The important thing, from my perspective looking at this, is that the reason it’s a very sophisticated environment is this thing I’ve circled here called Genie, which is responsible for job execution and resource configuration and management. It’s a custom Hadoop platform-as-a-service interface. It’s something that Netflix developed themselves and they completely rely on it.

It is important and it’s interesting to learn from this, but as an end user organization, it’s also important to think about, is this something that you could replicate? It is available as open source, but is it something that you’d be able to configure and manage and deploy yourselves? Could you replicate that?

A lot of the mainstream organizations would be looking at an environment like this and perhaps thinking that that wasn’t where their skill set lies. It’s one of a number of areas that we think need attention, as we say, when you’re thinking about Hadoop in cloud environments. Qubole recently published this document, “Five Areas that Need Your Attention.” I think it’s a really interesting document and one that we see a lot of value in. I just want to give perspective on a couple of the points here.

Elasticity is one of the things that was referenced, and I’ve mentioned it myself. I think it’s important to note that scalability and elasticity are not the same thing. Cloud certainly enables rapid expansion and elasticity, but contraction needs to be really properly planned.

We had an event recently with a number of organizations running in the cloud, and it was quite interesting that the consensus emerged that, yes, you can scale up very, very easily.

Scaling down is getting more difficult. There are technical reasons, and actually, there are kind of emotional reasons. You have to have the confidence that you don’t need that resource anymore, so you end up potentially paying for resources you don’t really need just in case you might need them.

Reliability obviously is a significant factor. We see availability of a service as kind of a measure of reliability and trust. It’s important to note, while we look at Netflix as a prime example of a company that’s very successful in terms of running almost everything they do on a cloud environment, even Netflix can’t avoid failures, and famously, on Christmas Eve last year, they suffered a significant failure of their service. It was not necessarily in their control; it was partly to do with their environment and partly to do with the cloud environment. The point is that that can have a significant impact on the way the organization is viewed in terms of its ability to deliver service and be seen as a reliable supplier or partner.

Self-service is another point and this, I think, is really important when we think about managed Hadoop-as-a-service environments. Key to a lot of the potential opportunities from the whole big data trend is empowering business analysts and giving them access to new data sources, as I’ve previously discussed, and enabling them to do that without necessarily needing the deep technical skills to write Java programs or whatever it might be.

We see Hadoop as a service as being really key in terms of providing an interface to enable those users to configure environments and analyze data on a self-service basis and also to collaborate, to send the results to their peers and collaborate on analysis of that data.

The final two are monitoring and open source. Monitoring is clearly important. Unfortunately, the cloud doesn’t run itself, and it is important for organizations to monitor those environments and ensure that they are able to meet their service-level expectations.

Again, I think this is an area we see that managed service providers have a key role to play in terms of taking that away from the end user, managing for them and providing alerts obviously and giving them an element of control and knowledge about what’s happening, but doing so based on predefined service-level agreements rather than just bombarding them with information.

Finally, open source. As I mentioned earlier, open source is definitely a benefit of Hadoop in terms of reduced cost. We see that people are concerned about lock-in, and open source can help with that, but we are now going to see that keeping up with that open innovation can be a challenge.

I mentioned the three main components of Hadoop earlier. There are many, many applications and projects that go to actually make up a Hadoop distribution, and keeping all those moving parts operating together can be a significant challenge. That again is where we think that some of the managed service providers can help take that away from the end user and just present them with an environment that they can engage with to get up and running and very quickly analyze data.

Hence this is where we see managed Hadoop-as-a-service providers fitting into this grid I presented earlier. In particular, what they bring to the table is things like cloud orchestration, automated provisioning, configuration and management, query analysis and visualization tools, the self-service element, and also application development and test environments, so they provide an environment in which people can quickly get up and running in terms of developing the applications.

As I said, allow the users to concentrate on the analytics instead of keeping that Hadoop environment up and running and mask also that financial risk, as I mentioned earlier, of forecasting the usage of that environment, so take that risk away from the end user and just enable them to get on and engage with the platform.

To conclude, we definitely see increasing demand for Hadoop as a low-cost, flexible data storage and processing platform. We’ve been covering it for a long time. The interest is significant. It’s absolutely very, very real. It’s still early stages, but it is definitely growing.

Interest in Hadoop in the cloud is definitely growing as well. Again, it is at an even earlier stage, perhaps, and we see a lot of it primarily for development and test environments, as I say, but it is definitely something that is significant.

The cost and complexity of configuring and deploying Hadoop, particularly in the cloud, can limit large-scale deployments. As we said, Hadoop in the cloud requires provisioning, configuration, job-execution capabilities, and those can be in addition to running on-premises. As we saw with the Netflix example, they had to develop that themselves.

We think that managed Hadoop-as-a-service providers really do have the potential to mask that complexity of adopting Hadoop, particularly as we look at more mainstream organizations without the skills in-house to configure and deploy that. They really will be looking for partners to help them do that, and managed Hadoop-as-a-service providers can certainly play a part in that, but they also, as we said, have a significant role to play, we think, in terms of the potential to reduce the cost risk of deploying Hadoop in the cloud and having to forecast the usage levels upfront.

That’s our perspective. I’m happy to answer questions obviously towards the end of the presentations, but for now, I’m going to hand it over to Ashish, who will provide you with Qubole’s perspective on best practices for Hadoop in the cloud based on some of what I’ve already discussed today. Thanks very much.

Ashish: Thanks, Matt, for the great perspective from 451 Research on Hadoop and also Hadoop in the cloud. Hello, folks. My name is Ashish. I’m the CEO and Co-founder of a company called Qubole.

For the past two years, Qubole has been building and running a service in the cloud around Hadoop, and a lot of this presentation is around what we have learned during those two years about what are the best practices of running Hadoop in the cloud.

As you know, Hadoop emerged at a time when people would run clusters themselves and cloud was not an option at all. A lot of Hadoop development was geared towards those environments, and over the last two years at Qubole, while developing the service, we have discovered that running Hadoop in the cloud is very different from running Hadoop on-prem, and a lot of this presentation is going to cover that.

At Qubole, we today process about 8 terabytes of data every month, running clusters all the way from four-node clusters to thousand-node clusters for our clients. This perspective distills all that we have learned while running these clusters both at small as well as very large scale in a cloud environment, and specifically the cloud where we operate is the Amazon cloud, so a lot of this perspective is from our learnings on AWS running both small as well as very large scale deployments of Hadoop on the cloud.

Let me dive right into the content of the presentation. There are five key differences between the capabilities of what the cloud provides and what a fixed machine data center or fixed asset data center provides. These are the dimensions which I’ll be covering in this presentation, telling you how Hadoop on the cloud differs along each of these dimensions from standing up your own clusters.

Cloud has elasticity. We all know about that. That is a big winning point of the cloud. You can grow and shrink your resources depending upon your usage.

Cloud also has multiple mechanisms of provisioning machines. For example, on the Amazon cloud, you can provision machines at the on-demand price or you can provision machines through the spot market, where you can control the amount that you spend on your machines.

Admittedly, on the spot market, things are much cheaper, but at the same time the tradeoff there is that the machines can be taken away from you if somebody else bids a higher price. In that perspective, how does Hadoop work? It’s a different environment. How do you take advantage of operating and running Hadoop in that environment?

The third part where cloud and on-prem data centers differ is that in the cloud architecture, compute and storage are separate. On Amazon, you have S3 as the storage layer, which is separate from EC2, which is the compute layer, whereas if you think about what Hadoop was built for, it was primarily built for on-prem environments, and the whole thesis around the building of Hadoop was ‘Let’s keep compute and storage together.’

That was how it was in on-prem environments, but on the cloud environment, things are very different. Storage is really separate from compute and there are advantages to keeping it that way on the cloud. What needs to be done with Hadoop in order to take advantage of that environment, we’ll cover that subsequently.

The fourth part is that most of the object stores on the cloud are eventually consistent, which means that once you write data there, the data will eventually get there, but you may not be able to see the data for a period of time even after the write has returned. This also has strong implications for how you run and operate Hadoop in an environment like that.
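
One common way for a job to cope with this is to poll until a freshly written object becomes visible before reading it. The sketch below uses a toy in-memory store to simulate the delay; the EventuallyConsistentStore class and the wait_until_visible helper are hypothetical names for illustration and do not model the API of any real object store.

```python
import time


class EventuallyConsistentStore:
    """Toy in-memory store: a written key becomes visible only after
    `lag_reads` further read attempts. A stand-in for illustration only."""

    def __init__(self, lag_reads=2):
        self._lag_reads = lag_reads
        self._data = {}
        self._pending = {}  # key -> reads remaining before the key is visible

    def put(self, key, value):
        self._data[key] = value
        self._pending[key] = self._lag_reads

    def exists(self, key):
        remaining = self._pending.get(key)
        if remaining is None:
            return False  # never written
        if remaining > 0:
            self._pending[key] = remaining - 1
            return False  # written, but not yet visible to readers
        return True


def wait_until_visible(store, key, attempts=5, delay_seconds=0.01):
    """Poll with exponential backoff until the key is visible or attempts run out."""
    for attempt in range(attempts):
        if store.exists(key):
            return True
        time.sleep(delay_seconds * (2 ** attempt))
    return False


store = EventuallyConsistentStore(lag_reads=2)
store.put("output/part-00000", b"results")
print(store.exists("output/part-00000"))          # write returned, not yet visible
print(wait_until_visible(store, "output/part-00000"))
```

A downstream step that lists the output directory immediately after an upstream write would otherwise risk silently missing files.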

The fifth part is security. There are a lot of features around security which are cloud-specific such as virtual private cloud and so on and so forth. Running Hadoop or a big data infrastructure in that paradigm is also different from what you would see on an on-prem deployment.

Let’s talk about elasticity first. The biggest implication of elasticity is that you can provision machines on demand and you can get them very quickly. In an on-prem environment, provisioning machines itself might take months whereas on the cloud you can provision machines in a matter of minutes.

This basically means that instead of thinking about running clusters all the time, you have an option of running Hadoop clusters on demand. You have an option of even running long-running clusters but changing the size of those clusters as your demand changes over a period of time.

As was mentioned in Matt’s presentation, Hadoop needs to be augmented with capabilities which allow this to be done in an automatic manner. It brings a lot of benefits with it if you have auto-scaling and self-management of clusters, wherein clusters come up on demand in response to query workloads and scale up and scale down on demand in response to jobs and transformations.
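
The kind of decision an auto-scaler makes can be sketched as follows. This is a minimal illustration under simple assumptions (a fixed number of tasks per node and hard minimum and maximum cluster sizes); the function names and parameters are hypothetical and do not describe how Qubole or any other service actually implements auto-scaling.

```python
import math


def target_cluster_size(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return the node count needed for the pending work, clamped to the limits."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, target_nodes):
    """Describe the change needed to move from the current to the target size."""
    if target_nodes > current_nodes:
        return f"add {target_nodes - current_nodes} nodes"
    if target_nodes < current_nodes:
        return f"remove {current_nodes - target_nodes} nodes"
    return "no change"


# A burst of work arrives on a 4-node cluster, then the queue drains.
busy = target_cluster_size(pending_tasks=95, tasks_per_node=10, min_nodes=2, max_nodes=50)
idle = target_cluster_size(pending_tasks=0, tasks_per_node=10, min_nodes=2, max_nodes=50)
print(scaling_action(4, busy))
print(scaling_action(busy, idle))
```

A production scaler would also need to drain running tasks and protect stored data before removing a node, which is part of why scaling down is the harder direction.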

Benefits of Running Hadoop in the Cloud


The big benefits here are that you can of course get rid of the machines whose work is done, and at the same time you can make sure that your cloud resources are being used in an optimal manner and you are not paying for resources that are sitting idle. These kinds of capabilities are just not available in an on-prem setting.

For Hadoop, the implication is some of the work that we have also done around auto-scaling Hadoop, which is by far the only auto-scaling Hadoop available as a service that we know of. With auto-scaling, all this is taken care of by the software itself and you can really be sure that cloud resources are being used in an optimal manner.

That is about elasticity. Moving on to spot pricing: the Amazon cloud lets you procure machines in different ways, and one of the most interesting is the spot market. A lot of our clients, including companies like Pinterest and Quora, use or have used the spot market very heavily, primarily because you can get machines at a fraction of the cost. Sometimes the cost is 1/10th that of an on-demand machine.

The tradeoff there is that these machines can be taken away at any time without any heads-up from Amazon. The way the spot market works is that a lot of people bid for machines, and if somebody’s bid goes above your bid, Amazon can take the machine and provision it to them, of course after cleaning out everything that had been used by the current client.

Now, there are different implications for how you trade off the cost benefits of spot machines. How do you evolve your Hadoop services to make sure that they don’t fall apart when the machines can be taken away at any time?

Best Practices

Some of the best practices that we have seen emerge are these: depending upon workloads, you can pick a certain percentage split between on-demand and spot nodes within the clusters. By controlling the split, you can trade off the unpredictability of machines being taken away against the cost benefits of using the spot market.
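To make the split concrete, here is a minimal sketch of the arithmetic behind it. The prices are placeholders (the spot price is set to one tenth of the on-demand price only because that ratio is mentioned in the talk as something seen sometimes), and the function and parameter names are hypothetical.

```python
import math


def blended_hourly_cost(total_nodes, spot_fraction, on_demand_price, spot_price):
    """Hourly cost of a cluster with a given share of spot nodes.

    Returns (on_demand_nodes, spot_nodes, cost). The spot node count is
    rounded down so the guaranteed on-demand part is never under-provisioned.
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    spot_nodes = math.floor(total_nodes * spot_fraction)
    on_demand_nodes = total_nodes - spot_nodes
    cost = on_demand_nodes * on_demand_price + spot_nodes * spot_price
    return on_demand_nodes, spot_nodes, cost


if __name__ == "__main__":
    # Placeholder prices: 1.00 per on-demand node-hour, 0.10 per spot node-hour.
    for fraction in (0.0, 0.5, 0.8):
        print(fraction, blended_hourly_cost(10, fraction, 1.00, 0.10))
```

Raising the spot share lowers the bill but also raises the share of the cluster that can disappear without notice, which is the tradeoff being dialed.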

In order to minimize the effects of unpredictability, there are significant implications within Hadoop. For example, the Hadoop clusters have to be aware of which nodes are spot nodes and which are on-demand nodes. They have to make sure that replicas of all their data are available on the on-demand nodes so that even if the spot nodes are taken away, the whole cluster doesn’t fall apart.
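One way to read this requirement is that every block keeps at least one replica on an on-demand node. The sketch below shows that placement rule in isolation. It is an illustrative assumption, not the placement policy of HDFS or of any vendor, and the node names and function are hypothetical.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose nodes for one block so that the first replica is on-demand.

    Assumes node names are unique across both lists. The remaining replicas
    may land on either kind of node.
    """
    if not on_demand_nodes:
        raise ValueError("at least one on-demand node is required")
    anchor = rng.choice(list(on_demand_nodes))
    others = [n for n in list(on_demand_nodes) + list(spot_nodes) if n != anchor]
    extra = rng.sample(others, min(replication - 1, len(others)))
    # The anchor replica survives even if every spot node is reclaimed.
    return [anchor] + extra


if __name__ == "__main__":
    print(place_replicas(["od1", "od2"], ["spot1", "spot2", "spot3"]))
```

With a rule like this, losing all spot nodes at once reduces the replica count of some blocks but never loses a block outright.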

These are all things that have to be built into Hadoop in order to take advantage of the spot market, and this is, again, something different on the cloud, a very different environment than what you would get on an on-premises sort of a data center approach.

Moving on, compute and storage are separate on the cloud. This is one of the biggest implications that the cloud has for Hadoop. On the cloud, the best practice is to store the data on object stores and not HDFS. The primary reason for this is that object stores such as S3 are more cost-effective than running your clusters and paying for the CPU of those clusters just to keep HDFS up and running.

Because of these cost benefits, most of the bulk data, such as log data that is being collected on the cloud, is dumped into object stores such as S3 on the Amazon cloud. Also, the durability is much higher. They promise something like eleven 9’s in terms of data durability as compared to HDFS. There are benefits from that angle as well, the primary benefit, though, being that you can scale your storage independently of your compute, and that gives you a lot of cost benefits as well.

Now, in order to really operate in this particular environment, there’s a performance gap to deal with. Once storage and compute are separate, even though the raw performance of S3 and HDFS was similar in our internal tests, there is a lot more variance in the performance of S3 than in the performance of HDFS.

A Hadoop deployment working in this environment needs to bridge this gap in order to mask it from the user’s experience. One of the techniques that we have come up with at Qubole to mask this gap is to build caches in between S3 and the clusters that we bring up on demand, so that data can be cached in HDFS as it moves between the cluster and S3, and the latency can be masked.
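The general idea of a cache sitting between a cluster and an object store can be sketched as a read-through cache. This is only an illustration of the concept under simplifying assumptions: an in-memory dictionary stands in for cluster-local storage such as HDFS, `fetch_from_store` is a hypothetical callable standing in for an object-store read, and nothing here describes Qubole’s actual implementation.

```python
class ReadThroughCache:
    """Serve repeated reads locally and go to the slower store only on a miss."""

    def __init__(self, fetch_from_store):
        self._fetch = fetch_from_store  # hypothetical slow read, e.g. from an object store
        self._local = {}                # stand-in for cluster-local storage
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._local:
            self.hits += 1
            return self._local[key]
        self.misses += 1
        data = self._fetch(key)
        self._local[key] = data  # keep a local copy for later reads
        return data


if __name__ == "__main__":
    store = {"logs/part-0": b"line1\nline2\n"}
    cache = ReadThroughCache(store.__getitem__)
    cache.read("logs/part-0")
    cache.read("logs/part-0")
    print(cache.hits, cache.misses)
```

Only the first read of a key pays the object-store latency; later reads are served from the local copy.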

This, too, is a very different environment in terms of how Hadoop has to be run on a cloud infrastructure such as AWS versus how it can be run on-premises.
The fact that object stores are used heavily also means that you have to live with the fact that many of the object stores are eventually consistent, meaning that once you write to those stores, the data may not be available immediately to be read.

There are mechanisms to minimize the … Eventual consistency also differs from an assumption that Hadoop makes. For example, HDFS is a strongly consistent file system, whereas an S3 bucket, especially if you use the US-East region or the Standard region, is an eventually consistent object store.

Emerging Techniques

There are techniques that have emerged in order to make sure that in the worst case the effects of this are minimized. Those techniques include running the clients in the same zones as your object stores, especially the clients which are spawning off jobs or running queries and so on and so forth.

At a more advanced level, there are techniques around writing the metadata of the objects that are being created into a strongly consistent system as well. For example, Netflix does this with DynamoDB. You could also do that in HDFS itself, storing this metadata and comparing it with the metadata that is pushed into S3, just to make sure that Hadoop and its components can mask out these effects of eventual consistency. That, again, is very different from what you will get on-prem versus what you get on the cloud.
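The comparison step can be sketched as follows: the keys a job recorded in the strongly consistent system are checked against what the object store currently lists, and the reader retries until the listing catches up. The function names are hypothetical, `list_objects` is a stand-in callable for an object-store listing, and the retry policy is an assumption made for illustration.

```python
import time


def missing_from_listing(expected_keys, listed_keys):
    """Keys recorded in the consistent metadata store but not yet listed."""
    return sorted(set(expected_keys) - set(listed_keys))


def wait_until_consistent(expected_keys, list_objects, retries=5,
                          delay_seconds=1.0, sleep=time.sleep):
    """Re-list until every expected key is visible or retries run out.

    `list_objects` is a hypothetical callable returning the keys the object
    store currently shows. Returns True once the listing is complete.
    """
    for attempt in range(retries):
        if not missing_from_listing(expected_keys, list_objects()):
            return True
        if attempt < retries - 1:
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Simulated listings: the second object only becomes visible later.
    listings = iter([["out/part-0"], ["out/part-0", "out/part-1"]])
    ok = wait_until_consistent(["out/part-0", "out/part-1"],
                               lambda: next(listings), delay_seconds=0.0)
    print(ok)
```

A job that reads the output of an earlier job would run a check like this before starting, so that it never silently processes a partial listing.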

Security in the Cloud

Finally, let’s talk about security a bit. On the cloud, on AWS for example, there are a lot of security features, such as being able to provision virtual private clouds, being able to encrypt the data on the object stores, and lately, hardware security modules that give key management control to the clients.

In order to take advantage of these features, you also have to think through how Hadoop installations need to be deployed and how they have to be integrated with some of these features. For example, if you have implemented caches on HDFS, do those caches also keep data encrypted? How do you integrate this with HSMs, with CloudHSM, and so on?

These are other features which are available on the cloud. Security in the cloud fundamentally looks somewhat different from what you would see in an on-premises data center where, frankly, the perimeter is a lot more … since it’s not a very multi-tenant system, the security features are a lot simpler as compared to what you would get on the cloud.

5 Best Practices Overview

In conclusion, I will just go over the five best practices that we have listed here. First, use on-demand clusters to scale up and down with your workloads. That is the scenario where you can really take advantage of what a cloud provides. Second, use the spot market to procure machines, but trade off the unpredictability of the spot market against the cost benefits that you’ll get, and dial that into the style of your clusters and the percentage of the clusters that you want to get on the spot market.

Third, use object stores rather than storing data in HDFS, primarily for the cost benefits, but also for the durability benefits that you get from using object stores. Fourth, protect against the eventual consistency of object stores, and fifth, integrate with cloud security in order to get a secure environment while running Hadoop in the cloud.

With that, I’ll conclude my presentation and pass it back to Ali to carry on with the Q&A.

Question and Answer Session

Ali: Thank you to the both of you, first of all, very insightful, and I really appreciate all of the thought that went into this presentation.

Yes, as mentioned, we have a Q&A session now, and we’ve already received quite a few questions. We’ll try to get to as many as we can here, starting with the first one we’re asked: What’s the typical resource allocated to a VM in this instance? To be more particular, how much vCPU and memory is assigned to a VM with regard to Hadoop? Would either one of you like to tackle that one first?
Ashish: Sure. I don’t know whether this pertains to Matt’s numbers or to the Hadoop in the cloud question, but we see a lot of different types of machines being used for bringing up Hadoop clusters. The default that is being used is m1.xlarge machines, which correspond to four vCPUs and about 15 GB of memory. However, there are workloads which go much higher than that: c1.xlarges are higher on CPU; cc2.8xlarges are really high-end machines which have 32 vCPUs and around 64 GB of RAM with a 10-gigabit network.
There are a lot of options in terms of what you can choose to run your clusters on. It really depends a lot on the workloads. In our experience, m1.xlarges provide the biggest coverage for most of the types of workloads, but specialized workloads can go higher on the CPU.
Ali: Great. Thank you.
Matt: That wasn’t in specific relation to the data I’ve presented. That actually was taken from a much larger report that we have that was actually just published today, which discusses various kind of configurations and options. Unfortunately, I don’t have at hand the details of what that specific example related to, but obviously if we can get the details we can make that report available.
Ali: Great. The next question actually is directed back to you, Matt, and it’s simply, what is an operational database? Can you provide an example of an application running an operational database on cloud infrastructure?
Matt: Yes. I don’t use that so much. Let me talk about operational databases. I guess we’re talking about non-analytic databases, essentially, what people would traditionally have referred to as kind of transactional databases, but we tend to use the phrase “operational” because, particularly if we’re looking at NoSQL kind of database environments, we’re not necessarily talking about transactional applications but we are talking about operational versus analytic.
Those could be web-facing applications, could be retail environment, it could be session data. We’re talking about operational data, not necessarily transactional, although, again, it could be that transactional part of the application as well.
Ali: Great. Thank you. Ashish, this next one is for you. It’s, how useful is managed Hadoop-as-a-service for power users? How easy or difficult is it for power users to customize their Hadoop deployment? In essence, while it is easy to see the advantage of managed Hadoop for less advanced users, how useful is it for power users?
Ashish: I think it is very useful for power users as well. Of course, the big benefit …
It depends upon what you define as power users. If you define power users from the perspective of the users of Hadoop, it makes all the more sense because with a managed Hadoop as-a-service, all the infrastructure parts of Hadoop are completely taken care of.
You can still write complex MapReduce programs, you can write your own scripts, you can do exactly the same type of things that you can do with an on-prem Hadoop deployment, except that you don’t have to keep running to your operational teams to provision more nodes or to figure out if your Hadoop clusters are down or up.
Irrespective of the abilities of the users, whether they’re power users or simple SQL users or they’re coming through other interfaces such as ODBC and so forth, a big simplification of the infrastructure takes place, and that’s what makes it very relevant to use managed Hadoop as-a-service even for power users.
Of course, on the managed Hadoop-as-a-service side, enough flexibility has to be provided so that all the benefits in terms of running specific codes and so on and so forth that people can do with on-prem Hadoop are also provided as part of the managed Hadoop-as-a-service paradigm.
Ali: Great. Thank you again for the question and for the answer. The next question is for Matt. Can you provide any information on where to find detailed use-cases that suits Hadoop in the field of specifically telecommunications and insurance services? It’s an awfully detailed question.
Matt: Yeah. There’s a fair amount of information out there, I think. Obviously if you look around the Hadoop vendors, distributors and service providers themselves, you can find some good information. In particular, telecommunications and insurance are areas where we see a lot of adoption.
Also, if you look at some of the Hadoop meet-ups and Hadoop user group events, you can find a lot of good information there. Obviously we’ve got some sources of information ourselves, so, again, that’s something to follow up rather than to provide you with links and stuff here, but yes, if you want to follow up with us, we can point you in the right direction offline.
Ali: Great. Thank you for that. Again, that is from the firm 451 Research. The next question is back to Ashish. Our question asks: If I understand it correctly, you are hosting Hadoop instances on AWS infrastructure; is it possible to integrate other tools, open source tools, into Hadoop instances?
Ashish: Right. If the question is being able to provide open source libraries for … making open source libraries available to Hadoop jobs and Hadoop instances, then yes, it is completely possible to do that. In Qubole, we have things like boot scripts that can be run in order to provide open source tools to be available … or the libraries to be available to Hadoop jobs. A lot of flexibility is maintained through that mechanism.
On the very bottom of … at the level of Hadoop itself, the machines themselves are also launched in the AWS accounts of the clients, so they have full control of all the machines as well. In specific cases, we can even enable clients to gain control of the machines themselves.
There’s a lot of flexibility built into the Qubole platform which allows you to run your own libraries and other open source libraries as well within the instances of your Hadoop jobs.
Ali: Great. Moving on to our next question, it asks, are there any numbers or limitations for I/O operations on the block storage? Have you done some performance testing on the cloud? So a performance question.
Ashish: We have done a bunch of performance testing on the cloud, and I think there are a few blog posts also written about that on Qubole itself. Typically, especially for Hadoop, since it’s a use-case which does a lot of sequential processing as opposed to random lookups, the throughput numbers become the most important things to look at.
The throughput numbers are definitely … On S3, just raw throughput numbers are usually lower than what you would get from an attached disk. However, what we find interestingly in the Hadoop stack is that a lot of the work is in the upper part of the compute stack.
For example, we had done some benchmarking with Hadoop running on EC2 using HDFS and Hadoop running on EC2 using S3 as storage, and we found that there was not much difference in terms of raw throughput performance between the two. That means the bottleneck was not there. The bottleneck was primarily in the upper layers of the stack, because they are doing most of the computation.
Ali: Great. Two more questions here. The next question is asking, when you need to upload data to Hadoop in the cloud, what is the best practice? Is it possible to use Apache Flume?
Ashish: Sure. We get asked this question quite a lot. Interestingly, yes, there are a bunch of projects that people use. Flume is certainly one of them. Scribe was another one.
However, recently, Amazon also announced this project called Kinesis. I believe that’s still in beta, it’s not completely available, but people should definitely check that out on the Amazon cloud. That could become a way of dumping data into S3. Once you get data into S3, then Hadoop on the cloud becomes very easy.
It’s basically a question of how you get your data into S3, whether you do it through a bunch of S3 utilities, through tools such as Apache Flume, or with Kinesis. I believe Kinesis could be a much stronger and more reliable service where you don’t have to do any operational management of that infrastructure and could use it just as a log collection service which dumps that data into S3, and then Hadoop can get the data from S3.
Ali: Great. Our final question is directed towards Matt, but Ashish, feel free to jump in. It’s asking, is there a rule-of-thumb default recommendation for database technology specifically for flexible analytics? An example given is HBase versus Cassandra. Any recommendations on that? Matt, are you still on the line?
Matt: Yeah. Sorry, I was muted and also thinking.
Ali: No problem.
Matt: I think one of the things we definitely see around the choices in this environment is that it very much depends on the individual organization, the skills they have in-house, the technology they’ve used before. Obviously, there are certain workloads and applications that perhaps lend themselves more towards HBase or Cassandra, or something like Riak for that matter, but often it’s actually tied up in what the company has in-house, where their skillset lies.
We definitely see HBase. I think HBase adoption has grown considerably in the last couple of years because it is obviously part of the Hadoop stack, but Cassandra is still going strong. As I say, it very much depends on the individual use-case. It’s perhaps something to suggest to organizations: if they are going to make a choice there, actually download them, play around with them, and figure out what matches both the technology case and the skillset that you have in terms of being able to manage those technologies.
Ashish: If I may interject there, I see …
Ali: Please.
Ashish: … HBase and Cassandra as great technologies for two types of workloads. One is serving. They’re great for point lookups. If your analytics are around point lookups, then they are great tools to actually use.
A lot of analytics is around finding patterns in large sets of data, and that I think is more tailor-made for just plain old Hadoop, because there are no point lookups there, you’re aggregating vast quantities of data. There, I think Hadoop plays a much, much bigger role, whereas HBase and Cassandra play a much bigger role in analytics built on point lookups, or if you are serving OLTP workloads and things like that. In search sorts of applications, there are a lot of point lookups. That’s what I have seen in the past.
Ali: Great. Thank you for the answers, and thank you to everyone attending and for these awesome questions. Everything was very insightful.
That concludes our webinar here. I just wanted to say, again, a bit of a reminder: After this webinar, stay tuned. You’ll be receiving an email with a link to an on-demand version of this webinar, as well as the slides that you saw in today’s webinar.