Migrating Big Data to the Cloud: WANdisco, GigaOM and Qubole


Speaker 1: Good morning, everybody. We’ll start the webinar in two minutes.

Heather: Hello, everyone. This is Heather, Events Manager here at WANdisco. Welcome to today’s webinar, Migrating Your Big Data Infrastructure to the Cloud. In today’s webinar, we will discuss what’s required to avoid the downtime and business disruption that often accompany cloud migration projects. Today’s presenter is William McKnight. William is an analyst for GigaOM Research. We will also hear from Paul Scott-Murphy, WANdisco VP of Product Management, and Jin Kwon, Qubole Director of Product Management.

Before we get started, please note that this presentation will be available for replay immediately after it concludes. You will receive a link to the webinar replay via email, or you can simply return to this page. If you’re experiencing any technical difficulties, please use the help link at the bottom right of the viewing screen or email [email protected]. You may submit questions at any time using the ‘ask a question’ button at the top of the viewing screen. Questions will be answered at the end of the session. If we can’t get to all of them, we will follow up with you after the webinar.

Please take a moment at the end of today’s webinar to rate it using the ‘rate this’ button. We would also love to read any comments you have. Now I’ll hand it off to our presenter, William.

William McKnight: Thank you, Heather, and welcome, everybody. We have an action-packed hour ahead for you addressing a very important topic on all our minds, and that is the cloud. The cloud is pervasive in every technology decision that I’ve been a part of for the past several years. More and more, we actually select cloud options for the software that we’re bringing in for data projects, whether it’s big data, as we’re focusing on today, or relational legacy data, master data, analytical data, or any kind of data, really.

The cloud is a strong option for that and it really does behoove you to consider the cloud as an option for this. If you’re locking yourself out from that option today, and some still are, you’re really going to start missing out and you might be overwhelmed. Everybody should be addressing their cloud issues with a cloud strategy, and I know we all have certain elements of our enterprise in the cloud. We must, today, to be viable but that is potentially quite different from a cloud strategy.

What I’m going to present is the result of my research into selecting a platform for big data. The reason I entered this research was because a lot of my clients were saying, “I want Hadoop. I hear a lot of noise. Everybody’s saying different things. Some people are talking over my head out there in regards to this, and I know it’s not the only thing. I know it’s going to have to co-exist, and I want a minimum viable product for Hadoop. I want A to Z, and I want to get there as quickly as possible, because we’re agile. We’ve already adopted agile, and we want to be agile with our big data. We want to get there quickly, and we know we’re going to iterate. But how do we get there quickly?”

We don’t want to sacrifice long-term scalability in getting things up and running quickly. I set out to find that happy medium, that happy balance, and I’m going to share some of that with you today along with my colleagues that are going to come on from WANdisco and Qubole. A little bit about me, my book came out this week actually called Integrating Hadoop and it’s about integrating data into Hadoop and integrating Hadoop into a real enterprise. I have a lot of good data experience there. Some of that will come out. We are on the way to 50 billion connected devices by 2020.

Now, connected devices, the whole Internet of Things play, isn’t the entirety of big data, of course. We see a few of the items on here that are part of that, but it’s a strong part of why we need to start collecting big data. There’s value in big data. That value is going to be the value proposition that we actually compete on in the ensuing decade. All the other things we are expected to have done. We are expected to have a good data warehouse or two or three in our enterprises today. We’re expected to be doing something beyond reporting. We’re expected to be doing deeper levels of that, which we call analytics. We’re expected to be looking forward, not always behind. It’s these connected devices, I should say, that are throwing off all manner of big data. The device count is just growing. It’s growing much more rapidly than any of us are really aware of.

On this image we see things that go into our bodies, things on our feet, obviously. Some of these are health-related, a lot of them are, but by no means is that exclusive to health. Big data is a big part of corporate data. It’s becoming an increasingly larger part of corporate data. Now, we’ve been living in the world, I’ve been living in the world, of transactional and application data for the past couple of decades, but there’s a lot more data out there: machine data, social data, enterprise content, content that we haven’t decided to really organize until today. That is the big data we’re starting to collect back in our enterprises and get under management.

That’s what I always say about the data. It may be there, but is it under management? Can you access it? Is it clean? Does it perform? Does it integrate with the rest of your enterprise? Those are some of the clues as to whether you actually are getting value out of that data. Increasingly we’re going to be turning our attention to machine data, social data and enterprise content. Now, think about machine data and social data; that really didn’t exist a few years ago. Well, it existed in small pockets, but it didn’t exist nearly at the level that it’s at today and at the level that it’s going to. It’s growing even more, and it’s going exponential on us.

This is data that we didn’t think about having until a few years ago. Machines were not throwing off data until then, social data did not exist, we may have had thoughts but we didn’t put them down somewhere in digital form where it could be harvested. All of this is forcing us to really take a mindset that Big Data is important to us. I’d really like to see that mindset taking hold at a client because then I know that the things that I want to do with that client for them in regards to their big data, the skids are going to be so much more greased when that data leadership mindset is embraced.

Think about your own organizations. Does it execute strategy with the knowledge that information is an asset to this company? It must be well-protected, it must be managed, it must be distributed. Those are things that really help out your projects. We are in the business of data. You are in the business of data. Your information is exploding; it’s real-time, all the time. Now some of you might say, “Well, some of our stuff isn’t real-time.” Well, you may not be being asked for real-time data, you may not be being asked for a lot of this. But that’s not the final solution. You have to probe beyond what you’re being asked to do in IT or in your technology pockets and help the users with what the possibilities are.

You’ve got to be bringing possibilities thinking to your organization. Because your information will differentiate you from your competitors. It will be the source of competitive advantage in the near future as I’ve mentioned. Your information quality impacts everything; your clients, your associates and your shareholders. Information is used and reused and it’s multiplied sometimes throughout the organization in different ways. Information usage drives data value.

Information is a key business asset. That’s the bottom line. That’s what I want organizations to really think about, because that really helps out in doing a lot of these things. It’s the next natural resource. Think about our natural resources: sunshine, water, oil (we can argue about that). These are things that are fairly prevalent and that we are dependent upon. We are now getting that way with data. We will not run out of data, obviously. It keeps coming. But we may be overwhelmed by it if we’re not prepared for it, if we’re not prepared to take on the data that could help us out.

In almost every organization, a dollar spent towards great information management is one of the very best uses for that dollar in the organization. It doesn’t matter if you’re at a maturity level of four out of five already, and some are. They’re still investing because they see where the future lies. They see that they need to. They see the ROI in investing in information. Please, think about that. Now, I call this the no-reference architecture because it’s not laminated. Everybody’s got a messy sheet of paper. Nobody says, “Come on in here. We’ve got nothing. Go architect.” We have to start with where you are.

Now, this is a very basic architecture. You might look at this and say, “Hey, I wish I was there.” Nobody is as clean as this, but a lot of us have data warehouses. We have data marts that are spun off from those warehouses. We might have cubes in that BI ecosystem, if you will. We might have some columnar databases. We might have some in-memory databases, or we might have columnar in-memory databases or columnar in-memory data warehouses or what have you. All combinations are out there. Believe me, they are. You might also have some master data. You might also be bringing in some third-party data.

This is great for today or maybe the past few years, but let’s move on a little bit here. We’ve got Hadoop now in the picture. I added that to this picture. It looks pretty big. Yes, it will eventually be bigger than your relational row-based data warehouses. It may not be bigger and important right away, but it will soon become pretty important. Again, I think it will become the point of competitive advantage what we put in Hadoop. A lot of us are yearning for Hadoop. How do we get there quickly? That’s what the paper’s about. That’s what my colleagues are going to talk to you about here in this hour.

We can get there quickly. We can get there effectively for the future. Now, just to finish this off, I’m going to bring in some graph databases, some NoSQL in the operational area, and some data stream processing, potentially. Finally, I threw something over the top here: the cloud. For every single thing you see on here there’s an added dimension to your decision process, and that is the cloud. It’s not just a yes or no, it’s a how, because you’ve got your public cloud, private cloud, hybrid cloud and so on. You can put all manner of applications in the cloud, and you can have multiple clouds, just to complicate matters.

This is where your cloud strategy comes in. That’s what I want you to adopt and think about because you’re going to need it. It’s going to really fundamentally change how we do technology for the next several decades, exponentially from where we have been. I believe that completes my no-reference architecture, but before I leave I want to point out that there’s a lot of things on here and you might be saying, “Well, do I need all of this?”

Well, maybe not, but big organizations are going to have all of this and probably need all of this. Because what we need to do is isolate our workloads, put some ring fences around our workloads and put them in the platform that they’re going to succeed in. Success we can define in different ways, but number one is going to be price performance, and scalability and things like this. We want to put our workloads in the right platform, and there is no one size fits all. If, with that next request that you get tomorrow, you automatically think, “Well, I’ve got to do it the same way I’ve been doing the last 100 requests,”

stop yourself and think: is there a better way? Is this the straw that breaks the camel’s back? Or maybe let me put it more positively: is this the request that enables the change in our organization that we’re going to need, so that we start moving in a different direction, away from something that may be simple but may not be as effective as it can be? By the way, on that point, it’s going to get more complicated before it gets less complicated in data. We’re not consolidating. We’re that accordion that’s being opened right now. There’s a need for data science. There’s a need for data architecture. There’s a crying need for these things.

Now, Hadoop. Hadoop has fallen into several patterns that we like, that we think are good uses for Hadoop. A data refinery that’s where we might do some of our ETL, specifically our data warehouse ETL inside of Hadoop, occasionally replacing some ETL tools or at least some ETL processes out there. That’s great. Archive storage. Now, this is at the end of the data chain, right? This is after it’s been in the data warehouse. It’s colder data so we just move it on to Hadoop. That’s a dirty little secret about Hadoop which is that’s the way it’s getting into a lot of organizations and it’s perfectly valid.

Hub and spoke really activates Hadoop and says, “This is not cold data. This is warm data that we’re going to use.” It’s pre-data warehouse, with varying levels of data access at the Hadoop hub that is distributing the data out to different places. We call this data science. Data science is what we’re doing to that data. We could also call this a data science laboratory, which is where we’re collecting data. This is known commonly as the data lake. It’s where we’re collecting all manner of enterprise data into a Hadoop cluster.

What we’re finding is that the true data scientists, which are few and far between (we need more), the ones that are taking advantage of Hadoop, don’t need those same levels of refinement as our users have needed over the course of years. They know what to do with the data. We’ve just got to get it together. Our data scientists are not being true data scientists, they’re not stepping up to their challenges, if they’re spending any percentage of their time collecting data. That’s somebody else’s job. That’s our job, the people on this call that are phoning in to learn about Hadoop. We’re setting up Hadoop for them.

Now, there’s also a pattern of Hadoop as the data warehouse itself. I’m skeptical of this pattern right now. I see a lot of utility in the relational database for your “data warehouse.” We can really get into this. I’m going to move on, but some are adopting Hadoop as a data warehouse and, frankly, some have been successful at it. We’ll see. Watch that space. But for now, I want to talk about what makes up a Hadoop cluster. It’s those nodes. There can be different profiles of nodes within the cluster. This is really key. Again, we’re talking about a minimum viable product for Hadoop. That’s what we’re building up to. We want to lay it out quickly, but we want to lay it out effectively.

It’s important to know that nodes can have different profiles. There is this thing called blob storage, which you can avail yourself of only if you’re in the cloud. This is where you’re separating the storage from the compute nodes. This way you don’t have to have the nodes that are doing both up all the time. You can down-spec the nodes from a storage perspective. You can shut them down easily. It’s just a small performance hit for the advantage of being able to bring nodes up and down, and for the cost savings. A way to save a lot of money on a Hadoop cluster is to think about the different profiles of the nodes that you can have and to really look into this thing called blob storage, or into a tool that helps you get into that effectively.

Now, no conversation about Hadoop is complete today without talking about how you are going to get at that data. MapReduce was the way. It is cumbersome. It is slow relative to Spark, which seems to be the way that’s taking off. We’re all about it here. It’s out of UC Berkeley. It can operate on any Hadoop source. It holds a query’s intermediate results in memory, not on disk, which reduces your query time. What it effectively does is keep all levels of processing on a query after the first within memory, taking advantage of memory just to be quick about it.

Consequently, a lot of us are up-speccing those nodes now because, hey, if we give it more memory, we get exponentially more performance out of it. We’re trying to find that perfect balance, that perfect node, that perfect cluster for Hadoop. We’re really paying attention to that now. I’ve always said Hadoop is more than scale out. It is scale up as well. We have to pay attention to the spec of our clusters. That’s what I’m getting at here. It’s often overlooked, but you could succeed or fail based upon the spec of your cluster. Now, Spark doesn’t come without its limitations. If you have a lot of concurrent activity in the cluster that’s sucking down a lot of that memory, it’s going to be difficult to get all that memory to Spark, which negates its value proposition. I’m just passing this along as a little food for thought here.

Do know this about Spark: if less memory is available due to the other factors you see there, Spark’s performance degradation curve can be severely non-linear. All that great performance that we’re expecting, we may not get, and the resources who know Spark and can do this type of engineering are rare and expensive. Yes, they are. We’re still going to do it, though. We’re still going to do Hadoop. Infrastructure strategy now includes cloud and, by the way, excuse me if my voice is a little hoarse today; hopefully it’s tolerable. There are some benefits to cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. These are the five things.

I’ve been dragging this definition around for a few years. It’s still holding true. It’s still the five things that I strive for when I get into cloud computing at a client and this is especially true when it is private cloud. Because public cloud typically provides this but private clouds may or may not. These are good barometers for your private cloud, your so-called private cloud. Is it truly private? Is it giving you these five things that we have deemed to be very important when it comes to the cloud? As you can see here you can put different application domains, you can put different cloud services and different deployment models in the cloud.

Finally, a couple of points about that cluster that you’re building out. Decoupled storage: take advantage of the cloud provider’s persistent low-cost storage. I got into this before. Some of you know about Amazon S3. This is where you have a lower-spec, lower-cost set of nodes that you can take advantage of for the right workloads. If you have something that’s automatically looking at that cluster and partitioning your workloads across it, that’s worth its weight in gold for you. You only pay for processing resources when they’re actually processing data.
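To put rough numbers on the decoupled-storage point, here is a minimal Python sketch comparing an always-on cluster with a cluster that runs only while jobs are processing and keeps its data in object storage. Every price, node count and data size in it is a made-up assumption for illustration, not a figure from the talk or from any provider’s price list.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation for one month of uptime


@dataclass
class Prices:
    """Hypothetical unit prices, chosen only to make the arithmetic visible."""
    node_hour: float        # cost of one compute node for one hour
    object_gb_month: float  # cost of keeping one GB in object storage for a month


def coupled_monthly_cost(nodes: int, prices: Prices) -> float:
    """Data lives on the nodes, so every node stays up for the whole month."""
    return nodes * HOURS_PER_MONTH * prices.node_hour


def decoupled_monthly_cost(nodes: int, busy_hours: float, data_gb: float,
                           prices: Prices) -> float:
    """Data lives in object storage; nodes are paid for only while processing."""
    compute = nodes * busy_hours * prices.node_hour
    storage = data_gb * prices.object_gb_month
    return compute + storage


if __name__ == "__main__":
    prices = Prices(node_hour=0.50, object_gb_month=0.02)  # assumed values
    print("always-on cluster:", coupled_monthly_cost(10, prices))
    print("decoupled cluster:", decoupled_monthly_cost(10, 120, 50_000, prices))
```

The comparison only favors the decoupled layout when the cluster is idle for a meaningful share of the month; a cluster that is busy around the clock pays for compute either way and adds the object storage bill on top.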

The other thing that you need to be aware of when you step into this, and many are not, is automated spot market bidding. Just like a spot market out here in the real world, you’re bidding on resources, on nodes for your Hadoop cluster. Amazon does this. To obtain a spot instance, you bid your price; when the market price drops below your specified price, your instance launches. This might introduce some delay into your processing, and you may be concerned about that, but what we found is that approximately half of all Hadoop workloads don’t need to run this minute or that minute that they’re ready to run.

They can be run anytime within the next 24-hour period without negatively affecting their value proposition to the company. That being the case, something that takes advantage of decoupled storage and spot market bidding makes a lot of sense. I think this is going to be true for a lot of Hadoop clusters going forward. Now, Hadoop data movement: I just wrote a book on this. There’s a lot to it. There’s a lot more to it than DistCp. That’s built on MapReduce. Again, I talked about MapReduce. Snowball is the physical way to send in data and get it loaded in the cluster. It sounds very archaic, and it is.
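The spot-bidding idea above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: prices are sampled once per hour, the job launches in the first hour whose market price is at or below the bid, and the job can wait up to 24 hours. The function name and the price series are hypothetical and do not represent any provider’s API.

```python
from typing import Optional, Sequence


def launch_hour_within_window(hourly_prices: Sequence[float], bid: float,
                              window_hours: int = 24) -> Optional[int]:
    """Return the first hour (0-based) in the window whose price is at or
    below the bid, or None if the price never gets that low in time."""
    for hour, price in enumerate(hourly_prices[:window_hours]):
        if price <= bid:
            return hour
    return None


if __name__ == "__main__":
    # Made-up hourly market prices for one node type.
    prices = [0.30, 0.28, 0.22, 0.19, 0.25]
    print(launch_hour_within_window(prices, bid=0.20))
    print(launch_hour_within_window(prices, bid=0.10))
```

A workload that can tolerate the wait gets the lower price; one that returns `None` here would need a fallback, such as paying the regular price, which this sketch leaves out.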

There’s dedicated gateways, log-scanning batch tools, and there’s active-transactional, which I find very interesting as a way to get the data into Hadoop and spread out from Hadoop as a hub-and-spoke. I’m going to let my colleagues talk a little bit more about this. Your Hadoop MVP. Let’s say you’re interested. You want to get Hadoop up and running. It seems like there’s 100 decisions to make. Well, there’s four major decisions to make.

Your Hadoop distribution, your cloud service, how you’re going to move data, and I touched on that, and finally something I’m not really talking about here: what is the elegant way to get that data out of Hadoop? We are saying it’s SQL on Hadoop. Look into that category to get the data out of Hadoop. That’s your Hadoop MVP, and this has been my little part of the talk here today. Hope you enjoyed it, and I will be back for the Q&A a little bit later. For now, I’m going to pass it along to Paul Scott-Murphy of WANdisco.

Paul Scott-Murphy: Thank you very much, William. This is Paul from WANdisco. I am in charge of product management for all things big data and cloud in our organization. What I want to talk about today is just to expand on some of the detail behind what William’s referred to in the challenges of migrating big data to the cloud, and how you can take advantage of active transactional replication to solve some of those challenges.

I’m going to talk a little bit about what we mean by active transactional replication, how WANdisco provides that capability in a product that we have called WANdisco Fusion, and explain the differences that are enabled by using this approach to moving data between non-cloud and cloud environments by taking advantage of active transactional replication. Firstly, what do we mean by the challenges of cloud migration?

William’s referred already to many of the benefits that come from using cloud infrastructure, particularly for big data applications and technologies around the Hadoop ecosystem. But one of the challenges that you need to overcome before you can take advantage of those facilities is: how do you make the data that you already have available in the cloud? Not all of your data will exist in the cloud natively. Of course, you ingest it in a variety of systems, many of which exist in your on-premise data centers.

That data has gravity. Moving it to the cloud is a challenge that you need to overcome. The difficulty there, of course, is that if you want to use traditional tools for simply copying data from your on-premise systems to the cloud you will need to disrupt your business operations to do so effectively. That type of downtime is going to have a real and significant impact to your business. In general, you can’t stop the world just to migrate your data to the cloud. The solution needs to be one capable of replicating changing data and that’s what we mean when we talk about active transactional replication. The solution that WANdisco provides is embodied in this product that we call WANdisco Fusion.

What it does is it replicates data across different types of file systems including Hadoop clusters, including cloud object storage, local or network attached file systems, in such a way that accounts for doing that replication when the data is undergoing change. I’ll explain very shortly how we do that. It’s powered by WANdisco’s patented replication engine to enable that active transactional replication to guarantee the consistency of your data regardless of where change occurs. If you’re modifying data on-premise or in the cloud WANdisco Fusion is capable of replicating that data across any number of sites at any distance on a selective basis.

It’s totally non-invasive to your existing infrastructure, so it can be added to your Hadoop clusters and to your cloud environments without modification to those, and it’s easy to turn on and off over time. How does it work? It’s a very simple and straightforward process. If a change is made in one environment, where the user application makes a change to an existing file or creates a new file in the file system, firstly, that change is coordinated between the different environments using WANdisco’s technology to guarantee the consistency between them. That coordinated change is then passed along to the underlying storage platform so that the application works against its local system at the local speed.

We have no impact on the speed or performance of operation. Subsequently, Fusion is capable of coordinating the writes that are performed to that local file system to ensure that the data are replicated out to the cloud environment or to the other systems involved. We do that in a way that coordinates that globally, so that regardless of where change occurs the changes are guaranteed to be consistent between the different environments. The Fusion infrastructure writes that data to the environments involved so that the data are available in every location. How is this different from other approaches to replicating information? What does active transactional replication give you that traditional approaches to copying files do not?
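One way to picture why globally coordinated writes end up consistent everywhere is a toy model in which every write receives a single agreed sequence number and each site applies writes strictly in that order, whatever order they arrive in. The Python sketch below is only that toy model: the `Coordinator` and `Replica` classes are hypothetical names, and nothing here describes how WANdisco Fusion or its replication engine is actually implemented.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict


@dataclass(frozen=True)
class Write:
    seq: int      # position in the agreed global order
    path: str     # file being created or modified
    content: str  # new content of that file


class Coordinator:
    """Hypothetical stand-in for the coordination step: it only hands out
    one global sequence number per write."""

    def __init__(self) -> None:
        self._next = count(1)

    def propose(self, path: str, content: str) -> Write:
        return Write(next(self._next), path, content)


class Replica:
    """One environment (on-premise or cloud) applying writes in global order."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}
        self._pending: Dict[int, Write] = {}
        self._applied = 0  # highest sequence number applied so far

    def receive(self, write: Write) -> None:
        # Buffer the write, then apply every write that is next in sequence.
        self._pending[write.seq] = write
        while self._applied + 1 in self._pending:
            nxt = self._pending.pop(self._applied + 1)
            self.files[nxt.path] = nxt.content
            self._applied = nxt.seq


if __name__ == "__main__":
    coordinator = Coordinator()
    w1 = coordinator.propose("/data/a", "v1")
    w2 = coordinator.propose("/data/a", "v2")
    w3 = coordinator.propose("/data/b", "x")

    on_prem, cloud = Replica(), Replica()
    for w in (w1, w2, w3):   # arrives in order
        on_prem.receive(w)
    for w in (w3, w1, w2):   # arrives out of order
        cloud.receive(w)

    print(on_prem.files == cloud.files)
```

Because both replicas apply the same writes in the same order, they hold the same file contents at the end even though the messages reached the second replica in a different order.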

At its core, active transactional replication operates on a continuous basis. Whenever change occurs in your on-premise environment, you can be guaranteed that that same data will become available as soon as possible in the cloud environments. The alternative is to do that on a periodic basis, where you need to schedule the replication or copying of content by scanning systems and transferring that in batches. That imposes some overhead. DistCp, as William referred to, operates as a MapReduce job that impacts the performance of your cluster. The synchronization that active transactional replication from WANdisco provides allows you to replicate data between multiple systems rather than just from one source to one target.
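The difference between periodic batch copying and continuous replication can be illustrated by how long a change waits before it is available at the destination. The Python sketch below is a back-of-the-envelope model, assuming a batch job that starts on a fixed interval and picks up everything changed since the last run; the interval, change times and transfer time are invented for illustration.

```python
import math
from statistics import mean
from typing import List


def batch_lag(change_minute: float, interval: float,
              copy_minutes: float = 0.0) -> float:
    """Minutes between a change and the end of the next scheduled batch run.

    A change that lands exactly on a run boundary is assumed to be picked up
    by that run.
    """
    next_run = math.ceil(change_minute / interval) * interval
    return next_run + copy_minutes - change_minute


def continuous_lag(transfer_minutes: float) -> float:
    """With continuous replication a change waits only for its own transfer."""
    return transfer_minutes


if __name__ == "__main__":
    changes: List[float] = [5, 61, 90, 119]  # minutes at which files change
    batch = [batch_lag(t, interval=60) for t in changes]
    continuous = [continuous_lag(0.5) for _ in changes]
    print("batch lags:", batch, "average:", mean(batch))
    print("continuous lags:", continuous, "average:", mean(continuous))
```

In this model the batch lag depends on where in the schedule a change happens to fall, and it can approach the full interval; the continuous lag depends only on the transfer itself.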

Because we do that, we allow the destination to operate in a read/write manner as well, where every participating zone in the replication acts as a peer of all the others. The consistency of that data is maintained in a strongly consistent manner using WANdisco, whereas alternative approaches are only eventually consistent. That supports use cases for migration, for disaster recovery, and for backup on a continuous basis that is automated, whereas alternatives require manual intervention. This scales natively to any number of clusters or destinations and scales across multiple cloud environments.

The alternatives, of course, are limited to two sites or become very complex when you want to extend it beyond two sites. The migration use case supported by this active transactional replication, lets you do so without downtime. Because you don’t need to stop the world it makes the approach to migration much more comprehensive, much easier and a lower risk to introduce in an environment where you can do migration while applications continue their normal behavior. The alternative requires you to stop ingest, to stop analytics, to stop processing, that can incur a significant detriment to your business operations.

The benefits that flow out of this when you have active transactional replication in place are, firstly, the continuous availability of your systems without degradation to performance. What that means is that you can take advantage of all of your computing resources at all times. If you want to spin up clusters in a cloud environment on a periodic basis or on an ad hoc basis to do short-term analytics across large volumes of data, that can be done when active transactional replication is in place, because we can guarantee that any results that are generated by those ephemeral clusters are available across all environments at all times.

Similarly, if you have long-lived computing resources in the cloud, you can take advantage of those rather than leaving them sitting idle just for the purposes of disaster recovery. You can ingest and modify data at any location at any time, and you can perform that replication on a selective basis with global access to your information. It also provides some benefits around protecting the data, in terms of minimizing the need to expose network ports and other security risks when replication is occurring. It is flexible and future-proof because of its ability to accommodate different types of storage and different versions of Hadoop, and to replicate across those versions, rather than being limited to a consistent common infrastructure.

This is how it enables you to support better approaches to cloud migration and to operate hybrid cloud environments, where you take advantage of both on-premise and cloud infrastructure at the same time. To summarize what WANdisco does with this approach to active transactional replication through WANdisco Fusion is that we solve the challenge that comes from the fact that your data will not always exist in the cloud. But when you want to take advantage of the flexibility and elasticity that cloud platforms for Big Data processing enable you to use, WANdisco Fusion through active transactional replication provides exactly that solution.

It lets you migrate to the cloud or take advantage of hybrid environments without downtime, without disruption and without locking you in to a solution that limits your use of different types of cloud resources. Zero-downtime active transactional replication between on-premise and cloud environments is really what WANdisco Fusion is about. It gives you the facility then to take better advantage of the types of infrastructure that I’ll now hand over to Qubole to talk about in some more detail. Thanks very much.

Jin Kwon: Hey, thank you, Paul. My name is Jin Kwon, I am the Senior Director of Product Management for Qubole. What I’m going to talk about today, I’m going to talk a little bit about Qubole. I’m going to talk about where we fit in to that Hadoop MVP and how we can accelerate Hadoop into a data-driven enterprise. Then, finally, I’m going to talk about the partnership that we have with WANdisco and how that accelerates the migration into the cloud. First, a little bit about us. We are the fastest growing Big Data as a service company focused exclusively on the public cloud. We deploy on Amazon AWS, Azure and Google Cloud Platform.

Our flagship product, Qubole Data Service, is designed to scale across the enterprise. We cater to data analysts, data scientists and data engineers alike and provide the tools and platform that they expect. As an example, for SQL analysts we have SQL-on-Hadoop options such as Hive, Spark SQL and Presto, and then for data engineers we have access to more programmatic and larger data transformation tools such as MapReduce, Pig and Spark. We were founded in 2011 by Ashish Thusoo and Joydeep Sen Sarma, who built the big data platform at Facebook and also were the founders of Apache Hive.

Very quickly, our customers span both fast-growing startups such as Pinterest and Lyft and well-established enterprises such as Comcast and Autodesk. We operate at cloud scale. We are processing nearly half an exabyte of data every single month. Think about all the software and the operating expertise that needs to go into that, both in terms of the number of users and the amount of compute usage. You can see that, compared to some of the leading internet brands, we’re even outstripping them from a data processing perspective. Our vision is self-service big data analytics.

What that really means is that we want to empower the data analysts, the data scientists and data engineers to be able to do analysis and make good business decisions. William talked about the value of data and how that is essentially a competitive advantage as the next natural resource. Our vision is we want to scale that ability across an enterprise, because we feel the only way that your company can truly become data-driven is if everybody has access to the data and has the ability to come to their own conclusions.

Now, there are three steps that are required to actually achieve this vision. One is you need to have an agile, flexible and scalable architecture. The second is you need the platform itself to be fairly automated and transparent to the user. Then, finally, you need to have simple yet powerful tools that can be used across user personas. The first step here is the architecture. This is really where we think that big data belongs in the cloud. I’m going to echo something that William talked about, which is that the original Hadoop model forced the convergence of compute and storage. There were some benefits to this model, which is that you can directly go and use commodity hardware.

But the downsides are actually quite significant as well. The downsides are that compute and storage have to scale together, which may not necessarily match your needs. Your processing needs may not change drastically, but your data needs may just grow over time. It also means that the data is not portable. It makes it more difficult to do fast experimentation without having to replicate that data. Then, finally, if you think about a hub-and-spoke model where there are lots of consumers of that data, what this implies is that your compute cluster needs to be persistently on as well as needs to have the available resources from a memory and compute perspective, to be able to support all of those folks.

This is why we feel Big Data belongs in the cloud. With the cloud you can take advantage of object store services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage and make that your data lake. What that allows for is, number one, very fast experimentation. For instance, you can start up a very small Spark cluster and read from the same data without having to copy it or replicate it. The second thing is it allows for experimentation on the hardware itself. You may have a certain hardware profile for your usual cluster, but when you try new things you want to be able to quickly try out maybe memory-intensive or compute-intensive hardware without having to order that and go through the typical hardware qualification process.

Then, finally, being in the cloud allows for much greater elasticity and scale. In the older model you have the resources that you have and they are provisioned. What that means is you have to spread out your workload over time. Whereas in the cloud, when you want to analyze your data, you can have thousand-node clusters that are sprung up within minutes, and then when you are done you can shut them down and rely on your object store and still have access to the underlying data itself.

The second piece of the puzzle is making the platform automatic and transparent to the user. There are two major things that we do here. The first one is we start with the analysts and we work backwards. Here I have an example of a Hive SQL query that an analyst might perform. You can see all they need to do is submit this to some hardware profile. It’s completely abstracted. When the command is submitted to Qubole Data Service, we automatically deploy the infrastructure and then we automatically shut it down when we find that the cluster is idle. All of this is actually transparent to the user. They don’t actually have to think about whether my cluster is on or off, or what the sizing is.

All I’m doing is the analysis itself. Another key part of that is the auto-scaling and the automation of the sizing itself. What we have found is that the demand for analysis is inherently unpredictable. On Monday you might find that at 11:00 AM, you have hundreds of concurrent queries coming from your analysts, but later in the week the spike might happen at 4:00 PM. What you would do in the past is you would either provision at a much higher level and not be as efficient in the utilization of the cluster or, more realistically, you would actually ask your users to try to spread out their load over the day. Both are, obviously, not ideal.

In the former case, you’re spending more money and in the latter case, you’re actually blocking analysis time and incurring opportunity costs. With Qubole, because we are auto-scaling the clusters and doing it dynamically within jobs, there’s no trade-off that needs to be made. In addition to the scaling we also automate the spot purchasing. William talked about the spot market on AWS as a great way to control costs and that’s exactly what we do at Qubole. When we actually look across our customer base, we find that roughly 50% of all compute hours are done on spot instances. Here is just a simple example of how auto-scaling actually works.

The green line here shows the number of concurrent commands that are coming from an analyst group and then the blue line shows the actual number of nodes. You can see that with Qubole, auto-scaling essentially replicates the demand while putting the controls in place. This one has a control not to exceed 10 because the administrator had decided that’s the maximum that they wanted to spend. The final thing here is having simple and powerful tools and interfaces. The reason why this is important is that in order to drive a data-driven enterprise, you need to make the analysis and the data available across even sometimes nontechnical users.
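As a rough illustration of the capped auto-scaling described above, here is a minimal Python sketch. It is not Qubole’s actual algorithm: the function name `target_nodes`, the assumed capacity of five concurrent commands per node, and the sample demand figures are all hypothetical. Only the cap of 10 nodes comes from the example in the talk.

```python
import math


def target_nodes(concurrent_commands, commands_per_node=5, min_nodes=1, max_nodes=10):
    """Return a node count that follows demand but stays within the admin's limits.

    commands_per_node is an assumed capacity figure, not a Qubole setting.
    """
    if concurrent_commands < 0:
        raise ValueError("concurrent_commands must be non-negative")
    # Nodes needed to serve the current demand at the assumed per-node capacity.
    needed = math.ceil(concurrent_commands / commands_per_node)
    # Clamp to the administrator's floor and ceiling (the "not to exceed 10" control).
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical concurrent-command counts sampled over a day (the "green line").
demand = [0, 4, 12, 30, 80, 45, 10, 0]
# Resulting cluster sizes (the "blue line").
nodes = [target_nodes(d) for d in demand]
print(nodes)  # [1, 1, 3, 6, 10, 9, 2, 1]
```

The node count tracks the demand curve until the spike of 80 commands, where the cap holds the cluster at 10 nodes instead of the 16 that demand alone would call for.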

What we find with Qubole is that there’s a 1 to 21 ratio between administrators and users. What that means is for every one resource that is setting the configurations and putting the controls in place, there are 21 end consumers that are actually accessing that data. If you compare that with more traditional on-premises setups, you’re looking at closer to a one-to-one ratio. The way that we make this happen is we have the visual interfaces for nearly any type of analysis. I’m going to show you a couple of examples.

Here is a notebook where a data scientist can explore data through Spark SQL or through any of the Spark variants and then directly chart the results and be able to interpret the results. Here is an example of, also within the notebook, a simple mapping application where the results of an analysis can be directly mapped via latitude and longitude and then help visualize what’s going on to a nontechnical user. Then, finally, we have the simple interfaces for even more technical users.

Here is an example of the database that is happening and you can see in this example actually, the data is being backed by S3, which is the data lake for this customer, and the abstraction is simply that there’s a table that I can go query with SQL. Finally, the last thing I’m going to show is a very simple scheduler that we provide so that even analysts can run daily reporting and daily ETL jobs. To summarize, to achieve the vision of self-service analytics, we needed to have the architecture that was scalable, flexible, and agile. We needed to have a transparent and automated platform and then, finally, we needed the simple interfaces and tools for analysts to use.

Now, a key part of it is actually being able to migrate your data into the cloud and, for some enterprises, even maintaining a hybrid on-premises and cloud setup. What we have done is we have partnered with WANdisco. For a limited time actually, we have a special quickstart program where you can migrate up to three terabytes of data to Amazon S3 for no charge and then also there’s no charge for using the WANdisco or Qubole services for three weeks. In the presentations that we send to you after this webinar, we will have a link that takes you to the landing page for this. You can start working with Qubole and WANdisco. Here are some relevant links but with that, I think we’re going to open up for questions.

Heather: Thank you so much, William, Paul and Jin, for the presentation. At this time we’ll take on some of the questions that have been posted by our audience. The first question is I [unintelligible 00:47:32] for Big Data but some of the patterns make it sound like it’s for legacy tabular data too. Which is it?

William: I’ll come in on that one, Heather. This is William. That’s an interesting observation and it’s absolutely correct. If you asked me a couple of years ago about what Hadoop is for, I would tell you unstructured or semi-structured data only, batch access, and that’s it. But today, through the help of my clients, who always are the ones that illuminate me, and I drag those practices forward, but anyway, they have shown me that there is some value in some relational data, and I showed you in several of the patterns where there might be some structure, I should say, structured data that fits into Hadoop.

What we’re finding is that you have to look at the value proposition of it. There are no hard and fast rules anymore when it comes to Hadoop, because the Hadoop ecosystem has grown tremendously and products like what we saw here today are out there at your disposal to make it a lot easier, so it’s getting muddier. It’s like there are concentric circles here between the relational world and the Hadoop world now, whereas they used to be pretty different.

Heather: Thanks, William. The next question is, what about relational data that I have? I already have MySQL. How can I add that to my data lake?

William: Well, I can come in on that one as well if I understand it correctly. I already have some SQL, how can I add that to my data lake? Well, the data lake is going to be on Hadoop and the data lake is going to obviously contain a lot of your higher volume data. Otherwise, there’s no point in a data lake but you can bring in lower volume data as well. As a matter of fact, it’s pretty smart to do that.

What we’ve been doing is bringing in master data, which is obviously relational, which is obviously lower volume. This is your master list of customers, products and so on, which obviously comes out of a relational SQL-based database. But we’ve been bringing it there because the scientists want that hand in hand with the volumes of the big data that they’re analyzing at the same time. So they don’t want to be creating some distributed query where they’re running all over the place. We want it there for them, and so as long as it’s organized and you understand what’s in there and you have some sort of directory to the data, you can put all manner of data into your data lake and be successful.

Jin: Let me add to that. This is Jin from Qubole. I think what we see across our customer base is not just the desire to join the database data with perhaps unstructured data as William mentioned. But also there’s a desire to just do more complex and more scaled analysis on the database itself. What we find is databases very quickly fail when you do large joins and when you scan lots and lots of data. Even that is a use case where you can bring the data from the database directly into a data lake and use one of the tools that is really designed for scale such as Hive or such as MapReduce to more efficiently analyze that data.

Heather: Thank you. Next question. How does WANdisco improve on standard file transfer tools?

Paul: Paul from WANdisco here. That question is clearly for me. I’ve already talked a little about the differences between what WANdisco’s Fusion does with active transactional replication and how alternative file copy mechanisms operate. There are some obvious benefits in terms of performance, in terms of manageability, and the simplicity of using a technology like WANdisco Fusion. The core difference between our approach and other techniques of copying files is that active transactional replication performs the replication on a continuous basis between multiple environments regardless of where the change occurs.

What that means is that, by introducing active transactional replication, you’ve essentially provided a single logical virtual file system that can span different technologies across multiple sites any distance apart and that operates on a continuous basis. The improvements that we add are structural to some extent. We’re providing a different type of service from just copying files. The benefits that come out of that are really where the differences lie. It allows you to implement the types of use cases that simply can’t be done by copying files using other tools.

Heather: Thank you. Next question. William showed a data warehouse, a columnar database and an in-memory database in the graphic. Can’t the data warehouse be columnar, in-memory, or both?

William: Okay, I did, and that’s correct. I think I said something about not trying to distinguish too hard between all these things. Everybody’s different. It’s a no-reference architecture, and clearly there are in-memory columnar databases. Of course, I’m taking us back into the relational world here for a minute, but like SAP HANA and so forth. Yes, absolutely. Could that be your data warehouse? I think it could if you had the right volume and the right level of urgency about the request and you wanted to put that kind of performance wind at your sails, more power to you if you have that kind of data science that requires that. Some of the other appliances, obviously, are columnar in nature.

I actually think that columnar is the most relevant structure for a data warehouse. Unfortunately, living in the real world, most of my clients, when I come there, they already have a data warehouse and it’s probably not on a columnar database. What we look to do is maybe turn on some of the columnar features of their database or maybe have multiple data warehouses. Again, moving the data to the best-fit platform for the data, for the workload, and that’s okay too. Yes, absolutely, your one data warehouse or your one of many data warehouses can be columnar and/or in-memory.

Heather: Thank you. Next question. When Qubole implements its auto-scaling, how do you ensure that the downscaling of clusters is safe?

Jin: This is Jin. That’s a really good question. What we typically find across our customers is they typically understand upscaling because it’s relatively easy to provision new nodes and to do more work. With downscaling, there’s actually a bunch of protections that we put in, particularly around the working data itself. We try to make sure that any data that is temporarily used within HDFS on a node is replicated across some other nodes before we mark the node for deletion.

What that allows for is a relatively stable experience with the job, meaning that even if you’re downscaling within a job, you’re not going to affect the results or the analysis. It also provides, obviously, the cost control so that your cluster can always be the right size based on the workload.
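The downscaling protection described above, replicating a node’s temporary data elsewhere before marking the node for deletion, can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Qubole’s implementation and not the HDFS API: the names `safe_to_remove` and `decommission`, the `block_locations` mapping, and the sample blocks and nodes are all hypothetical.

```python
def safe_to_remove(node, block_locations):
    """True when every block held on `node` also lives on at least one other node."""
    return all(
        len(holders - {node}) >= 1
        for holders in block_locations.values()
        if node in holders
    )


def decommission(node, block_locations, live_nodes):
    """Re-replicate blocks that exist only on `node`, then drop the node."""
    others = sorted(live_nodes - {node})
    if not others:
        raise RuntimeError("cannot remove the last node in the cluster")
    for holders in block_locations.values():
        if holders == {node}:
            # Stand-in for a real block copy to a surviving node.
            holders.add(others[0])
    # Only proceed once nothing on the node is the sole copy.
    if not safe_to_remove(node, block_locations):
        raise RuntimeError("replication incomplete; node kept")
    for holders in block_locations.values():
        holders.discard(node)
    return live_nodes - {node}


# Hypothetical temporary blocks and the nodes that currently hold them.
block_locations = {"tmp-1": {"n1", "n2"}, "tmp-2": {"n3"}, "tmp-3": {"n1", "n3"}}
live_nodes = {"n1", "n2", "n3"}

was_safe = safe_to_remove("n3", block_locations)   # False: tmp-2 lives only on n3
remaining = decommission("n3", block_locations, live_nodes)
print(was_safe, sorted(remaining))  # False ['n1', 'n2']
```

The point of the ordering is that the node leaves the cluster only after every block it held has a copy somewhere else, which is what keeps a running job’s results unaffected by the downscale.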

Heather: Thank you. We have time for one more question. Can WANdisco help manage bandwidth when replicating content to the cloud?

Paul: That’s a good question. It gets to the detail behind the difference between continuous replication that WANdisco performs and copying files at a point in time. When you’re copying a file at a specific point in time, you’re controlling when you use your bandwidth. But if replication is taking place continuously, you’ll be consuming bandwidth to do that replication because when change occurs, the replication has to take place. The feature that we have within the product to improve on just copying files as quickly as you can is what we call bandwidth limit policies.

This allows you to define, for replication from one particular site to another, a limit on the bandwidth that will be consumed, so that as you’re replicating content continuously, you ensure that you don’t exceed that limit. The trickle-through of data that occurs on a continuous basis then gives you a much more efficient way to do that replication overall.
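One common way to enforce a bandwidth limit of this kind is a token bucket, sketched below in Python. This is a generic technique offered as an illustration, not a description of how WANdisco Fusion implements its bandwidth limit policies. The class names `BandwidthLimiter` and `FakeClock`, the 1,000 bytes-per-second limit, and the 500-byte chunk size are all hypothetical; the fake clock is there only so the example runs instantly.

```python
import time


class BandwidthLimiter:
    """Token bucket: allows at most `bytes_per_second`, with a one-second burst."""

    def __init__(self, bytes_per_second, clock=time.monotonic, sleep=time.sleep):
        if bytes_per_second <= 0:
            raise ValueError("bytes_per_second must be positive")
        self.rate = bytes_per_second
        self.capacity = bytes_per_second      # burst budget: one second of traffic
        self.tokens = float(self.capacity)
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, nbytes):
        """Block until `nbytes` can be sent without exceeding the limit."""
        if nbytes > self.capacity:
            raise ValueError("chunk is larger than one second of budget")
        self._refill()
        while self.tokens < nbytes:
            self.sleep((nbytes - self.tokens) / self.rate)
            self._refill()
        self.tokens -= nbytes


class FakeClock:
    """Simulated clock so the demo does not really wait."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
limiter = BandwidthLimiter(1000, clock=clock.time, sleep=clock.sleep)
for _ in range(5):            # replicate five 500-byte chunks
    limiter.acquire(500)
print(clock.now)  # 1.5
```

The first two chunks go out immediately from the initial burst budget; the remaining three wait half a second each, so 2,500 bytes take 1.5 simulated seconds and the stream never exceeds the configured rate beyond that initial burst.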

Heather: Thank you. Thank you, William, Paul, and Jin, for a great and insightful presentation. Thank you to our audience for your time and participation. Please, take a moment to rate today’s webinar using the ‘rate this’ button. We would also love to read any comments you have or suggestions for future topics. The full presentation and slides will be available for replay and sharing within a few minutes. Thanks, again, for coming.