How Oracle is Using Qubole to Yield 90% Cost Savings

 

Host: Good morning, good afternoon, or good evening, depending on where you are in the world, and welcome to today's webinar, "How Oracle is Using Qubole's Heterogeneous Cluster Management to Yield 90% Cost Savings." Today's speakers will be Justin Wainwright, an Oracle systems analyst focused on the Qubole platform, and Goden Yao, Qubole's principal product manager, working on the Qubole products, obviously.

Goden is going to start the conversation today, but I think he’s going to take just a few minutes to talk a little bit about all the exciting things that have been happening between Oracle and Qubole lately. With that, I’ll pass the ball over to Goden.

Goden Yao: Hi, everyone. My name is Goden. I'm a product manager at Qubole. If you've been paying attention to our website, you probably know there's a lot going on between Qubole and Oracle. Yesterday, we launched the QDS service, GA, on top of Oracle Bare Metal Cloud. This is an exciting step: Qubole now operates on all the clouds that we know and love.

There's other work we've been doing with Oracle, which is a built-in database connector to the Oracle Bare Metal Cloud database service. We will provide the ability to import structured data from Oracle Cloud DB and also the ability to issue queries directly to Oracle Cloud DB from within QDS. The last thing I want to mention is cluster lifecycle management and auto-scaling. This has been a key differentiator for Qubole against other competitors, and with QDS and Oracle Bare Metal Cloud service, enterprises now have the agility to start experimenting very quickly with auto-scaling and cluster lifecycle management.

Today our topic is about something else: the heterogeneous experiments that we started last year. We were lucky to have Oracle as our first trial customer, and today we're going to share their journey with this feature and how they worked with Qubole to achieve cost savings and high availability with their clusters. With that being said, I'm going to move to the agenda, Adam.

I'm going to talk a little bit about Qubole, in case you're not familiar with what we do and what kind of services we provide. Then I'll talk about Spot instances. I'm going to use the Amazon EC2 Spot instance market to give you some idea of how volatile the Spot market can be and why we wanted to introduce the heterogeneous feature to our customers.

Next, we'll have Justin talk about his journey with Qubole enabling the heterogeneous cluster feature for his customers across Oracle's divisions. Then I'll show you a demo. This demo is pretty simple; it's just the cluster configuration. I'll show you how you can configure this feature with Qubole, what each column means, the weights, and how we actually use the configuration to auto-scale a heterogeneous cluster.

The last part is the cost-saving comparison. This is more from Qubole's perspective: how we see heterogeneous clusters potentially saving costs for our customers. Then we'll go to the Q&A, and, like Joe just mentioned, you can type your questions into the chat and we'll collect them and answer them toward the end of the presentation.

All right, so what is Qubole, in case you're not familiar with what we provide? Qubole, in a short sentence, is a big data service provided in the Cloud. We designed our product for all the personas in an enterprise organization, covering data scientists, data analysts, data engineers, and data admins. If you have a data org that serves internal business users and needs to set up on-premise clusters or manage pipelines, et cetera, those are all potential users of our platform.

Right now, Qubole is mainly designed and established on AWS. We are also working on Microsoft Azure, where we already have a prototype and Cloud customers. We also support Google Cloud Platform, and yesterday we did the GA with Oracle Bare Metal Cloud, so that makes us a truly Cloud-native, Cloud-optimized, and Cloud-agnostic provider.

This year, we are also focusing on Cloud intelligence. This means we will optimize our platform and build more intelligence for our customers, giving them more flexibility with fewer knobs to tune, so everything works without further configuration. From the service perspective, we provide Spark, Hadoop, Hive, Pig, and Presto: all engines that are popular in the open source community and widely used by enterprise users.

We support all the typical workloads, from ETL to ad hoc queries, from machine learning to streaming. There's a lot of innovation in the platform, and I'd encourage you to check our website and contact us if you have any questions. That will give you a broader overview of what we're doing and how we can help you. Okay, so that's Qubole.

The next thing, before I jump into heterogeneous clusters, is a quick overview of the Spot instance market, because that's actually what drove us to think about this feature and why heterogeneous clusters could help our customers. Here, you can see the price history for an AWS EC2 instance in the Spot market.

Now, this is a c3.4xlarge, and the date was last year, November 21st. The time period here is roughly five hours. You can see how much the price can fluctuate during five hours. It stays under $0.30 per hour for a while. Then, from I think 6:10, it started to jump and almost double or even triple, as you can see here at some points. That's the c3.4xlarge.

Let's look at another instance type on the next slide. This is the c3.8xlarge, so it's supposed to be twice as powerful as the c3.4xlarge, and if you check its price within the same time period, you can see that, yes, it's a little bit above $0.30 per hour, but it's nowhere near the peak from the previous slide.

That gives us the opportunity to help our customers and say, "Maybe during that period of time, when your default instance type hits its peak, instead of bidding continuously on your default instance, we could provision another instance type that is cheaper, stronger, more powerful, and can help you complete the job in a shorter time at a lower price." That's the motivation and inspiration that drove us to this feature.
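
To make that comparison concrete, here is a small sketch of the arithmetic. The prices are hypothetical, chosen only to mirror the shape of the charts Goden describes; the point is that a bigger sibling instance can be cheaper per unit of capacity when the default type spikes.

```python
# Hypothetical Spot prices at the same moment in time ($/hr).
spot_prices = {
    "c3.4xlarge": 0.84,  # assumed to be at its peak
    "c3.8xlarge": 0.35,  # assumed to be near its baseline
}
# The 8xlarge is roughly twice the capacity of the 4xlarge.
capacity_units = {"c3.4xlarge": 1.0, "c3.8xlarge": 2.0}

for itype, price in spot_prices.items():
    per_unit = price / capacity_units[itype]
    print(f"{itype}: ${price:.2f}/hr, ${per_unit:.3f}/hr per unit of capacity")

# c3.4xlarge: $0.84/hr, $0.840/hr per unit of capacity
# c3.8xlarge: $0.35/hr, $0.175/hr per unit of capacity
```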

On the next slide I'm going to show a quick screenshot, which I'll come back to in the demo later. Before we get into why we built this, first: what is a heterogeneous cluster? It's very simple. In a homogeneous cluster, you always use the same instance type across the cluster. In a heterogeneous cluster, you can use different instance types within the same instance family.

In this case, in the screenshot, the default is a c3.2xlarge. You can use other instances from the c3 family, the 4xlarge or 8xlarge, or you can even go across instance families, for example an r3 or an m4. That's how flexible this heterogeneous cluster feature is. Why do we want to enable this feature for our customers? One thing I just mentioned is cost.

Other instance types could be cheaper at a specific point in time compared to your default instance type. In that case, Qubole's platform can automatically leverage that price difference for you, without any manual intervention; there's nothing you have to kick off and no script to run, you just configure it once. That's the potential cost saving, and it can also give you more powerful instances, so you can complete your job in a shorter time.

Another reason behind it is availability. We'll probably hear the story from Justin later, but basically, let's say you have a job running on Spot instances. At some point, Amazon decides to take them back from your machine pool, or because too many people are bidding on Spot instances, your price just doesn't beat the others, et cetera. That could cause your job to fail.

If you configure heterogeneous clusters, that gives the platform alternatives: it can provision other instance types and continue your job. From a high-availability perspective, it helps you achieve a higher SLA and gives you more stability to run a job. Okay, I think I've talked enough about what heterogeneous clusters are. Next I'm going to pass it to Justin, and he's going to tell you his journey with heterogeneous clusters from Oracle's perspective. Justin.

Justin Wainwright: Thank you, Goden. Good morning, Joe. Good morning, Goden, and good morning to the audience. Thank you to Qubole for the opportunity to share our story here. I'm looking forward to helping everybody understand how we got here, some of the challenges we encountered, and some of the lessons we learned, and hopefully it can save you all a bunch of money along the way.

Just to give you some background, we've been on Qubole for about two to three years now. The last four to six months have probably been the most challenging, but also the most exciting, for us. We've moved away completely from a physical data center; we're purely in the Cloud at this point, and we've migrated from a number of other applications strictly to Qubole.

We have a very large footprint out there and we're very dependent on it for our business. As you can see here, this is a representation of where we are today, but just as recently as four months ago, this slide would have looked a lot different. Back in September, we had about 75 clusters; 23 of those were Spark. We only had about 11 Hadoop 2 clusters at that point, and no heterogeneous at all.

We had about 41 Hadoop 1 clusters. The overall goal starting last quarter was to investigate heterogeneous clusters, because we had heard about the Spot Fleet feature and were looking for it to help us with a lot of the challenges we were encountering with AWS's market as a whole. The push was to aggressively explore Hadoop 2 with a couple of select teams, focus specifically on the heterogeneous feature, see what challenges would arise from it, and figure out how we would embrace it as our standard going forward.

Let's move on to slide nine and we can get more in depth. Basically, the challenge we were running into was that we were trying to run 100% Spot clusters, because we wanted to get away from on-demand as much as possible. The problem was that we had some shortcomings and shortsightedness, I guess you could say, from the beginning.

We thought that Hadoop is memory intensive in most cases, so it made logical sense to say, "R3 instance types are what we should use across the board." That's somewhat true, but it depends on what you're using it for. We just deployed all of our clusters as R3s, and in some cases teams got a little bit greedy with R3s and started setting their cluster sizes to 100 nodes, 200 nodes, 300 nodes.

Eventually, we priced ourselves out of our own market. We were bidding at 300% to 400% of on-demand cost for r3.8xlarges, and we just kept shooting ourselves in the foot, costing ourselves more in on-demand usage. We started missing SLAs. We started running jobs multiple times, over and over again. One team would start up a cluster, and another cluster would run out of available nodes and couldn't auto-scale anymore.

We ended up with this sort of tribal mentality where people were coming in earlier in the morning to get their clusters up first and win the battle of "I got all the power today." We needed to find a way to distribute the power and wealth and succeed across the board. We started looking into reserved instances. We bought a whole bunch of c3.8x and r3.8x instances and thought, "This might help us." It did, to a certain extent.

The problem was we didn't have fully utilized clusters, so as people ran their jobs and finished, they'd lose those nodes to another team, and when they came back from lunch and tried to spin things up, they wouldn't be able to run any clusters anymore. We had the same situation where the later it got in the day, the more tension there was.

Again, we needed to find another solution, and that's how we ended up at slide 10, the next one. The first group that we decided to use as our guinea pig was the operations group. The reason they were a good candidate was that they weren't high-demand users, but they were regular users. Their clusters stayed up for about 12 hours a day. They had a steady flow of jobs, but they didn't really tax the system too much.

They had the jobs that were easiest to rewrite for Hadoop 2 as opposed to Hadoop 1. They had a moderate balance of CPU and RAM as well as storage needs. They were only using 2x, 3x, and 8x sizes. When they first started, they were running about one-third of all their nodes on demand, Spot usage was around 40%, and they were using about 25% reserved instances.

When we first started exploring Hadoop 2, we moved them to an m3.2x cluster for both master and slave nodes. Their on-demand usage went down to about 15%, Spot usage went up to 75%, and they were using about 8% to 10% reserved. That was good, but we could probably still do better. That's when we started looking at heterogeneous.

When we started turning heterogeneous on with m4.xlarge, 2xlarge, and 4xlarge, that's when we really started seeing the benefit. The number here, $1,300 a week in Spot cost, was originally with the m3.2x. When we went to the m4 heterogeneous cluster, on-demand cost went down to about 8%, Spot usage went up to 88%, and we were only using about 4% reserved instances. We still had some m4 reserved instances that we had bought, and we kept looking at those, because again, we thought reserved was going to be the price point that would save us the most, but as usage continued over time, we kept seeing Spot prices drop.

About halfway through this migration, as we got more and more jobs converted and rewritten to embrace Hadoop 2 and some of the other engines and processing power we had within Hadoop 2, we started getting the numbers even lower. As you can see, last month we were down to about one-third of what we were originally paying for Spot, so operations is very happy because, one, they don't have to rerun their jobs.

Two, they don't have the management team going after them complaining about how much money they've spent. Three, they're getting their jobs done within acceptable limits. The main thing we had to stress with them, though, was to start small. Originally, their jobs would run for probably eight or ten hours, and we got those average run times down to about two to four.

Today they're even better, down to about one to one and a half hours, so this was a group that helped us formulate a lot of our standards for a lot of our other groups. A bigger example is the next group, which is our modeling group. These guys run pretty heavy usage around the clock. They started on c3.8x and pretty much blew through all the reserved instances we had pretty quickly.

Once we started seeing the benefits from the operations group, we started exploring options for them. Where operations started with m4.2x, we started these guys with a base of m4.4x. They improved from mainly on-demand and reserved instances initially to about a 15% on-demand rate and 85% Spot instances, and that was $4,000 per week for the cluster.

After we fully got heterogeneous turned on with a combination of m4.xlarge, 2xlarge, 4xlarge, and 10xlarge, we actually got them down to about 2.5% on-demand usage, and now they're using about 95% Spot. Again, the cost was almost cut in half, from $4,000 down to $2,400 a week. The main difference with these guys is that they run larger average job sizes and they have multiple concurrent runs.

We explored using the 10xlarge instance type for them because at certain volatile times throughout the day they could blow through a ton of xlarge and 2xlarge instances. If they burst up to 10x, that means they use fewer nodes overall, plus they free up more xlarge and 2xlarge instances for other groups that may need them for their clusters as well.

The recent change that we made to all of our ad hoc default clusters means that we have a lot more heterogeneous clusters and, therefore, a lot more demand for xlarge and 2xlarge, so we need to have more available for everybody across the board.

Let's move on to the next slide, please. This focuses on the main time period where we explored heterogeneous and made our big steps over the last three to four months. As you can see, overall on-demand has gone down from 16% of our total bill to only 6% now. Spot costs have gone down from $172,000 a month to $134,000 per month.

On the heterogeneous front, every Hadoop 2 cluster we have today is running heterogeneous. We're just starting to explore heterogeneous on Spark, but there are some tougher lessons to be learned there; namely, with Spark you want to start even smaller. Spark uses executors, which are unique to the Spark engine, and if you don't tune them properly, heterogeneous won't help you as much.

It will still help you, but if you try to force a 100-gig executor into a 2xlarge instance type, you're not going to have much success. There's a little bit more user education that needs to go on there and a little bit more reworking that needs to happen. We have a couple of teams willing to dedicate some time and effort to rethinking that, and they have had some success, so we're looking to grow a lot in that area in the next three to four months.
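
As an illustration of that "start small" advice, here is a minimal sketch of Spark executor settings sized to fit the smallest instance type in a heterogeneous pool rather than the largest. The specific values are hypothetical, not recommendations from the webinar.

```python
# Small executors pack onto any node size in the pool, so an xlarge
# provisioned during auto-scaling is as useful as a 4xlarge. A single
# 100 GiB executor, by contrast, only fits the very largest nodes.
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.executor.memory", "4g")                # fits comfortably on a 16 GiB node
    .set("spark.executor.cores", "2")
    .set("spark.yarn.executor.memoryOverhead", "512")  # headroom for YARN's container overhead
    .set("spark.dynamicAllocation.enabled", "true")    # let the executor count scale with load
    .set("spark.dynamicAllocation.maxExecutors", "100")
)
```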

Some of the other lessons we've learned: primarily, when going to Hadoop 2, you want to use the same master and slave base instance type. Again, look at your average-sized job. Don't size for your largest job; go with your average, and go with, if you can, 2x, maybe 4x if you have to. That gives you a little bit of wiggle room on both the low end and the high end if you need to burst up or scale down.

Embrace ad hoc clusters. Don't be afraid to put a small cluster out there, point many people to it, and let them do the majority of their work there. You can always spin off a larger heterogeneous cluster for dedicated groups that have more specific requirements, and you should build around those requirements. Whenever possible, use minimal overrides in your clusters; override at the job level instead, as in the sketch below. Don't set four or six gigs of memory for every task on the cluster, because heterogeneous is going to use the smallest chunk size you have, and you can always auto-scale up and use more nodes if you need to.
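
A minimal sketch of what "override at the job level" can look like, using the qds-sdk Python client. The token and query are placeholders, and the SET statements are standard Hadoop 2/YARN properties; treat this as one reasonable way to do it, not the configuration Justin's team used.

```python
from qds_sdk.qubole import Qubole
from qds_sdk.commands import HiveCommand

Qubole.configure(api_token="YOUR_API_TOKEN")  # placeholder

# Only this one memory-hungry query asks for bigger containers; every
# other job on the same heterogeneous cluster keeps the small default
# chunk size, so small instance types stay usable for auto-scaling.
query = """
SET mapreduce.map.memory.mb=4096;
SET mapreduce.reduce.memory.mb=6144;
SELECT ...;  -- placeholder for the actual job
"""
cmd = HiveCommand.run(query=query)
print(cmd.status)
```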

The most important thing whenever you're using heterogeneous clusters is to know your audience. Know which job is CPU bound, which one is RAM bound, which one is disk bound. Know your job sizes, know your number of concurrent jobs, and most importantly, talk to your users and get feedback. You need to know which configurations work, and you need to not be afraid to do some trial and error and redo things a couple of times. It's not an exact science.

You're going to find some people running jobs outside the ordinary that may blow your heterogeneous numbers out of the water, and then you need to rethink: maybe those should be run on a different, more suitable cluster. Another lesson is to know your back-end architecture. If you're using AWS, know all of your AWS limits, especially in regard to instance limits.

If you do fall over to on-demand, make sure you're not going to blow through all of your on-demand instance limits in a particular region. If you're using m4 instance types like we are, make sure you have all your EBS limits checked; know how many volumes you can attach overall in your environment and the total volume size you're using.

If you blow through all of your terabytes of EBS, then you're not going to be able to attach any more volumes to new nodes, and therefore you're going to lose your auto-scaling opportunities. Make sure that your subnet allocations are large enough to hold more instances than you were running before heterogeneous, especially if you're using small instance types like large and xlarge. You could be running 10 times more nodes than you were in the past. That's not necessarily a bad thing, but it can be a bad thing if you don't have enough allocation to spin up that many nodes.

Know the market. Check your availability zones regularly, and make sure you don't throw everything into us-east-1a, for example, and then have no instance types left to bid on. You're not the only one bidding on the market, but again, you could be like us and price yourself out of the market just going from one team to the next. Consider whether special events are going on.

When re:Invent is going on, for example, Spot prices on popular instance types can run about 10 times the normal on-demand price. You have to factor that into the equation. Another option you can use with heterogeneous clusters, though it may be a little too risky for normal usage, is Stable Spot. We didn't cover this earlier with the architecture, but Stable Spot lets you bid Spot prices on your minimum nodes, which are normally launched on demand.

If you're using heterogeneous clusters, it lets you split your risk across multiple instance types, so the odds of losing Spot nodes are a lot lower in this configuration, and Stable Spot becomes a more affordable and safer option. Again, if you're using Spark, make sure to go with small executor sizes; build your jobs around a 2xlarge instance size, for example, and then allow yourself to spin up 50 to 100 executors if you need to. Again, don't be afraid to do trial and error with your groups, and keep that feedback loop open as much as you can.

Consider Tez, consider Hive on Spark, and above all, avoid any costly operations if you can. Try to mitigate the risk of costly operations like group-bys and operations that funnel you down to one task. Heterogeneous isn't going to help you when everything is going down to one reducer; you need to scale things out and distribute the load as much as you can.

I know that was a lot of information to take in, but those are all the lessons we've learned over the last four months. Let's go to our default configuration here. These are the numbers and configurations that we settled into. Again, a lot of trial and error went into this, and a lot of inflated bills might have gone into this, but this is the happy medium that we settled on.

This may or may not work for all of you, but this seemed to be a good spot for us. Again, we like to keep the minimum nodes low; that keeps on-demand costs low and still gives you enough of a safety net on the back end, so that you're not exposing yourself to too much risk. And then we have the option to spin up to larger instance types if we need to.

Hopefully, this gives you a bit of a framework and an example to start off with and hit the ground running. Your results may vary.

Goden: All right, thanks, Justin! Hopefully that was helpful to everybody, and you can think about how you might leverage heterogeneous clusters in your production environment and potentially achieve the same or even better results as Justin's team. Now I'm going to show a quick demo, sharing the screen from my computer. Yes, I'm sharing my screen right now. If you're not familiar with Qubole, this is our Qubole online portal, api.qubole.com, and there's a dropdown where you can go to the clusters.

You can see I have a bunch of clusters already created. For the configuration I'm going to show you, I can choose a Hadoop 2 cluster here. By the way, right now we only support heterogeneous on Hadoop 2 and Spark clusters, these two. If I go to Hadoop 2, you can see here the Master Node Type and the Slave Node Type. To enable the heterogeneous configuration, we check this box, and you can see we get a dropdown.

The node types are the heterogeneous instance types you want to use when an auto-scaling event happens. Auto-scaling applies to the slave nodes, so that's here. If I choose from the same instance family, which is c3, it gives you an easy way to configure the weight. Let's say I'm choosing c3.4xlarge here; it automatically calculates the weight as two. I believe this is currently based on memory: because this one has double the memory of the default, it gives you a weight of two.

Now, if I change the default instance type here, let's say I set it to the 8xlarge, the weight becomes 0.5. This weight is always relative to the default instance you put here, based on memory size. Of course, you can choose instance types from other families. Let's say I choose another one, say c4, or let me choose m4, that's better. Here, you can see it automatically calculates a weight with decimals, also based on the memory. The sketch below walks through the same arithmetic.
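
Here is a back-of-the-envelope version of that weight calculation, using the published AWS memory figures for these instance types. It is only a sketch of the rule Goden describes (weight = candidate memory divided by default memory); the platform computes this for you.

```python
# Published memory sizes for these instance types, in GiB.
MEMORY_GIB = {
    "c3.2xlarge": 15, "c3.4xlarge": 30, "c3.8xlarge": 60,
    "m4.xlarge": 16, "m4.2xlarge": 32, "m4.4xlarge": 64,
}

def weight(candidate, default="c3.2xlarge"):
    """Weight of a candidate type relative to the default slave type."""
    return MEMORY_GIB[candidate] / MEMORY_GIB[default]

print(weight("c3.4xlarge"))                        # 2.0, as in the demo
print(weight("c3.4xlarge", default="c3.8xlarge"))  # 0.5 once the default is 8xlarge
print(weight("m4.2xlarge"))                        # ~2.13, a cross-family decimal
```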

Usually, in the cross-instance-family case, we recommend that customers set the weight themselves, because across families the default calculation is always based on memory, but maybe you want more cores, meaning CPU, or you know your jobs better, you know what the cluster is used for, you know your users better. That's why it's better if you give us a weight here to tell us.

Equivalently, the weight says: if I were to provision one default instance, how many of this heterogeneous instance would I need? You can add, I think, up to 10 instance types here. This is a priority order, so when we actually kick in and say, "I'm going to do the heterogeneous provisioning for you," it's based on the order of the instance types listed. We leverage the Spot Fleet API from Amazon, feed all the information in, and get the results back.

That's pretty much the configuration. It's easy to understand. If you only configure from the same instance family, the weight is also pretty accurate, so you don't even have to change it, but if you configure across instance families, you probably want to change the weight. Right now you can see there's a beta tag here, but very soon we're going to remove this beta tag and GA this feature for all of our Qubole users, within a few weeks.

You can also learn the details from our online documentation, including how you can configure this through the API, not just from the UI; a sketch of that is below. I think there's also a blog we wrote around last November talking about the heterogeneous cost savings and other benefits. I would recommend reading that blog; it will give you more details, and you can combine all the information there with this webinar and see if it could be beneficial to your organization. Now, I'm going to go back to-
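
For reference, here is a rough sketch of what the same configuration might look like through the cluster API. The `heterogeneous_instance_config` shape follows Qubole's documentation as best it can be reconstructed here; treat the exact field names, and the instance choices, as assumptions and confirm them against the documentation Goden mentions.

```python
# First entry is the default slave type with weight 1.0; list order is
# the priority order used when an auto-scaling event fires.
heterogeneous_instance_config = {
    "memory": [
        {"instance_type": "m4.2xlarge", "weight": 1.0},
        {"instance_type": "m4.xlarge",  "weight": 0.5},   # half the memory of the default
        {"instance_type": "m4.4xlarge", "weight": 2.0},   # twice the memory of the default
    ]
}
```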

Justin: Goden.

Goden: Yes.

Justin: Could I chime in on this for a moment?

Goden: Sure.

Justin: Specifically, in regard to weights: what we found was that it was better to use a small instance type for your base slave node type, because as you can see with the weights there, if they're decimals, that means it's going to provision two or four or more of those nodes for each one of your slave node type. You can run into a situation, like we did, where you blow through many nodes within a given family and exhaust the Spot pool a lot quicker.

We found that if you went with a small base node type, with the slave node type based around your average job size, then your weights work in the other direction and you use fewer nodes. In this case, instead of needing two 4xs for every 8x in your slave configuration, you're using half as many 4xs as you would 2x nodes.

Sorry, there's one more thing. If you go up to the larger instance types, you have fewer available nodes across the AWS infrastructure; there are more xlarges and 2xlarges available than there are 4xs and 8xs, for example, so you want to use more of the ones that are available.

Goden: Yes.

Justin: All right, I'll let you get on with your demo. [chuckles]

Goden: Yes, that's a good point. I think that's pretty much it, and the input from Justin was very good: you want to check which machine pools are generally the most available, from a historical perspective, and see if that would give you the most benefit when you configure the heterogeneous settings. I'm going to go back to the slides.

Hopefully, everyone can see my slides now. Here is a cost-saving estimation from Qubole's perspective. Earlier you saw the slides from Oracle's cost perspective, but those put everything together across Spot instances, which includes the heterogeneous configuration; that was the overall cost saving they've seen from their side. Now, from Qubole's perspective, we can single out the heterogeneous instances and see exactly what the cost saving is for just the heterogeneous part.

I did some calculations based on our product metrics. I picked October, which is this slide, and December, which is the next slide I'm going to show you. October is when Oracle had just started trying heterogeneous, so you can see the number of heterogeneous clusters configured was eight. The number of nodes we provisioned for them was about 13K during the one-month period.

The number of heterogeneous cluster provision occurrences; what does that mean? Whenever an auto-scaling event occurs and that auto-scaling involves heterogeneous instances, I count that as one occurrence. You can see that over just 22 days, there were almost 300 occurrences where the heterogeneous configuration helped the cluster auto-scale.

The costing here is based on the us-east-1 region. What I did was calculate the heterogeneous Spot instance prices in each availability zone, like the most commonly used b, c, and d, and then did a reverse calculation: "If I were to use the default instance type on Spot, what would the cost be for each zone, given the same node counts and instance types?" Here's the math.

You can see that across zones b, c, and d, we save 53%, 66%, and 59% respectively. Remember, this comparison is just against Spot: the heterogeneous Spot instances save this much against the default instance type on Spot. We have other data in the raw data sheet which shows that, compared to the on-demand price, it's almost a 90% cost saving.

The next one is the December data, with the same KPIs. You can see that in December, as we just mentioned, they started using heterogeneous more across the teams. Almost triple the nodes got provisioned compared to October, and the occurrences are definitely higher. Here we see a slightly lower cost saving against the default instance, but again, if you compare it to on demand, it's still around 90%. That's the data we have from Qubole's perspective.

For the next one, I did another quick look to see the percentages when we convert the default instance type to the heterogeneous types with the user's configuration. You can see here that the 2xlarge goes to xlarge and large at roughly a 2:1 ratio. For the 4xlarge, we do have a hybrid use case; here it was provisioned with the 10xlarge. It's a very small percentage, but it represents the hybrid usage pattern from Justin's team.

Remember, this is still the December data, so by January this year, I think, it's more widely used and the data could be different. That just gives you an overview of the percentages when the defaults are converted to heterogeneous types. With that, that's pretty much all the content we have. We're going to move to the Q&A part, so we're open to questions. If you have any questions, please put them in the chat window and we'll see if we can answer them.

Participant: I don't have any questions, but are there any guidelines around assigning weights to the different types in a heterogeneous cluster?

Goden: Are there any guidelines around assigning weights to the different types in a heterogeneous cluster? I think we briefly mentioned that. If it's within the same instance family, the weight is pretty much automatically calculated based on memory, because we know cores and memory are proportional within the same instance family. I think Justin also mentioned it. Justin, do you want to reiterate how you configure these weights for the audience?

Justin: Yes. Again, just use small base instance types and then burst up to the larger ones if you need to, as opposed to blowing through four or six-and-a-half or five-and-a-half nodes, however the math works out, if you're using larger instance types for your base. Again, it depends. If you're using Spark, obviously you need to use base types that match your executor and container sizes, but the base rule is: try to start as small as possible and burst up if you need to.

Goden: All right, cool. Any other question?

Host: With that, we just want to thank everybody for attending. Oh, we've got one more, sorry. What has been the most useful or popular use case that lends itself to heterogeneous clusters?

Goden: I wouldn't say heterogeneous is specifically designed for certain use cases. Any use case you have in production right now can potentially leverage heterogeneous. As I mentioned, and as Justin showed you, the two different cluster types, the operations cluster and the modeling cluster, actually have different use cases, but they both benefit from heterogeneous.

I would say, when you think about your use case, you should consider the benefits you can get from heterogeneous and see if you can design the whole configuration around that use case, so you get more out of it. In general, heterogeneous can be used with the majority of popular use cases at this moment.

Host: Justin, do you want to chime in on that question?

Justin: Yes, I would like to add to what Goden said. In our case, the two examples that seemed to be the best fit for us were, first, clusters that stay up for long periods of time, because heterogeneous saves you from Spot loss. Especially if you're running queries that take multiple hours at a time, you don't want to lose nodes and cause yourself headaches in terms of having to re-run jobs and worry about your cluster failing on you. Anything that looks like it's going to run for half to maybe a full workday, that's a good cluster to enable heterogeneous on.

Second, if you're using very popular instance types like the R3 family in particular, bid prices are all over the place for R3s over the course of the day. They can be really cheap first thing in the morning, and by the time you reach noon, they can be four or five times the normal price. If you're using those types of instances, heterogeneous definitely saves you a lot of headaches.

Host: Okay. There are people asking specific questions about Oracle, and I think that might go beyond the scope of this call, but we can pass those questions to you offline, Justin. Anybody else have questions? Here's another question for you, Justin: how much time did you spend on the trial and error?

Justin: I'd say it depended on the group. Operations was a little easier to fit into the heterogeneous model; we had about a month of back and forth there. Modeling took a month before they fully got onto the platform. They had a test cluster that they were toying around with for a while, and then, probably another two to four weeks after they were live on it, we had to do a couple more tweaks based on their usage.

Again, your results may vary, but it's a fairly easy configuration to get running, so don't worry about the learning curve, and don't worry about shock value when you first turn it on. Just turn it on, see what happens, and go from there.

Goden: I see another question from the audience: does Qubole automatically go for the lowest-priced Spot instance for the job? The quick answer is yes. There are two aspects to "lowest." The first is that we automatically pick the lowest Spot price among the availability zones, if you don't have a preference for where you want the cluster to be provisioned.

One thing I forgot to mention is that when we auto-scale, the new nodes are always added into the same availability zone. Let's say your master and your minimum slave nodes are already in availability zone b; then when we auto-scale, we will add the new nodes, no matter whether it's a heterogeneous or homogeneous cluster, into the same zone. That's the first thing.

However, if the cluster is not up, let's say you started a new job and the cluster is just being started, then we'll compare and say, "Hey, across these three zones, which one has the lowest Spot price?" and provision in that zone, if you didn't configure a preference. We have a cluster configuration for which zone you want us to provision in, and most of our customers leave it at "any," which gives us the flexibility to provision at the cheapest price.

The second part is that because you'll have multiple instance types in the heterogeneous configuration, Qubole will also automatically check which one is potentially the cheapest, and we'll provision that one for you. It's also based on priority: we'll try the first one on the list, then the second.

Host: Thanks for that. We also got another question: when is auto-scaling triggered, and how does it help?

Goden: This is more of a generic question about the Qubole platform. Auto-scaling is triggered automatically when we detect that the job load is beyond what the current cluster size can handle. If you still remember, let's go back to here; this is the typical Oracle configuration example. When you set up the cluster, you tell us your minimum nodes and your maximum nodes. Auto-scaling scales you between the minimum and the max.

We have the max nodes because we want a cap, a limit, on the cost; we don't want to scale without limit, where you can't control how large your cluster, or your cost, would be. So you configure your cluster with a minimum and maximum, and then we auto-scale when the job load exceeds the current size. Based on that, we can put in heterogeneous instances, or if you didn't configure heterogeneous, we just try to get Spot instances of the default type. A rough sketch of these knobs follows.
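
For illustration, here is a sketch of the sizing knobs Goden describes. The field names are indicative rather than the exact API schema, and the values are hypothetical.

```python
# Auto-scaling works between the minimum and maximum node counts; the
# maximum acts as the cost cap Goden mentions.
cluster_sizing = {
    "master_instance_type": "m4.xlarge",
    "slave_instance_type": "m4.2xlarge",
    "initial_nodes": 2,   # the minimum; kept low so on-demand cost stays low
    "max_nodes": 50,      # hard ceiling so scaling cannot run away with cost
}
```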

Host: Perfect.

Goden: Okay.

Host: With that, I just want to take one moment as a marketer and make an announcement about our upcoming Data Platforms 2017 conference. We will be hosting and sponsoring this conference, which happens May 24th to 26th. Data Platforms 2017 is the first industry conference focused exclusively on helping data teams build the modern data platform.

We'll have luminaries coming to speak from eBay, Facebook, LinkedIn, and Uber. We would love to have you all come and join us in May, in Arizona. One of the follow-up emails that we'll send you will include the link to Data Platforms 2017. We'll also send a link to our weekly live demo and the blog that Goden referenced earlier today, just to make sure you have all that information.

We really appreciate everyone's time today; a lot of you hung in there with us through the whole question and answer. We really appreciate that, and we'll see you in May, I hope. With that, here's information about how to get in touch with Qubole. You can chat with [email protected] today, or you can always go to the website; there are a lot of contact-us forms.

Obviously, we'll be following up with you after this meeting. Thank you all so much. With that, we're going to give you your day back. We appreciate your time. Thanks so much. Bye now.