Unlocking Self-Service Big Data Analytics on AWS
Discover the best practices for deploying or migrating to big data in the cloud with AWS and Qubole. Learn how to more easily create elastic Hadoop, Spark, and other big data clusters for dynamic, large-scale workloads. Unlock the value of big data on AWS with Qubole, a cloud-native big data platform to help you simplify and optimize your ML, AI, and analytics projects. The webinar features Rahul Bhartia, a Solution Architect at Amazon Web Services, and Dharmesh Desai, a Technology Evangelist at Qubole.
Rahul Bhartia: This webinar covers AWS big data services and the Qubole service together. I am Rahul Bhartia, a solution architect on the Amazon Web Services partner team. With me is Dharmesh Desai, a technology evangelist with Qubole, who is going to talk about how Qubole can provide the services you need to manage your data effectively at scale.
With that, let’s get started. As we all know, data is growing at an unprecedented pace. To give you some numbers, 1.7 megabytes of data might be generated by 2020 from every single human on planet Earth. Just to give you an idea of scale, we have a world population of over seven billion as of this year, and by 2020 the population could very well surpass 10 billion. That is a data set which pushes the boundaries of what traditional technology can do in terms of analysis.
If you look at the other side, only half a percent, or 0.5%, of the data is being analyzed today. Yet the Hadoop market, or the big data market, is forecast to grow at a compound annual growth rate of 58% and to surpass $1 billion by 2020. All of this data doesn’t mean that you have to worry about how you are going to solve the big data problem, as demonstrated by customers such as FINRA, Netflix, Unilever, Kellogg’s, Yelp, Mbalm and Siemens.
What they have done is take big data and turn it to their advantage by doing new things with their data sets. For instance, Netflix not only uses big data to analyze what people are watching, but also to understand how viewers engage, how it can create new shows, and what kind of data sets it can cache so people get a better response when streaming Netflix.
To help you build some of these solutions, you should look into using AWS for your big data workloads. First, you don’t have to think about provisioning any infrastructure or capacity. All of the services provided by AWS are immediately available for you to use. You can get started with a Hadoop cluster or a data warehouse cluster in a few minutes, and you can terminate the cluster when you don’t need it, thereby saving on the cost of your big data workloads.
Not only that, AWS gives you very broad and deep capabilities in terms of compute, network and infrastructure. Going higher up the stack, it also provides services for things like real-time streaming, Hadoop clusters, data warehousing and even BI workloads, which you can use for your big data, database and other large workloads quite easily. On top of that, you can build a trusted and secure platform by leveraging compliance programs such as PCI DSS, FedRAMP, FISMA, ISO 27001, or even HIPAA.
You can encrypt your data both at rest and in transit using the features we provide when your data is stored in services such as S3, DynamoDB, or even Amazon Elastic Block Store volumes. All of these services work at scale. It doesn’t matter whether you are a customer at Netflix scale, storing petabytes of data, or a small customer storing only gigabytes of data.
In each of these situations, with AWS or with the cloud, you don’t really have to worry about changing your data infrastructure as your data set grows. You can take advantage of cloud elasticity and Spot Instances to manage your workload better and in the most cost-effective way. If you look at the broad and deep capabilities of the AWS platform, we provide services that help you collect, store, analyze, and visualize your data sets.
For collect and store, we have the Import/Export service, where you can ship your volumes. We have Snowball, which is a new service announced by AWS. We have Direct Connect, where you can make a part of the AWS cloud, called a Virtual Private Cloud, an extension of your own data center. Or we have virtual machine import and export. On the storage side, we have Amazon S3, which is object storage, and Amazon Glacier, which is long-term archival storage.
We also have Amazon Redshift, which is a data warehouse; Amazon DynamoDB, which is a NoSQL database; and Amazon RDS with Amazon’s own offering called Aurora, which is a MySQL-compatible, cloud-native database. On the analysis side you have things like Amazon EMR, Amazon EC2, AWS Lambda, and even Amazon Kinesis, which can help you collect data in real time. For the visualize part, we recently announced our service called QuickSight.
QuickSight is available in preview right now, and you can use it to quickly analyze and visualize the data sets in your cloud. These are some examples of the services you could use to handle the different aspects of a big data workflow. What you could do with these services and your data sets on AWS is far bigger than what we can think of.
To mention some examples, you could build a big data repository, or the quintessential data lake. You could do clickstream analysis. You could even do things like ETL offload, where you take advantage of the cloud for burst capacity in case your on-prem infrastructure is not able to cope with growing data sets. With Amazon Machine Learning you can get started doing something as advanced as machine learning in a few hours.
You can build an online ad-serving platform using DynamoDB, or quickly build a BI application using Amazon QuickSight. These are some examples of what you could do using the AWS big data platform and your data set together. With that, I’m going to invite my colleague Dharmesh to talk about how to migrate your big data to the cloud and take advantage–
Dharmesh Desai: Thanks, Rahul, for a great introduction to what Amazon provides as products and services. Hello everyone, my name is Dharmesh, or Dash, and I’m a Technology Evangelist with Qubole. For those listeners who are not familiar with Qubole or our Qubole Data Service, we’re a big-data-as-a-service platform in the cloud, founded by Ashish Thusoo and Joydeep Sen Sarma in 2011.
Prior to founding Qubole, they were part of building and leading the original Facebook data service team. During that time they also authored several industry-leading big data tools, including the Apache Hive project. In the next 30 minutes or so, we’re going to cover several topics: Qubole as a big data service and self-service platform, its cloud advantage in conjunction with AWS, the big data Hadoop migration service it provides, and an in-depth look at a customer case study.
Let’s get started by reviewing some of the advantages of migrating your big data deployments to the cloud.
One of the big ones is being able to separate storage from compute. We’ll talk about this in detail in a little bit, but basically it reduces the complexity of managing multiple environments. For example, production, development and analytics processes could run on the same data sources without having to duplicate or replicate data.
Then, since initial provisioning can occur in hours instead of days or weeks, it provides quick time to value. Since iterations can be faster, you can also make changes on the fly while clusters are up and running. It’s all policy driven, with minimal operations management overhead. You can take full advantage of the elasticity of the cloud, and QDS has built-in auto-scaling at multiple levels, including storage, compute and even at the Spark job level.
Depending on your workloads, it automatically scales the clusters. The other advantage is overall cost: total cost of ownership is significantly lower. A couple of tidbits there: you can mix and match On-Demand, Spot and Reserved Instances. As a side note, 84% of Qubole-instantiated clusters leverage Spot Instances, and that results in up to 80% savings for our customers compared to On-Demand pricing.
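To make the cost argument concrete, here is a minimal Python sketch of how mixing Spot and On-Demand nodes changes the bill. The hourly prices are hypothetical placeholders chosen so that Spot is 80% cheaper, in line with the "up to 80%" figure quoted above; real Spot prices fluctuate and vary by instance type, and the function names are illustrative only.

```python
# Hypothetical hourly prices (assumptions, not real AWS prices).
ON_DEMAND_PRICE = 0.40  # USD per node-hour
SPOT_PRICE = 0.08       # USD per node-hour, 80% below On-Demand


def cluster_cost(nodes: int, spot_fraction: float, hours: float) -> float:
    """Cost of running `nodes` for `hours` with a given share of Spot nodes."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    hourly = spot_nodes * SPOT_PRICE + on_demand_nodes * ON_DEMAND_PRICE
    return hourly * hours


def savings_vs_on_demand(nodes: int, spot_fraction: float, hours: float) -> float:
    """Fractional savings of the mixed cluster versus an all-On-Demand one."""
    baseline = cluster_cost(nodes, 0.0, hours)
    mixed = cluster_cost(nodes, spot_fraction, hours)
    return (baseline - mixed) / baseline


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        pct = savings_vs_on_demand(nodes=10, spot_fraction=fraction, hours=5)
        print(f"spot share {fraction:.0%}: savings {pct:.0%}")
```

Under these assumed prices, the maximum saving is reached only when every node is a Spot node; any On-Demand share in the mix reduces it proportionally.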
Flexibility is also much higher, because QDS supports 40-plus Amazon instance types out of the box, and you can create, test, run and schedule queries using multiple engines, for example Hive, Pig, Presto, Redshift, Spark, et cetera. It also provides a REST API for programmatic integration from your existing applications, as well as SDKs for Python, Java, R, and Scala. Last but not least, it has ODBC and JDBC connectors to Redshift, MySQL and NoSQL databases such as MongoDB.
As Rahul mentioned earlier, many companies are using data as a differentiator, and Pinterest is definitely one of them, using the value of data as a core part of the customer experience. If you’re a foodie, for example, descriptive guides will help you sift through all the yummy recipes, foods and cuisines that you like. As you tap on pins that you like, it steers your search in a direction that gets more specific to your sweet spot.
Pinterest relies on QDS, or Qubole running on Amazon, for several reasons: QDS’s advanced support for Spot Instances, including 100% Spot Instance clusters; its Amazon S3 eventual consistency protection; and its advanced auto-scaling. According to a recent case study, Pinterest generates over 20 billion log messages and processes nearly a petabyte of data with Hadoop each day.
In terms of users, they have over 100 regular MapReduce users running over 2,000 jobs each day through Qubole. That’s pretty up there. Just to briefly touch on why big data deployments are difficult: they can be really rigid. There’s also a chance of being locked into a certain hardware specification, software licenses and so forth. They also require highly specialized in-house teams that can handle and maintain these kinds of deployments.
Out of the box, they’re really difficult to build and operate. Just to give you a sense in numbers, it takes about six to 18 months to implement. Out of those implementations, only 20% of big data initiatives are classified as successful. Of those, only 13% of implementations achieve full-scale production, and according to the surveys and studies cited here, 57% of companies cite the skills gap as a major obstacle. Give it a second for these stats to sink in.
Now, companies have an option here, so you don’t really have to go down that path. Let’s look at how Qubole simplifies big data deployments in the cloud by leveraging the cloud’s unlimited scaling and compute capacity. In the next few slides, short of an actual demo, we’re going to review some aspects of Qubole’s web interface and see how they cater to different members of the data team, including data analysts, data engineers and data admins.
Let’s look at data analysts first. What you’re looking at here is the notebook part of Qubole’s web interface. Notebooks provide an interactive interface for data exploration. Here analysts can view, run and visualize the results of SQL, Scala and Python in a single collaborative environment. They can also save queries. Queries and results are persisted and can be viewed even when the cluster is not up and running. They can create multiple notebooks targeting different engines and different clusters.
It also integrates with GitHub for version control and tracking changes. If you have a large team, that kind of integration comes in really handy. What notebooks allow them to do is build a larger application step by step. If you have a couple thousand steps working toward a particular machine learning algorithm using Spark, you can version control the entire application using the GitHub integration, which I think is pretty neat.
They can also use the Smart Query interface. It allows them to build queries visually without having to know SQL, or if they just don’t like typing, like a lot of people. It can also be used to compose queries with filters and ORDER BY clauses and to limit the number of rows returned. It’s a pretty handy interface. Let’s move on to data engineers. What we’re looking at here is the scheduler interface, part of QDS, or Qubole Data Service.
Using the scheduler, data engineers can run complex queries and commands at specific intervals without manual intervention. They can build out workflows and schedule jobs for automation. It also supports scheduling queries and commands targeting multiple engines. That includes Spark (what you see here is Spark SQL), Hive, Presto and Pig, and it also lets them import and export data from external data sources as part of the scheduled workflow.
One of the advanced settings that I’d like to point out is being able to add dependencies. For example, if you have a scheduled job that depends on a particular Hive partition, you can set that here. Another example would be if you want a particular file to exist on S3 before a job runs. There are a lot of different things you can set within this interface when you are scheduling these jobs.
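The dependency idea described here can be sketched in a few lines of Python. This is only an illustration of gating a job on its prerequisites, not Qubole's scheduler implementation: the class and helper names are hypothetical, and plain in-memory sets stand in for an S3 bucket listing and a Hive metastore.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class ScheduledJob:
    """A job that may run only when every dependency check passes."""
    name: str
    # Each dependency is a zero-argument callable returning True when satisfied.
    dependencies: List[Callable[[], bool]] = field(default_factory=list)

    def is_ready(self) -> bool:
        return all(check() for check in self.dependencies)


def file_exists(store: Set[str], key: str) -> Callable[[], bool]:
    """Dependency on a file; `store` is a stand-in for an S3 listing."""
    return lambda: key in store


def partition_exists(partitions: Set[str], partition: str) -> Callable[[], bool]:
    """Dependency on a Hive partition; `partitions` is a stand-in for the metastore."""
    return lambda: partition in partitions


if __name__ == "__main__":
    s3_keys = {"logs/day1/_SUCCESS"}
    hive_partitions: Set[str] = set()
    job = ScheduledJob(
        "daily_rollup",
        [
            file_exists(s3_keys, "logs/day1/_SUCCESS"),
            partition_exists(hive_partitions, "dt=day1"),
        ],
    )
    print(job.is_ready())          # the partition is still missing
    hive_partitions.add("dt=day1")
    print(job.is_ready())          # both dependencies are now satisfied
```

Expressing each dependency as a callable keeps the job definition independent of where the check actually looks, so a real check against S3 or Hive could be swapped in without changing the gating logic.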
Moving on to the next interface, Analyze. Again, data engineers can use this to compose complex queries and commands targeting different engines: Hive, Hadoop, Pig, Presto. On the left you can see a number of tabs. The Compose tab, which is what you see right now, is where you would run all your complex queries. These composed queries can be saved for future reference, and the saved queries can be accessed via the Repo tab you see on the left.
Using the S3 tab, they can look at all their Amazon S3 buckets and see what files and folders they have created. That’s also visible on the left, next to Tables. The Tables tab allows them to examine all the schemas and tables in Hive that they may have created, and to look at the columns and their data types, as you can see in the screenshot. The History tab is really handy.
It allows data engineers to access past queries, results and logs, and any comments that may have been added during the collaboration phase between team members. An important thing to note here is that all of this is persisted and can be accessed even when the cluster is not up and running. You can go back and see what you did differently a week ago or a month ago, look at the results, compare the results, and export the results right from here.
A lot of different things are possible through this interface, and you can also target different clusters depending on the queries and the nature of the workload via the drop-down that you see on the top right. It says “Default” right now. So, a pretty handy interface. Now, let’s look at how data admins can use the unified interface as part of their daily schedule. What you’re looking at here is the Control Panel.
This is where they would manage settings related to clusters, users, roles, sessions, et cetera. There are a few that I really want to highlight. What you see here is the Cluster Settings page. There are a few attributes that are really important as far as admins go. One is cluster type. You can preconfigure a Hadoop, Hadoop 2, Spark, or Presto cluster. The next ones I’d like to point out are the minimum and maximum node counts.
You can set the initial number of slave nodes in the cluster, and then specify the maximum number of slave nodes. You can specify the node types for each. QDS supports 40-plus instance types out of the box. Another really important attribute here is Disable Automatic Cluster Termination. Qubole by default terminates clusters when not in use, but that can be overridden by this setting. What this means is that the clusters will not be shut down.
The clusters will, however, still be downsized to the minimum size when there’s not much workload. The other section I want to cover quickly is cluster composition, which is really important when it comes to auto-scaling and TCO, depending on the use cases and workloads. When auto-scaling the cluster beyond the minimum node count, you have the option to select either On-Demand or Spot Instances for your auto-scaling nodes.
That setting comes in handy depending, like I said, on the use cases. The next one I’d like to highlight is the Qubole Block Placement policy. To minimize the impact of losing a node if you use, for example, all Spot Instances, which are volatile, Qubole implements this policy, which is in effect by default and forces a copy of the data to be stored on stable nodes. The next one is called Fallback to On-Demand Nodes.
Setting this means that if Spot Instances are not available within the timeout specified, Qubole will automatically fall back to On-Demand, so it does not affect the jobs or processes you currently have running. The Spot Instance percentage is how many of those nodes you want to be Spot versus stable, or On-Demand.
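The interplay between the Spot Instance percentage and the fallback setting can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not Qubole's actual logic: `spot_available` stands for the number of Spot nodes actually obtained before the timeout expired, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Composition:
    """How many auto-scaling nodes of each purchase type end up in the cluster."""
    on_demand: int
    spot: int


def plan_autoscaling_nodes(
    extra_nodes: int,
    spot_percentage: int,
    spot_available: int,
    fallback_to_on_demand: bool,
) -> Composition:
    """Split nodes added beyond the minimum between Spot and On-Demand.

    spot_available: Spot nodes obtained before the timeout (assumed input).
    """
    if not 0 <= spot_percentage <= 100:
        raise ValueError("spot_percentage must be between 0 and 100")
    wanted_spot = extra_nodes * spot_percentage // 100
    on_demand = extra_nodes - wanted_spot
    granted_spot = min(wanted_spot, spot_available)
    shortfall = wanted_spot - granted_spot
    if fallback_to_on_demand:
        # Replace every Spot node that could not be obtained in time.
        on_demand += shortfall
    return Composition(on_demand=on_demand, spot=granted_spot)


if __name__ == "__main__":
    # 10 extra nodes, half requested as Spot, but only 2 Spot nodes obtained.
    print(plan_autoscaling_nodes(10, 50, 2, fallback_to_on_demand=True))
    print(plan_autoscaling_nodes(10, 50, 2, fallback_to_on_demand=False))
```

With fallback enabled the cluster still reaches the requested size at a higher cost; with it disabled the cluster runs smaller than requested until Spot capacity returns.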
Moving on to the next screen here, this is where admins can manage roles, users, and groups.
I won’t spend too much time on this, because it’s a pretty standard implementation of roles, users, and groups-based security. This is where they would set which users can access which parts of QDS and so forth. Users can generally join many accounts with a single profile, but a user can only have one default account at any given point. Moving on to the next interface that they would use. This one is kind of important: it is where they would monitor cluster usage.
What they can do from here is look at different settings and different ways to generate reports, for example how the cluster was formed, the number of On-Demand versus Spot Instances, start time, end time, and cluster usage hours. They can also generate reports based on a given date range. The Status tab exposes latency-related statistics: command latency distribution and command statuses, such as how many jobs are in progress, how many failed, how many succeeded, how many scheduled jobs are in progress, and so forth.
Another interesting view there is the leaderboard, which lists total commands and scheduled jobs run by individual users. Pretty handy for admins to keep an eye on. This is a detailed graph of cluster usage. Basically, it shows how the system has auto-scaled and chosen On-Demand versus Spot Instances. This is all a result of auto-scaling. There’s also built-in integration with Ganglia for monitoring clusters.
The admins can view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. They can also view Hadoop metrics for each node instance. There are some other integrations that I’m not going to spend too much time on in this presentation. You can also integrate with Datadog, and we have ODBC drivers for Tableau. That was a very brief overview of Qubole’s web interface and how it caters to the members of your data teams.
Now, as I mentioned earlier, Qubole, or QDS, takes full advantage of the unlimited scalability of the cloud. Its built-in auto-scaling feature automatically adds resources when computing or storage demands change, all the while keeping the number of cluster nodes at the minimum needed to meet the processing needs of your workloads.
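As a rough illustration of keeping a cluster at the minimum size that still meets demand, the Python sketch below derives a target node count from the pending work and clamps it to the configured minimum and maximum node counts. The sizing rule (pending tasks divided by an assumed per-node task capacity) is a simplification chosen for the example and is not Qubole's actual auto-scaling algorithm.

```python
import math


def target_cluster_size(
    pending_tasks: int,
    tasks_per_node: int,
    min_nodes: int,
    max_nodes: int,
) -> int:
    """Smallest node count that covers the workload, within [min_nodes, max_nodes]."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # Hypothetical cluster: 2 to 10 nodes, each able to run 8 tasks at once.
    for pending in (0, 40, 41, 1000):
        size = target_cluster_size(pending, tasks_per_node=8, min_nodes=2, max_nodes=10)
        print(f"{pending} pending tasks: {size} nodes")
```

The clamp is what ties the policy to the admin settings: an idle cluster shrinks to the minimum rather than zero, and a spike never grows it past the maximum.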
Just to give you an idea from our customer base: customers scale their clusters up to an average of three to four times the minimum size they have preconfigured.
47% of all compute hours are run on Spot Instances, without user intervention. Just a tidbit there. I’m going to repeat a couple of things here so we’re all caught up. Before we address the cloud advantage of using Qubole on AWS, let’s briefly review some of the attributes of on-premises cluster deployments. One of the biggest ones in my mind is that it forces compute and storage to live together, which also means they must scale together, which I don’t think is ideal.
Clusters must be provisioned for peak capacity, which can obviously be very expensive and also leads to under-utilized resources during off-peak hours. The clusters must be persistently on; otherwise, the data is not accessible. On the other hand, because storage is cheap and persistent and compute can be expensive, using the cloud we are able to decouple storage from compute. The idea is to use a persistent storage service like Amazon S3, and use compute power on demand and selectively.
By doing this, we can centralize storage and have computation that is distributed and on demand. Resources can scale elastically based on the workload, for example compute heavy versus storage heavy. As I mentioned earlier, Qubole on AWS takes advantage of the unlimited compute capacity in the cloud using its built-in auto-scaling feature.
What you’re seeing here is a production usage report that we’ve pulled, where the cluster started to downscale around 3:00 PM and scaled back up at 7:00 PM when the batch jobs started to execute. Basically, Qubole dynamically matches the size of the cluster to the workload and automatically adds resources when computing or storage demands increase. As a side note, 80% of all clusters are turned off automatically by Qubole versus manually by a data admin. This is just a tidbit from our existing customer base. In addition to unlimited compute capacity, Qubole on AWS supports 40+ instance types spanning needs such as storage optimized, memory optimized, and compute optimized. It also easily integrates with AWS Reserved Instances. All you have to do is select them in the cluster config page. As a side note, we’ve seen in our current customer base that 37 different instance types have been used by our customers.
Before we review Qubole’s on-premises-to-cloud migration service in partnership with WANdisco, let’s quickly look at some cloud migration use cases that are pretty common these days. The first one is where you have a maxed-out on-prem cluster that just cannot handle the workloads anymore. Some of the requirements for these kinds of deployments could be that data must be in sync only during the migration process, and that you must decommission the workload from on-prem after migration.
All the analytics, jobs and processes then run against the migrated deployment rather than on-prem, with 24/7 data replication and zero production downtime. One of the ways you can handle that is by moving the data to the cloud, and then moving the applications and data pipelines to QDS. The second use case might be a workload with spikes that cannot be processed on-prem. Good examples are holidays, yearly sales, et cetera.
The requirements are similar, in that there shouldn’t be any downtime, and there should be 24/7 data replication with no data loss. An additional requirement here might be that the results should be brought back on-prem, which is why there could be a requirement for bidirectional replication. The solution in this case would be to sync on-prem data with the cloud, process workloads in QDS in the cloud, and then send results back to on-prem.
Then the third one is where you have a single shared cluster deployed for production, and it’s just not able to handle the use cases because it’s reaching its limits. Requirements here, again: there shouldn’t be any downtime, and there should be periodic replication with no data loss. In terms of a solution, what can be done is to free up on-prem resources for production workloads. For that, you can move a subset of the data to the cloud, and build applications and data pipelines in QDS that target different clusters from the unified interface.
To address these types of use cases, Qubole’s Hadoop migration service is offered in partnership with WANdisco. It’s basically a software-as-a-service solution that allows companies to transparently migrate workloads to the cloud and start using Qubole immediately. Some of the benefits are elasticity, lower cost, agility, simplified management, no production impact, and data consistency between on-prem and cloud, and you can also use your existing data pipelines, ETL applications and so forth.
You can definitely learn more about this offering on Qubole.com. Moving on to our next section. In this section we’ll briefly review MediaMath’s journey to the cloud, basically QDS on AWS. For those listeners who are not familiar with MediaMath, it’s a leading global digital media buying platform based in New York City. It develops and sells tools for digital marketing managers under its main brand, TerminalOne.
It allows marketing managers to plan, execute, optimize and analyze marketing programs. The analytics and insights team at MediaMath is responsible for delivering decision-making infrastructure and advisory services to their clients. Basically, the team helps clients answer complex business questions using analytics that produce actionable insights. Here, what you see on the slide are their top three use cases.
The first one is segmenting audiences based on their behavior, including such topics as user pathway and multi-dimensional recency analysis. They also need to build customer profiles across thousands of first-party and third-party segments. Examples would be CRM files, demographics and so forth. The third one is simplifying their attribution insights, which show the effects of upper-funnel prospecting on lower-funnel remarketing media strategies.
To be honest, my knowledge of this domain kind of ends here, so I’d like to move on to the next slide. Their flagship product, TerminalOne, captures data that is generated when customers run digital marketing campaigns. The data amounts to a few terabytes of structured and semi-structured data a day. A few terabytes, that’s a lot of data. It consists of information on marketing plans, campaigns, impressions, clicks, conversions, revenue and so forth.
MediaMath was looking to take their existing capabilities to the next level to manage these new, innovative analytics tasks. As you can see, they have 180 billion impressions or opportunities a day. Their peak queries per second is 3+ million, and they process 3+ terabytes of compressed data. That’s a lot of data. Here are some of the challenges that they needed to find a solution for.
One was processing the raw data to segment audiences, optimize campaigns, compute revenue attribution and so forth. This is what they were facing. Transforming session log data to construct user sessions and do click-path analysis is a complex process. They wanted a solution that their analysts could easily use and get started with quickly, without having to worry about the operational management of such technical infrastructure.
They also wanted to make sure that they could reuse their data pipelines, automating the execution of each pipeline while honoring the interdependencies between pipeline activities. These pipelines are repeated, and they run the same transformations on a daily basis, week after week, without much intervention from the team once set up. They wanted to make sure that whatever platform or solution they chose, they could reuse these pipelines.
Obviously, the upfront investment and commitment is always a challenge and a factor when companies look for solutions. Basically, this is what they were looking for, according to their senior director: they needed something that was reliable and easy to learn, set up, use and put into production, without the risk and high expectations that come with committing millions of dollars in upfront investment. They decided to choose Qubole.
To summarize, many factors played a key role in their journey to the cloud with QDS on AWS. One of them was having a big data analytics solution. Within hours, they were able to import and start using a number of business-critical custom Python libraries that they had developed, matured and stabilized. Another was unlimited cloud capacity via auto-scaling.
The clusters automatically grew the number of compute nodes as they ran more queries, and scaled the cluster down as the number of queries went down. The operational efficiency was a plus, as they didn’t have to continually reach out to their engineering group, which let that group focus on the complex tasks of managing their mission-critical systems. And, obviously, they could reuse their existing data pipelines and so forth.
Qubole is being used at MediaMath in a lot of different ways by a lot of different teams. Data science uses Spark and Hive. The product team uses Hive, analysts use Smart Query, and engineering uses Spark and Hive. In the interest of time, since I also want to make sure we have enough time for Q&A, I’m just going to move on. This slide pretty much describes itself. With that, I’m going to hand it over to Rahul.
Rahul: Thank you, Dharmesh. We learned a lot about Qubole and what you can do with big data using the cloud. With that, I’ll open it up for Q&A. You can type your question in the panel. As an attendee you should see a panel to type your questions, and we will read out some of the questions as time permits.
All right, the first question: Dharmesh talked about the idea of being able to separate compute from storage. What are your thoughts on the performance of storing data on S3 versus HDFS, which is the file system storage that comes with Hadoop? As some customers have demonstrated, for example Netflix, which stores petabytes of data on S3, they have been able to scale the system much better than they otherwise could with Hadoop as the storage.
What helps with storing data on S3 is that you can infinitely scale your compute instances rather than trying to put all of your data onto a single node and scaling that node vertically. Scaling your workload horizontally works well with data sets in S3, because you can have not just a single cluster talking to S3, but multiple clusters talking to S3 at the same time. What are your thoughts on that, Dharmesh?
Dharmesh: Yes, definitely. If you compare S3 to HDFS directly with a simple test, HDFS will obviously be faster. But S3 scales well, with throughput growing linearly as the number of connections grows. That’s pretty much how big data engines work: the more data you need to read, the more tasks you can deploy for those kinds of jobs.
Rahul: Cool, thank you. The second question is, “Do I have to have my data in a particular format in Amazon S3 to be able to use it with Qubole?” Qubole, or any Hadoop distribution for that matter, works by connecting to S3 through something called a Hadoop Compatible File System, which means that any format you can use with HDFS you can also use with S3. I will let Dharmesh talk more about that.
Dharmesh: Qubole definitely supports all the standard Hadoop formats: Avro, ORC, Parquet. We also support custom or proprietary formats, as long as you have SerDes to read the data.
Rahul: Cool, thank you. So it seems you can store your data set in Amazon S3 in any format you want, whether it is a text format, a binary format, or a columnar format of the kind made popular recently by Parquet and ORC.
Dharmesh: That’s correct.
Rahul: The next question is, “With regard to auto down-scaling in QDS, would it have any impact on my active jobs?”
Dharmesh: No, there’s no real impact, because the shutdown is graceful: QDS makes sure that there aren’t any active jobs running, and it also keeps track of Amazon’s hour boundary. So the answer is no.
Rahul: Okay, the next question says, “Do you support heterogeneous clusters?”
Dharmesh: Yes, in fact we do. We support heterogeneous clusters, including for Spark, which means that the slave nodes comprising the cluster may be of different instance types on Amazon.
Rahul: What would be the advantage in such a scenario? Is that what you talked about in terms of a storage-heavy workload versus a compute-heavy workload?
Dharmesh: Exactly. The other use case would be if you really have a tight budget, so basically TCO.
Rahul: Okay. So you can better control your costs by matching the cluster to the workload.
Rahul: The next question says that we talked a lot about ad hoc analytics with big data, but what about production pipelines? Is there a way to schedule jobs, or invoke things programmatically, rather than just going through your UI?
Dharmesh: Yes. You can definitely create simple workflows in QDS using the scheduler interface that we talked about earlier. Outside of the web interface you can use the REST APIs, as well as SDKs in Java, Python, and R. All the detailed documentation can be found on our website. Also note that we support Airflow. If you’re already using Airflow, you can create complex pipelines with it and have those integrate with QDS.
Rahul: I see a bunch of questions about the audio dropping out, and we’re really sorry about that. We don’t want to go back to that. The next question from an attendee relates to something we talked about earlier: would you shed light on the trade-offs of decoupling storage and compute? As a result of this decoupling, with data becoming non-local, does the network become the bottleneck?
In some respects this may be true: as you decouple storage and compute, there could be situations where the network becomes a bottleneck. The right way to think about it is that, instead of letting the network become a bottleneck, you can scale out horizontally. Instead of having a 10-node cluster, for example, which gives you a limited amount of HDFS, with S3, since it scales linearly, you could probably have 100 smaller nodes, in which case the network is far less likely to become a bottleneck.
Data will still go over the network, but you won’t see much of a performance impact as such. For people who are interested in this, we can probably put up a link to the Netflix study, which showed that storing data sets in S3 instead of HDFS does not lead to more of a performance impact than maybe, let’s say, 3 or 4% on average. What do you think?
Dharmesh: Yes, I totally agree. It definitely depends on the use case. In most use cases that we’ve come across, the benefits of separating the two definitely outweigh the disadvantages.
Rahul: Okay, the next question from an attendee is whether we have a quick start guide for getting hands-on with AWS. Assuming this is about big data, yes: if you go to Aws.amazon.com/big-data, you will see a link there that says “Getting started”, and it will point you to everything you need. At the same time, you should be able to see an online poll right now; please do take a moment to answer it.
Your feedback is very important to us. All of AWS works on the premise of customer obsession, and the more feedback you give us, the better we can tailor content that helps you run your workloads on AWS effectively. Again, thank you for taking the time today to attend the webinar. We hope that you learned something from it. While you’re answering the poll, we will answer another question or two.
The next one asks, “Is the webinar recorded online? Where do I access it?” People who attended the webinar will get a link, probably within a week, to access the webinar content and the recorded audio. The next question is, “How does Qubole determine if something’s going to be processed via Spot Instance?” I think the way to read that is, “Is there a way Qubole can automatically use Spot Instances, or does a user have to tell Qubole to use Spot Instances?”
Dharmesh: Well, all of these settings are configurable through the web interface. You can choose to use Spot Instances for your autoscaling nodes, if you will. You can set it so that if Spot Instances are not available, you fall back to On-Demand, or you can also use Qubole’s Spot Instances. The other part of this question is whether there’s a way to know which processes are being run on Spot Instances. The short answer is yes.
The admin web interface that I showed earlier has a lot more detail than I could present in a slide. There are detailed reports that you can generate from it. You can see exactly which jobs are running on which instance, how much load they are processing, and so forth.
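To make the Spot-with-fallback setting described above concrete, here is a minimal sketch in Python. It is not Qubole’s actual logic or API; the function name, its parameters, and the idea of a known count of available Spot Instances are all hypothetical, chosen only to show the decision being configured.

```python
def plan_autoscaling_nodes(nodes_needed, spot_available, fallback_to_on_demand=True):
    """Split a scale-up request into (spot_nodes, on_demand_nodes).

    Hypothetical model: take as many Spot Instances as are available,
    then cover any shortfall with On-Demand only if fallback is enabled.
    """
    if nodes_needed < 0 or spot_available < 0:
        raise ValueError("node counts must be non-negative")
    spot_nodes = min(nodes_needed, spot_available)
    shortfall = nodes_needed - spot_nodes
    on_demand_nodes = shortfall if fallback_to_on_demand else 0
    return spot_nodes, on_demand_nodes


# Ten nodes requested, only six Spot Instances obtainable.
print(plan_autoscaling_nodes(10, 6))                               # fallback on
print(plan_autoscaling_nodes(10, 6, fallback_to_on_demand=False))  # fallback off
```

With fallback disabled, the cluster in this model simply scales up by fewer nodes than requested, which is the trade-off between cost and capacity that the setting controls.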
Rahul: Okay, cool, thank you. Another question asks, “Any archiving methods to move all unused data to less expensive storage?” If you’re using Qubole, you are probably storing all of your data in Amazon S3. In Amazon S3, we provide something called a policy, a lifecycle policy to be specific. Using lifecycle policies, you can say that any data that has not been used for some amount of time, that is, data older than, let’s say, 90 days, is automatically moved to Glacier, and that will help you with archival.
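As a sketch of the 90-day rule just described, the Python below builds a lifecycle configuration in the shape that boto3’s `put_bucket_lifecycle_configuration` accepts. The helper name, the rule ID, the `logs/` prefix, and the bucket name in the comment are hypothetical. Note that lifecycle transitions are driven by object age (days since creation), not by when an object was last read.

```python
def glacier_archive_config(prefix, days=90):
    """Build an S3 lifecycle configuration that transitions objects under
    `prefix` to the GLACIER storage class once they are `days` old."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": f"archive-after-{days}-days",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


config = glacier_archive_config("logs/")
print(config)

# With boto3 installed and AWS credentials configured, the rule could be
# applied to a bucket (the bucket name here is a placeholder) like this:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

The configuration is built separately from the API call so that it can be inspected or reviewed before anything is changed on a real bucket.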
You can create lifecycle policies on your S3 buckets and objects that automatically move them to Amazon Glacier for long-term storage. Within S3 itself, we also provide something called “S3 Infrequent Access”. That’s a different storage tier: you can say, “Move my data to S3 Infrequent Access”, and you will be charged a little less for storage than with S3 Standard. Okay, the next question is, “How does Qubole control the type of instance to be used?” Do we control it?
Dharmesh: No. It depends entirely on the user or data admin and on the use case. They can choose for a particular cluster to use only On-Demand, or to use a heterogeneous cluster in which you mix and match Spot and On-Demand. You can also have 100% Spot Instance clusters. It’s really up to the user or the organization’s data team.
Rahul: That was the last question. We are hitting the hour now. Again, thank you to all of the attendees for joining the webinar today. If you have a moment, please do take the time to answer the poll questions. We really appreciate your feedback. We hope that you learned something today about Qubole and AWS and how to use them with your big data workloads.