Accelerate Big Data Application Development with Concurrent and Qubole
Ali: Thank you, everyone, for joining us today. We really appreciate it. My name is Ali and I am from Qubole. We put a lot of work into these webinars. We don’t like to do really salesy webinars. We like to get our friends to join us and, hopefully, teach you something new. That’s what we’ve done this month. We’re really excited about this month’s webinar. It’s “Accelerate Big Data Application Development with Cascading and Qubole.”
Before I get into the introductions, just a couple of housekeeping notes. First of all, we love when you, guys, ask questions. We allot some time at the end of the webinar, as always, to answer your questions. If you have any questions over the course of the webinar, you should see a chat window on the bottom left of your screen. Please, feel free to ask your questions as we go along with the webinar. We’ll be keeping track of them. We’ll address as many as we can at the end of the webinar.
Keep an eye on that bottom left chat window to ask questions at any point. Also, we will be sending out an email after the webinar. We’ll be including some of the slides from the presentation. If you have any other questions or follow-up, just feel free to respond to those emails that you get after the webinar. We’ll also have a recording of the whole webinar. If you want to reference it again later or send it to somebody else, we’ll have that available as well.
All of that said, again, this is a great webinar. It generated a lot of interest this month. What we have for you today is a pretty informative webinar with a nice demo also. You could see all this stuff works together. We have two presenters for you. The first is Dhruv Kumar who is the Solutions Architect at Concurrent. He has over six years of diverse software development experience, Big Data, high-performance computing applications. Before he went to Concurrent, he worked at Terracotta as a Software Engineer.
We also have Ashish Dubey who’s the Solutions Architect over at Qubole. He’s worked in the industry for more than 10 years. Prior to Qubole, he spent several years at Microsoft as a major contributor to Windows XP SP3 development. Really talented and seasoned guys here who are ready to help you with understanding how to accelerate Big Data app development. Without any further ado, I’m going to pass it off to Dhruv Kumar.
Dhruv Kumar: Thank you, Ali, for that great introduction. Hi, guys. Welcome to this webinar. In this webinar, I’ll be taking you through the Cascading portion, how you can use Cascading to accelerate your Big Data application development. Now, for some of you who may not know, Cascading is provided by Concurrent which is a company founded in 2008, headquartered in San Francisco. Cascading was developed initially by Chris Wensel who had two key observations about the Big Data landscape. Number one, it is very hard to find skilled MapReduce Hadoop developers which can write Big Data applications. There are a lot of Java developers out there but not many MapReduce app developers.
The second key observation is that while Hadoop may be the interesting choice of Big Data app development now, it is being used everywhere, tomorrow something else may come up. The investment which you make in your Big Data apps should be future-proof. They should be agnostic of the underlying technology platform on which they are supposed to run. Hence, Cascading was constructed in a way to solve both these problems. The other key product from Concurrent is Driven, which is an application performance monitoring tool. Because Cascading is so simple to use, it is used by a lot of enterprises. 7,000 deployments are there.
One of the key examples of Cascading in production is Twitter, which uses a layer of Scala on top of Cascading. Because Cascading is a Java API, it is very easy to write other JPM-based languages on top of it. For instance, if you put Scala on top of Cascading, it becomes Scalding which Twitter uses internally to add quality user experience and revenue application. Scalding is really awesome because it allows Twitter engineers to program in their language of choice which is Scala. As you know, Twitter uses a lot of Scala internally for their development work. In a few lines of code, you can create these very very expressive workflows which do things, all of it, from machine learning to ETL.
These is all possible because of the expressiveness provided by the Cascading API. Given these observations, let’s take a step back and see what does an enterprise in today’s world need from a framework which allows it to do Big Data app development. You see, it is very easy to get started with ad hoc script and analysis. As the complexity of your app grows, you quickly find yourself in trouble because now, the tool which you started with doesn’t give you the degrees of freedom to solve more complex problems. Here, Cascading shines, because it makes easy things very easy to use, simple. As the complexity of your app grows, it gives you the tools to tackle those problems at scale.
Not only that, it is very easy to swap out the underlying execution framework in Cascading. Today, you’re running on Hadoop but tomorrow, if your affiliate requirements change, you may find yourself swapping the underlying execution framework out. It is very easy, once you write in Cascading, to go to a different platform. Further, you need a tool which allows you to look into the execution blocks of your application, see how they are performing, what is the runtime characteristic of those logical blocks, so that you can meet your SLAs and also improve on your operational characteristics. Let’s quickly dive into a “Hello World” example for Big Data which is a favorite word count application.
This slide shows you how easy it is to write a simple application in Cascading. In the center of this slide, you see the processing block. That is where the business logic of the word count is executed. You see over here, I am not concerned with MapReduce at all. What I’m doing is thinking in terms of metaphors which I have learned while initially learning how to program, which is Unix pipeline. If you even look at the book Hadoop: The Definitive Guide, the book opens up with similar constructs like grab, filter, sort, et cetera. These are the type of constructs which you use in Cascading as well. For instance, over here, we are taking a line we are splitting up using a regex parser, which comes in-built with Cascading.
Then, we are passing it to a pipe to group the tokens. Then, we are putting an aggregator on it to just count them. You see over here, we are thinking that’s abstracted away from the low level MapReduce key value pair type of stuff. Further, if you see in this slide, in the integration aspect, we are creating certain Taps. Taps, in Cascading parlance, are the tools which allow you to talk to data sources and data things. Taps are what allow your application to talk to the sources of information and things of information. Because you have a reusable Tap, you can use it to connect to a different flow later on.
On top, you have the configuration where you have configured this application run on Hadoop. It is very easy to change this line of code and say, “No, I’m not going to run it on Hadoop. I’m going to test it locally and see how it performs.” Okay, tomorrow, some other framework might come in. If I have the flow connector for it, I can just simply change this line of code and the code will execute on that platform unchanged. Lastly, the scheduling aspect. This is a very interesting aspect because we see that a flow has been connected using the Hadoop flow planner. But this flow is not executed unless you ask it to.
You can create a reusable library of flow and pass it around in your enterprise and have other developers use it. The key takeaways are that you are not thinking in terms of MapReduce and Cascading. You are using the familiar metaphors which you have learned during your initial data engineering with Tap. Your flows are reusable. You can have a library of reusable components leveraging the best software engineering practices. As I was saying in the previous slide, you’re not thinking in terms of MapReduce. Cascading models the entire world as a stream of Topos which are Scala values coming out of a data source.
You are then layering on filters, functions, aggregators, buffers, these reusable logic blocks on it, which transform the data according to your business needs. Cascading, because it was discovered in 2008 and has been deployed at so many enterprises, very use case driven. We have a large library of components for you to use. If your enterprise requires a certain function which is not in Cascading, it is very easy to write your own. There is only one method to override, which is Operate, and it is very simple to roll your own operators and functions. So far, we have talked about how it is easy to develop applications in Cascading. But one must keep in mind that applications do not just run on Hadoop.
They also need to talk to different data sources and data things. Because Cascading is open source, people have been writing data connectors which we call Taps in Cascading world. There’s a website cascading.org/extensions which I encourage you to visit. It shows you all the different data connectors which are being built by Concurrent and by the community. Can’t you see that Cascading has become the de facto standard for Big Data application development? It allows you to build apps which are scale-free because we have so much experience in figuring out what patterns work and where they work in which cases they are best. We allow you to write apps which are scale-free. You don’t have to worry about the nitty gritties of joints. We give you the logic blocks which you can use to construct your joints.
You can easily leverage your existing Java developers without any Big Data experience. You can have them write a Cascading app within a few hours of training. Because Cascading is a Java API, it leverages years and years of software engineering best practices and test-driven development practices. It comes inbuilt with checks, taps, assertions checkpoints et cetera, which allow you to develop apps which are reusable testable and robust. You’re not going to get a call at 3:00 AM in the morning because you were doing your data analysis using ad hoc scripts, using Cascading. Sorry about this.
There’s just a lot of downloads going on in Cascading but let’s talk about Scalding which is a layer of Scala or Cascading which Twitter uses. Now, because Cascading is Java API, you can actually provide your own DSL on top of it. A couple of them are in existence right now. One is called Scalading catalog. Scalding is a language binding into Cascading for Scala. It is excellent for writing crisp mathematical constructs for doing matrix math. Most of the revenue-producing apps inside Twitter use Scalding under covers and so does eBay. As I was saying earlier, not only does Cascading allow you to develop apps easily, it also allows you to integrate to different data sources.
One of the key products around Cascading is called Lingual. Lingual is a SQL 92 layer on top of Cascading. It’s an extension to Cascading which allows you to copy paste your ETL workflows into Cascading. You don’t have to give huge investment in migrating your workload from a former ETL-based land to Hadoop. You can also use Lingual to connect to any data source which is JDBC compliant. Here you have a powerful framework which not only allows you to accelerate data development, but it also allows you to integrate to other disparate data resources. Data App Development is fine, but once the app is up and running you also want to monitor it because in enterprises SLAs are everything.
You cannot improve what you cannot measure. You need a tool which can give you insight on how your application is performing at one time. For this we have a product called Driven. Driven allows you to look into how your application is constructed, what the DAD looks like and how each of those constructs using the MapReduce constructs in Hadoop they execute certain Cascading application blocks. Driven allows you to see how your Cascading app got mapped to a different map reducer, how much time each one took. Where you can operationalize your apps, where should you spend time in making your app factor. We’ll come to that in a little bit.
In summary one can build really robust data apps right the first time with Cascading. It allows you to intuitively think in metaphors which you have learnt during your basic programming classes. Scalding is an extension to Cascading using Scala, which allows you to develop crisp programming construct for matrix manipulations. Driven is an application visualization product, which provides rich insight into how your application is executing at runtime so you can further bolster your application, debug it, et cetera. With that, I will pass on the controls to Ashish, which will take you through the Qubole part of the webinar.
Ashish: Thank you, Dhruv, thanks for my solo view of Cascading which solves the problems of programming aspect of MapReduce and getting away from the complexity. Thank you for that. Here I’m going to take over and I’ll be talking about Qubole. Really quick, my name is Ashish Tube and I am a solutions architect at Qubole. Today, I’m going to talk about the other aspect, other pain point where we deal with the infrastructure side whenever any enterprise is in the plans for setting up the larger infrastructure, there are a lot of dilemmas, lot of problems.
We provide Big Data as a service and I’ll be talking about how it helps and why it is really helpful and is a very strong alternative for thinking about other traditional way of setting things up on premise. Today, I’ll be touching upon the product, Qubole, and features architecture of Qubole, and then we will walk through a quick demo. I’ll also try to show a real sample of Cascading and how can somebody run Cascading app at Qubole. A very quick overview of Qubole. Qubole is a company which was founded in 2011 by some of the pioneers of Big Data industry. Ashish Thusoo and Joydeep Sen Sarma, they are the founders.
They handled the entire infrastructure at Facebook and also, they were co-creator of Hive Project and Hive is another SQL-based abstract layer on top of MapReduce which provides an abstraction at the SQL level. We are funded by Lightspeed Ventures and Charles River and we are based in Mountain View as well as Bangalore, India. Really quick, some service stats. We are in the production since late 2012 and currently, we are the scale of around 250,000 nodes every month. We are processing close to 30 petabyte data every month, which is growing rapidly. It was like five petabytes in January and now you can see the scale how it grew.
We have the deployment in the range of 10 nodes to 2,000 node clusters so you can imagine like it’s pretty much on the upper side of the industry where we deal with 2,000 node clusters level, so that kind of service stats. Now I’m going to talk about the goal of the company. That’s really important and that can answer some of your questions, what Qubole is and what services we provide. Our goal is to provide 100% managed services. Let me take a step back and talk about what happens when an enterprise tries to set up Big Data infrastructure. There’s a lot of questions, lot of discussions about, what should be the cluster configuration? What should be the type of servers you are going to procure?
What is the right choice of machine type? That’s a very interesting part and that’s where a lot of enterprises make mistakes and that’s a history of some failures in the Big Data project. We are, basically, solving that problem by providing a 100% managed service where you don’t really have to think about that aspect. You just have to sign in and you’re ready with your Big Data analytics. That’s one goal. Second is mostly in the industry, Big Data services are actually consumed by analysts or Big Data scientists. They do not like to deal with day-to-day problems of Hadoop systems and very internals of the system where you have to log in and see what processes are running, what is not running.
We are solving the problem of the backend and providing a very nice UI on top so that data analysts and data scientists they just have to deal with the UI. Pretty intuitive UI, they can just log in and run their queries. They don’t have to think about clusters, what clusters are running? Where is my cluster running? What if my job track arrived? What if some component of my Hadoop cluster dies? We take care of everything automatically and replace things, do all the fault recovery and stuff like that. The third one is the data SWAT team.
We provide a very great support from some of the experts in the industry. We have a very great experience and maturity where people come with more than a decade’s experience. They have contributed with lot of bigger systems like Vertica or Green Plum. We have those folks who are supporting our services so that’s another goal and we are committed to that. Very popular question by Big Data as a service, why shouldn’t I think about other ways that– I can go with setting up my own clusters. I partly answered that question in my first slide but there are few key points. The time to deployment.
Obviously, if you think about setting up a cluster of say just 10 nodes, it takes time. Having experience of that part I can say it’s not something you can do within few days. It might take a week or two weeks because there could be a lot of issues with integration. To make that system run perfectly, it takes time. That’s one thing that such kind of service that Qubole can help you set up your clusters within few minutes and you are ready with your analytics. The second part of this is ease of use. Basically, using such services you can simply log into the UI which I was describing and you can just start running your Hive queries or MapReduce jobs, Cascading jobs.
You can run your machine learning using [Apache] Mahout. You can pretty much leverage the entire ecosystem of Hadoop like Oozie or Scoop any component you think about. It’s pretty pluggable within your cloud environment and you’re ready within minutes. Then next is cost related. It reduces the risk. When you think about setting these things constantly it is associated with a huge cost and if you make the wrong decision and when you have to revert back a plant, it’s loss of money.
As in the Big Data as a service model you just pay as you go and whenever you think that this is not a compatible configuration for us, you can quickly change it within minutes and you are ready with another set of clusters and you’re not paying for anything which you did before. That’s another aspect and that’s a very key feature. That’s why the Big Data as a service is being popular and it’s helping industry in a big way. The last one is support. Obviously, then you deploy things on premise, the support is your responsibility. Obviously, some of the vendors they provide support but that’s also cost associated. You have to pay for the support and things like that. At the service model, you are given the support by default and that’s another virtue of this model.
Some of the core features and what distinguishes Qubole from other competitors, other vendors, in the same space, the first one which is really really important and is a unique feature of Qubole is auto-scaling. That’s true auto-scaling. I’ll talk about a little bit how it is, why did I say true auto-scaling. Basically, when you run a job and your team runs multiple jobs, you don’t have to really think about how many machines I should be spending or sparing on this cluster. You just simply go on and in the beginning of your configuration, you just set a range of machines which you can according to your budget you decide.
That range can be like five nodes to 200 nodes and then we take care of everything. Let’s say you’re running a very high load and we figure that out on the fly and based on that calculation, we figure out that you require 50 nodes at this time. We basically expand the cluster and it goes up to 50 nodes. Imagine if some other team members have put some more load, it can automatically go up to 80 or up to your maximum limit be it like 200 or 2,000. That’s where it basically provides a true auto-scaling feature. Another side is we also take care of the scale-down feature.
When your load subsides and your team is almost done with their queries, what we do we automatically figure that out and we release those extra nodes so that you can save costs on the compute side. That’s why it is true auto-scaling. The second one is the fastest, Hadoop running on the cloud. As you see the background of the company, how it was founded and who found that, these guys come from a very rigorous experience of Big Data. They did a lot of optimizations keeping in mind that this is going to run on the cloud storage. Because, fundamentally, Hadoop was built for HDFS layout. Those optimizations make our system very fast and it’s the fastest Hadoop in the cloud so far.
We also provide pre-built connectors which helps you connect to external data sources, your relational databases, and you can pull that data on a common ground which is your cloud storage. Then you can do all the analytics using tools like Cascading, Hive, Pig or anything. We also provide the job scheduler. You don’t have to worry about running the jobs manually if there is a clean ETL flow you simply set up those scheduler jobs and free you time, taking the pain off running those things manually. Some additional features, you basically consolidate all workloads into a single cluster.
However, we support multiple cluster notion, also you can create any number of clusters within your account. But our primary goal is to provide a single point of execution where you can expand your cluster to any limit. That helps in terms of utilization because it’s all better in terms of utilization if you run 10 clusters of 10 nodes compared to 100 node clusters when a model gives you much more performance and it turns out to be a better choice. We provide the mix and match reserved and spot instances model so that you can leverage the spot instances of clouds like AWS and you can save costs on that fact. Also, we provide the UI too.
Explore the data and analyze the data, do some sampling. We also provide the ODBC data connectors so that you can pretty much connect Qubole layer from the BI tools like Tableau or any other tools like MicroStrategy, Pentaho, MSX or anything. All these functionalities are exposed through API. We can pretty much integrate that part in your programs and you can even call those APIs and leverage the APIs. This is a very consolidated architecture picture and you see at the center core, you see the Qubole layer which provides the auto-scaling feature, job scheduling functionality in both flows which makes your ETL pipeline dynamic instead of fixed synchronous pipelines.
Also, the external entities like BI to inform them on site, we provide the UI. It’s a summary of what I just talked about. We support the data connectors with external data sources that you can see on the side. That’s our overall summary of our entire conversation here. The last, not the least, I’ll talk about a little bit about the architecture. We have made sure that our architecture’s pluggable so that any point in time if there is a new tool coming in, you can pretty much integrate with any tool or if you want to install some libraries on your Hadoop cluster service, we do not do by default. You can do that.
We provide all these options so that you can manage the cluster from your side as well if you want to do some third-party installations and stuff like that. We also provide the ways to integrate with external tools like Cascading, Mahout, Hive, or anything. It’s about the entire ecosystem. The Shell command interface which is one of the options on the UI, I’ll cover that, but that helps you run any commands on your cloud service which enables the capability to run any type of integration or any type of installation on your cluster. That gives the flexibility of integrating things. With that said, I’m going to quickly switch and share my desktop and show a very little demo.
Really quick I’ll be sharing my desktop here. Hope you can see my screen. That’s our UI which I was talking about. You simply log in here and you land on a page where you will see the history of all my queries which I run and on the left side, you can see all the jobs which I ran just now. On the right side of the window, you see the option of that’s a composer and you see the multiple options like you can run Hive query. Now, you can also run the Presto queries. Presto is an in-memory engine which we support on the same box. You don’t have to go anywhere else. You just run the Presto query and figure out whether that works out well or not.
Presto is not a solution for everything but it works for certain queries. Compared to Hive provide 5x to 10x performance boost. Even on large queries you can pretty much run the Shell command which I was talking about and do whatever operations you want to do. Also, the MapReduce jobs and I’ll show how to run a Cascading job. Also, we provide the data export and import capability which I just talked about that you can pretty much integrate from your external data sources and you can do data export. You can run your analytics on the Hive side and populate the data sets into your relational database and vice versa. You can do data import as well.
There is an option to run some DB query as well. For example, if you have an external data source say MySQL or Vertica and you want to just do some data discovery query. You can select the DB query and you can run some kind of count * or any type of query which will give you an idea of the data. The very nice thing is the workflow which I talked about in the context of scheduling and the ETL pipeline setup. What you can do is you can set up a workflow and workflow is nothing but a heterogeneous flow of the jobs which you can even tie any number of jobs of different types. Like first job is Hive second job is Pig and things like that. You can run those ETL things.
You can also schedule on this window, I’m showing, where you can set up the scheduled jobs and stuff like that which will run on a certain period. I’m not going to stress on the that. We have those webinars every alternate week, so probably you can join that later. Right now I’m going to show a really quick example of Cascading app. This is my word count app showing a bit of Cascading code and this is going to just give some word count from the Avro files. I’ve provided the job location. It runs a Hadoop job, so as you can see on top I selected a Hadoop job and this is my input and that’s my class name which I wrote for doing the word count and this is my output directories. That’s where it’s going to load the output.
I just need to press the button run. I don’t need to really look at whether my cluster is up or not. Whereas you can actually see the cluster status from the top drop down and you see the green indicator which basically tells that my cluster is up, but imagine if it was not running, this query it will automatically bring it up so you don’t have to manually do that and in the real time you can see all the logs underneath. Also with the Cascading, there’s a nice thing called DRIVEN which Dhruv will be talking about very soon. What it provides is a very nice picture of your MapReduce job in a form of traditional ETL jobs.
Here on the Cascading window, you see my word count is interpreted as there’s a green button which is a source and there is a grouping component and there is an aggregator max, and then there is a thing which basically dumps the data into our final location. You can see all the steps which Dhruv will be talking in detail but you can see reads and stuff like that.
You can also see the MapReduce picture if you know about that. However, the Cascading approach is enough to remove that- and provide the abstraction at the level of just generic construct site. I’ll let Dhruv cover that part later but you can also look at the MapReduce file like job tracker URL. On my job URL, you see there is a job tracker URL really quick. Anyways, that’s pretty much it from the demo side and I’ll hand it over to Dhruv now and he can probably go over the detailed DRIVEN demo from here. Let me draw back.
Dhruv Kumar: Hey, thanks a lot, Ashish. Thanks for showing that beautiful word count app how it runs on auto-scaling in Qubole cluster. Guys, I’m going to quickly take over from Ashish now and show you how DRIVEN looks like in a non-trivial app. Everyone likes word count but let’s see how it performs on a bigger scale. Over here, I have the DRIVEN UI which is showing a fairly complex app which I ran locally on my laptop couple of nights ago.
As we were saying earlier when you run your Hadoop job, basically the job which shows a Cascading app you get a link in the logs which if you go to that URL it will take you to the DRIVEN server location and you can see this dag which is constructed. This dag was not authored by hand. This is not a sleight of hand. This was constructed by Cascading when it was running on Hadoop. In this case, it was running locally on my box.
As you see this complex app is constructed out of different flows. Now, these flows from left to right are filter store, filter custom demographics et cetera, et cetera. Now, there are very, very interesting things on this UI. For instance, as you can see I can go ahead and click each source and see what type of data was I collecting from it. Furthermore, I can go down and see what type of logic block I connected it with.
Imagine you’re a Java developer, you’re writing your Cascading app but you want to quickly visualize how it’s going to look like whether you are making any mistakes or not. You can do your app development, you can quickly run it in local mode and it is going to give you a link which you can go to and visualize it. This becomes very powerful for logical analysis when you’re creating your application.
Now, this is the end product. This is after the whole application was run on Hadoop. Now over here we see that it also shows me how much time each of these logical blocks took for execution. For instance, the filter store logical block took one minute and 23 seconds. Not only that, it also shows me how much time did the filter store block took in the submitted portion. When you submit this job to the Hadoop cluster the jobs are waiting in line for their turn.
You can see that the filter store operational block, logically speaking, spent a bunch of its time just because it was waiting for its turn to get executed. This was because my laptop is small I can only run that many MapReduce tasks at the same time but if you were to run in an actual cluster it’ll be faster. Further, if you can drill down on each of these logic blocks and we can see how it gets mapped to an actual MapReduce job. For instance, this is a straight path through operation so this only requires us to have a mapper.
You can see that this particular flow got mapped to only one mapper. Furthermore it also shows me how many slices this job took. Now slices in Cascading parlance or DRIVEN parlance means the MapReduce tasks running in parallel. Over here there were two mappers which are running parallel so there are two slices. I can go in and look at what run time was associated with each. Because it’s a fairly even distribution I’m not too worried. I’m saying that, “Okay, well, both of these tasks took the same amount of time so there’s not much room for improvement here. If I parallelize it maybe it’ll run faster.”
Now there are some default counters which it shows the run time, the read time but you can easily add more counters to it. For instance, you can go inside and you can see, “Okay, well, because I was using a file output format I can see the bytes written. If I click on that I now get a different column with Bytes Written. Not only that I can push down my Hadoop custom counters into the UI as well, because all the Cascading app is doing at run time is sending out metadata to the back-end DRIVEN server.
Now, this becomes very powerful when you talk about application monitoring at run time. Because you know how much time each particular logic block is taking you know where to go inside and optimize your job. In fact in one of our customer’s cases they came to us saying that, “Okay, well, can you help us optimize this job?” and we were like, “Okay, well, let’s pull up DRIVEN.” When we looked at DRIVEN it became very clear that one of the filtration processes which they were doing right at the end, it could be moved out to the front without any loss in the business logic.
This is a very quick overview of what DRIVEN gives you and you can use Cascading without DRIVEN but DRIVEN is free to download and you can test drive it, you can see how it performs and then you can in touch with us if you need more support. That’s a quick DRIVEN demo from my side, Ashish.
Ashish: Thanks for those wonderful. Now we’re opening up for the questions and answers. I see a bunch of questions in the pipeline. Probably Ali can drive that session and we can take up questions one after another.
Ali over to you.
Ali: Great, thank you, guys. We have a couple of questions in and some more coming in as we speak but let’s kick it off with the first question for Dhruv. Somebody asks, Cascading seems similar to Pig/Hive. Could you please compare Cascading to Pig/Hive and, for example, what do I get from both Cascading and Pig and Hive? What does Cascading give me that Pig and Hive does not?
Dhruv: That’s a great question and that’s a very pertinent question as well. Pig is a scripting language and so is Hive. They are both very useful for ad-hoc analysis of data. Pig, Hive and Cascading all three abstract your thinking away from MapReduce. However, for production grade applications that produce data product, application logic that is built on stringing together a sequence of script is very, very bad. For instance, it is not possible to develop and debug real-world application.
Now Pig and Hive are scripting languages, they are tough to deploy and even tougher to monitor and tune as the applications mature. Now, when it comes to app development both Pig and Hive have shortcomings when dealing with complexity and testing. The Pig and Hive easy problems can be easily solved. You can start your app development and it’s easy to start at first but as the complexity grows it becomes really, really hard to monitor, debug, and diagnose your app. I think that’s a short comparison between Pig, Cascading, and Hive.
Ali: Great, moving on, a question for both of you, I suppose. What version of Cascading does the Qubole cluster use, and is DRIVEN dependent on certain Cascading versions?
Ashish: Sure, I’ll take the first part then I’ll let Dhruv answer the second one. As I mentioned while I was showing the slide dag, we have a very applicable architecture. It’s sticky in nature. What that means is we support any version of Cascading. Personally I have run my programs using Cascading 2.0 as well as 2.5.5. That’s pretty open, very flexible, you can compile those things and we are compatible so we run those compatibility suites to make sure that we are 100% compatible. The second part.
Dhruv: VGA DRIVEN last month and that runs beautifully on 2.5 but we have backward compatibility with Cascading 2.1.
Ali: Someone asks, generally, how do you support OpenStack?
Ashish: I’ll take that. The Qubole so far we are on production with AWS Cloud as well as Google Cloud. We have done the piloting work with OpenStack but we do not have any customer on OpenStack. We have a flexible architecture internally we have designed so we kept that in mind and we made sure that we roll support OpenStack whenever it counts. Also, we have other plans in our roadmap. Sooner you might hear from us that like Windows Azure or some other clouds, we are still evaluating, and we’ll be rolling out the support for other clouds as well. I hope that answers.
Ali: This next question back to Dhruv. How does one use Driven to identify specifically what may be be causing a bottleneck in a workflow?
Dhruv: That’s an excellent question. When I showed you the Driven UI, I showed you how the logical DAD gets mapped or gets translated into MapReduce. We call that the runtime view. You can drill down into that portion of Driven UI, and you can easily see which logic of your application is executing on which map reducer and furthermore how much time it is taking. Using that, you can easily see, “This logic block is taking a long time, but why is it taking a long time? Is it because I don’t have enough resources in my cluster or is it because I’m doing a data manipulation which is probably not needed here?” That way you can easily optimize your job using Driven and you can also further improve on your SLA needs.
Ali: Back to Ashish here. Someone asks how does auto-scaling work with Qubole?
Ashish: That’s a wonderful question and I’d love to answer that question. Thanks for asking. Auto-scaling is a very unique feature and what happens when you run any sort of job, be it Cascading or Hive or Pig or anything. It actually gets interpreted into MapReduce program. It gets submitted to the cluster and then it gets split into multiple mappers. Depending upon the data size it could be like hundreds of mappers, thousands of mappers, tens of mappers. What we do in the back end, Qubole engine keeps an eye, keeps monitoring the progress of each mapper and we also see how many machines do you have in your clusters.
Every machine, every node has the capability to run a fixed number of mappers depending on what is the slot setting. Depending upon that, we see that how many mappers are in the queue and how many are running, how they are progressing. Based on all these, we do some mathematical calculations. We have some algorithms and figure that out, that now this is not sufficient, this is not going to deliver your job results within the time which you expect. That’s the point, that’s where we basically get that number that, you need extra number of nodes like 40 or 50 or whatsoever.
Then we also keep check at what is your maximum setting which you have done in the beginning. We never go beyond that. For example, the number which we evaluated and said the 40 nodes it is within your maximum number of nodes we will spin up those nodes on the fly. After a few minutes, you will see that your cluster size has gone up and now your mappers are running much more faster. There’s no queue, queue is really clearing very fast. That’s how we deal with auto-scaling and same thing happens when your cluster is going on the idle side. It’s ordering so many mappers.
So, what we do? We keep checking the same thing again and again that there are engines running in the back end. We do the same thing, we figure out how many nodes are sitting idle. If they are pretty much have no dependency on the data sitting on their HDFS layer, we start tearing down at those machines as well. That’s how the scaling up and scaling down actually happens in the back end. I hope that answers the question. Do we have more questions?
Ali: Hey guys can you hear me?
Ali: Audio difficulty here, I apologize. We just had a question come in. Is there API or command line interface to programmatically construct and launch a Cascading job periodically like we can with the AWS CLI? That’ll be for Dhruv.
Dhruv: Cascading is a Java API. You’re going to write your application using Java programming language. You’re going to construct it in the IDE, you’re going to test drive it locally to see everything works, see it in Driven. Then when you are ready, you can use any external tool to push it out to the CLI or AWS or in case of Qubole, I think you might be able to answer that?
Ashish: As I mentioned on one of my slides, we do provide the API support, using that API. Since Cascading job and I showed it ran as a Hadoop job. We do support those jobs to be kicked off from API, so you can pretty much use Cascading jobs and you can kick off those jobs. The answer is yes, using Qubole you can do that. Since we support scheduling as well, so you can put those schedulers on, and with some certain periodicity and that’s how you can run those on a schedule.
Ali: You can call on the Qubole API and have the job submitted to your cluster.
Ashish: Absolutely, and we have the RFDK public at the Python forum, so you can pretty much download that. Send the question. We can provide the support how to do that.
Ali: Qubole question here for Ashish. It asks, we have data sitting on our on-premise cluster, can I use Qubole to process that data?
Ashish: The answer is yes and no. The good thing is that there is a yes. Basically, when you deploy and you run Qubole the prerequisite is that you have to have your data on the cloud storage. In case of AWS, it should be S3, in case of Google Cloud, it should be Google storage. The answer why it is yes, let’s assume you have data sitting on your on-premise TFS clusters. You can pretty much use Hadoop just CD command or there are the other tools available. You can transfer the data and put it on the S3 and that’s how you’re ready to start running Qubole against that data. That’s the way many of our customers when they’re on-board they did the same exercise and that’s how they started.
Ali: Another question for Dhruv here. Where would one use Hive relative to Cascading?
Dhruv: I would say that just to start out with your data analysis, if you feel comfortable with just a command-line interface and basic scripting Hive is all right but as soon as you get past that initial stage, then you want to operationalize your data apps. You want to have the comfort that you have created a robust app which can run at scale, you move to Cascading. I would in fact go on to say that with Lingual which allows you to take your existing ETL workload and migrate to Cascading you should just start off with Lingual for your exploratory data analysis.
Ali: One final question here. I believe this one is directed towards Ashish again. It asks, how do I compare Presto with Impala or Spark?
Ashish: That’s a very interesting question, it’s a very difficult question though. Let me give you a little background. Presto, Impala, Spark they’re trying to solve the same problem actually. They come from different companies, different development efforts in the back end. They’re trying to solve the same problem. We adopted Presto because it comes out of Facebook and it is in production. They are running it on thousands of nodes every day. That’s where we believe that it is tried and tested. The model is in memory processing. What it tries to do is, it is not a MapReduce model. MapReduce since it comes with the classic problem of latency, that’s what it is trying to address.
It’s not a solution for everything, it’s not a replacement for all the tools like Hive, MapReduce, Cascading or anything. It’s just another tool and you need to decide that, “This is my workload which runs really fast at Presto.” Like some of the basic discovery type of queries like you want to see some of the trends of maximum and the order by who is the most trend customer. Something like that. Those queries which you want to do for your data discovery, those are the queries run faster and very well on Presto and I believe on Spark and Impala as well. They’re trying to solve the same problem, but we do support Presto on our platform without doing any integration work and we are in the process of evaluating Spark as well.
Ali: Thank you so much for that. With that this webinar comes to a close. A very sincere thank you, again, to everyone who joined us today. As I mentioned before, look out for a follow-up email from us with a link to an on-demand version of this webinar. Please, feel free to send it out to whomever else you feel might be interested in this, and also look out for our forthcoming webinars. We do this monthly and we really appreciate your interest and joining us again. With that, have a fantastic day, folks, and we look forward to hearing from you soon.