Scalable Data Pipelines

Thank you, everyone, for joining our webinar on how to build scalable data pipelines for machine learning. We will be recording today’s webinar, and we’ll share the recording and presentation following today’s live session. If you have any questions, please submit them and we will cover them during the Q and A portion of our webinar. I want to take a moment to thank our speakers today.


Excellent. Thank you, Steve. Let’s talk about the structure of this webinar. First, we will do an overview of the Qubole platform. We will discuss the process and the engines used for building machine learning data pipelines, followed by a live demo of the Qubole platform. We will then cover the architectures used by two of our customers, essentially how they leverage Qubole, followed by coverage of why our customers choose Qubole. We will wrap it up with Q and A. Let’s take a quick look at the platform. Qubole is a managed cloud native data platform. What do we mean by that? We provide services to multiple types of personas using multiple engines such as Hive, Hadoop, Spark, and Presto, along with frameworks such as Airflow and TensorFlow. This allows the personas that you see here to be more productive and efficient. We help platform administrators. These are the guys who actually care about the infrastructure, the security, the financial governance, and providing self service to their users. 

Machine Learning

We also help data engineers by giving them the engines to build their pipelines for machine learning and business intelligence. We also help data analysts, for whom we provide interfaces to do interactive SQL analysis to gain greater insight from their data sources. We also, of course, provide data scientists with tools such as notebooks so that they can build models using Scala, Java, R, and Python, which are their preferred tools. We run on multiple clouds. Customers can choose their public cloud provider from Amazon or Microsoft. An important thing to note is that we enable one admin to support 50 to 100 users. We know of cases of one admin supporting up to 185 users as well. We all see and experience the results of those machine learning data pipelines and specifically those models that have been deployed leveraging those data pipelines. It could be something as simple as a credit card transaction that gets paused until you’re able to provide verification.

As a matter of fact, this came from my phone. It was a couple of weeks ago. I happened to be in New York City. I live in the Bay Area, and then my bank essentially paused that transaction as I was getting a ride to the airport. In order for us to experience this type of message, data scientists at our banks have gone through the process of first creating machine learning data pipelines and then training models. What we see here is the result of that model in operation. Luckily, those data scientists also work with user experience experts. I was able to respond back to that message and essentially get my ride to the airport. Those machine learning data pipelines are there. We choose sometimes not to pay too much attention to them. This is the result of one of those machine learning data models. I don’t see this every time I go and make a purchase with my credit card, which is a good thing. 

Big Data Pipelines

Okay, so in order to build those pipelines, there is a process, right? This process has been around for a while, even before people started talking about artificial intelligence and machine learning. Before that, we were talking about analytics, right? These pipelines start with the data exploration phase, then building the data pipelines. Once the pipelines have been built, data engineers orchestrate them, which is the process of scheduling and automating their operation. Finally, these pipelines deliver data sets for analytics and machine learning models. Let’s take a look at some of the potential pitfalls when starting a machine learning or data science project. One of the things we often see is that people select a platform that only supports one engine. Building a data pipeline is a lot like building a house. You will require multiple tools. The other one is not giving the data to the users who need it. Generally this happens not out of ill will.

It happens because the data teams get overrun. You want to make sure that you’re giving the users all the data that they need. The third one, which is very common, is cost overruns. People are not able to estimate the cost of building and operating those models. What are some of the key requirements? Well, they address those pitfalls, and first of all is the ability to support multiple engines. The platform that you choose should be able to support multiple big data engines. You will be able to enable more users by also choosing a platform that provides you a high user-to-admin ratio. That’s what we discussed about Qubole. It allows one admin to support 50 to 100 users, in some cases up to 185 users. Of course, your platform should natively take advantage of the scalability of the cloud and must support multiple public clouds so that you’re not locked in.

Machine Learning Pipeline Architecture

What are those engines that we’re talking about? The most commonly used engines for building these machine learning data pipelines are Hive, Spark, and Presto. Here I also have Airflow, which is not quite an engine; it’s more of a framework that allows you to orchestrate the flow of these data pipelines. We have a process; we discussed that process and had a pictorial of it. We have engines, and the question becomes, how do I put all of this together? Right? When and where should I leverage these engines? So, let’s start first by taking a look at the different engines.

Hadoop Hive

The first one is Hive. That’s the oldest one of these engines, used most commonly for batch processing and, of course, ETL processing (ETL as in extraction, transformation, and loading), and for exploring large volumes of data. Some of that data could be structured or unstructured, and Hive is used heavily because it’s inexpensive to persist large data sets using Hive. It’s important to note that Hive is distributed, scalable, fault tolerant, and flexible.

Apache Spark

Spark is a relatively newer engine. This is a general-purpose, in-memory computational engine, used mostly for batch processing and ETL. Again, it has facilities for graph processing and also for stream processing. Now, it has all of the attributes that Hive has, but it’s also fast because all of its processing is done in memory, which is important.


Now, Presto is the third engine, and it’s an open source SQL query engine. It’s been optimized to run SQL and it’s very good for interactive ad hoc queries, and also for operational queries supporting business intelligence tools such as dashboards and so on. It’s also good for exploring data sets that could be very large. It has the ability to join multiple data sources, which makes it very handy. Presto is distributed, scalable, flexible, and fast. It has limited fault tolerance, right? You may not want to run long queries using Presto. It’s better suited for interactive, well-defined queries. We have Airflow, which again, is not an engine, it’s a framework. It’s an open source workflow tool, mostly used for programmatically authoring workflows.


So you create DAGs. The DAGs are Python code, and Airflow allows you to schedule and monitor those workflows. Airflow is also distributed, scalable, and flexible. This takes us to this chart, which allows us to put it all together. For exploration of data, we can actually use all of the engines that we discussed: Hive, Spark, and Presto. The question then becomes, okay, when should I use each one of these? If you’re doing SQL on larger data sets, Hive may be very cost effective. If you need programmatic access to the data in order to explore it, consider Spark.

Sometimes you may find data formats that are hard to explore with SQL alone. For example, healthcare insurance has formats called 835 and 837, and those are hierarchical formats. It’s not easy to traverse them unless you have a programmatic tool, so you could perhaps write Python. Spark is probably the better engine in that type of instance. It can also do SQL. And then we have Presto. If you’re running interactive queries and the response time is important, Presto is probably the better tool to use. Now, for building those data pipelines: if you’re doing batch processing, you could use Hive or Spark. The closer you get to a real-time requirement, the more Spark becomes the engine that you want to use. You may get away with micro-batching with Hive, but Spark has better facilities to support you in near real time and also with streaming data. Notice that I didn’t put Presto here because it has limited fault tolerance, right?

For that reason, you may not want to build your pipeline using Presto. For orchestration, of course, you would use Airflow. This works very well for batch workflows. You can programmatically control, version, and monitor all of your workflows through Airflow. And then, to deliver the data set, again, you could use all of these engines. For larger data sets that are going to be persisted, Hive works really well. If you’re doing anything programmatic, right, maybe you want to consider Spark SQL as well. Use Presto if you’re going to deliver this data to analysts who need to perform quick, ad hoc, interactive queries; it can also perform federated queries across multiple data sources. If you want to look deeper into these engines: if you want faster response time, you’re better off sticking with Spark and Presto. Hive is very effective, but it’s a batch process. It’ll take a while for you to get your data back.


As you build those data pipelines and you need to adhere to an SLA, I would stick with Hive and Spark. Presto, not so much, although I’ve heard of customers using it. As long as they can control their queries so that they are not running for a long period of time, you could do it. Fault tolerance, again, is high with Hive and Spark, and Presto has limited fault tolerance. If the choice of language is important, then SQL works well with Hive and, of course, Presto and Spark. If you need programmatic access, I would stick to Spark. Another point to help you decide is the type of workload, right? For batch processing, stay with Hive. For larger data sets, stay with Hive. If you’re performing machine learning, graph processing, or streaming types of pipelines, then go with Spark. Presto, again, is very good for interactive queries, right? For interactive exploration as well, Spark SQL works really well.
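The rules of thumb above can be summarized as a small lookup. This is only a sketch that restates the speaker's guidance: the function name `suggest_engine` and the workload labels are hypothetical, and nothing here is a Qubole, Hive, Spark, or Presto API.

```python
def suggest_engine(workload, needs_code=False):
    """Return the engine the talk suggests for a given kind of workload.

    The workload labels are made up for this illustration.
    """
    guidance = {
        "batch_etl": "Hive",
        "large_dataset_sql": "Hive",
        "machine_learning": "Spark",
        "graph_processing": "Spark",
        "streaming": "Spark",
        "interactive_query": "Presto",
    }
    if workload not in guidance:
        raise ValueError(f"unknown workload: {workload!r}")
    # Programmatic (non-SQL) access points to Spark regardless of workload.
    if needs_code:
        return "Spark"
    return guidance[workload]


print(suggest_engine("batch_etl"))
print(suggest_engine("interactive_query"))
print(suggest_engine("batch_etl", needs_code=True))
```

A real decision would also weigh SLAs, team skills, and cost, which a lookup table like this cannot capture.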

Spark has a facility that plans for failure so that it’s resilient after a failure. That planning for failure has a cost. Presto doesn’t have that. If you just want a quick response and you’re okay rerunning the query, then Presto is probably the better tool. And this takes us to the demo.

Thanks, Jorge. I’m going to share my screen. Let me know when you see my screen. We see it. Thank you.

Scalable Data Pipelines

All right, thank you. What I’ll do is start my demo with a notebook. Again, I have embedded an infographic which depicts the architecture of this solution. We’ve heard a lot about what choices to make when you’re building your machine learning pipeline with respect to the engines, and with respect to the tools and technologies. We’re going to put them into practice in this demo. So I’m going to zoom in here.

This machine learning pipeline actually demonstrates a real-life use case: Netflix-like recommendations. The pipeline is split into three parts. In the first part of the pipeline, we are going to ingest the MovieLens data. Again, MovieLens is a publicly sourced data set that is available online. We’ll ingest that MovieLens data, extract it, transform it, and load it into a Hive table.

The first part of this pipeline will use Airflow and Hive to do these particular tasks. In the second part, we will go into the machine learning aspects of this pipeline. We have a big curated data set from MovieLens in the Hive database. We’ll take that data and put it into Spark’s distributed memory. We will run some Spark ML algorithms and train them on the given data to be able to recommend movies to all our users. If you look at the Netflix kind of use case, each user rates perhaps 20 movies, right? There are thousands and thousands of movies that the user didn’t watch. The problem that we are trying to solve here is: how do we recommend movies to the user in context? We’ll use an algorithm called collaborative filtering. We’ll apply that algorithm, train the model, validate the model, and then score the model.

By scoring, what I mean is we will run through all the users and build top ten recommendations for each of the users in our database. The third and final piece of the pipeline, which I’m not going to cover, though we can briefly talk about it at the end of the demo, is the deployment piece. Once you have your recommendations, we’re going to move those recommendations to a database that your website uses. To start with an intuition for collaborative filtering, let’s understand how the collaborative filtering algorithm works. It works by matching users. What it does is try to find a user who is close to the user in context in terms of preferences, likes, and dislikes. Basically, it uses that user’s preferences and recommends products for the user in context. The products are items that the user in context has not watched or has not viewed.
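The "matching users" intuition can be sketched in a few lines of plain Python. This is an illustration only: the ratings are made up, the similarity measure (cosine over co-rated movies) is an assumption, and it is a neighbor-based simplification rather than the matrix factorization the demo actually uses.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "ann": {"Alien": 5, "Heat": 4, "Big": 1},
    "bob": {"Alien": 5, "Heat": 5, "Up": 4},
    "cat": {"Big": 5, "Up": 2, "Heat": 1},
}


def similarity(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Recommend the closest user's movies that `user` has not rated yet."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: similarity(ratings[user], ratings[u]))
    unseen = {m: r for m, r in ratings[nearest].items() if m not in ratings[user]}
    # Highest-rated unseen movies come first.
    return nearest, sorted(unseen, key=unseen.get, reverse=True)


print(recommend("ann", ratings))
```

Here "ann" is matched with the user whose ratings on shared movies look most like hers, and that user's remaining movies become her candidates.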

That’s really the basic intuition for collaborative filtering. Coming to how it works behind the scenes, it’s based on a mathematical concept called low-rank matrix factorization. If you look at the ratings kind of data set, it’s a matrix, right? Each user has rated maybe 20 or 30 movies. There are thousands and thousands of movies that he or she has not rated. So it’s a sparse matrix. Really, the solution to this problem is that this sparse matrix is decomposed into two low-rank dense matrices. Once those dense matrices are realized, we just take the product of those matrices to predict how likely the user in context is to like or dislike the other movies that he or she has not watched. So, just to give an example, I’m taking a four by four matrix. So we have four items and four users.

This is a sparse matrix. When we apply this algorithm, the matrix actually decomposes into two low-rank matrices. One is a user matrix and the other is an item matrix. Once we approximately calculate these dense matrices and multiply them, we are able to fill these blank spaces, right? Based on those rating values, we are able to recommend items or movies to the users. That’s really how this algorithm works behind the scenes.

Now I’ll switch to Airflow, because Airflow is what really stitches this pipeline together. So Airflow is a workflow management tool. Again, this Airflow is provided by Qubole. When you create a new cluster, Qubole offers a choice of engines. Airflow is one of those services offered by Qubole. In Airflow, I have registered a DAG already and basically created the whole pipeline that we discussed, right? That is the pipeline which loads the data and curates the data using ETL and some data blending, then creates the Hive tables that are required, and runs a Spark notebook.

This Spark notebook is really what I was showing you. This is all automated and managed by the Airflow service. Once we define this DAG (again, this DAG is a Pythonic kind of expression), this is configuration as code. You can specify your DAG by listing out your tasks, describing those tasks in a Pythonic kind of way, and also stitching all the dependencies together. Once you do that, you register your DAG and your pipeline becomes operational. I’m going to quickly run this DAG and really demonstrate how this pipeline works in effect. I hit run and I’m going to drill down into this graph view. So, it starts with the first task. As you can see, if you go to the QDS, again, this whole Airflow is integrated into Qubole Data Service. Once I click go to Qubole, it actually takes me to the command that particular task is running.

This particular task is running this command. You can literally see this command running, and it basically created an empty database. Now if you refresh, you’ll see that it’s moving on to the next task. This task is actually importing the data from the MovieLens website and curating that data in a couple of Hive tables. This should take a moment, and once this finishes, we are creating those Hive tables in these two tasks. And finally we’ll be running the notebook. I’m going to go back to Analyze and watch the... okay, it has finished. Now it has moved to the creation of those two tables. Again, clicking on the Analyze command takes me to the command that is actively running. Same thing here. These two commands are running in parallel, which is why you see those branches in Airflow.
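The dependency structure of the demo DAG, including the two table-creation tasks that run in parallel, can be sketched with Python's standard `graphlib` module. This is not Airflow code: the task names are hypothetical labels for the steps described above, and an actual Airflow DAG would declare operators and a schedule instead.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "create_database": set(),
    "import_movielens": {"create_database"},
    "create_movies_table": {"import_movielens"},
    "create_ratings_table": {"import_movielens"},
    "run_spark_notebook": {"create_movies_table", "create_ratings_table"},
}


def execution_waves(dependencies):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose upstreams are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for wave in execution_waves(dependencies):
    print(wave)
```

The third wave contains both table-creation tasks, which corresponds to the branches visible in the graph view.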

Once these finish, it starts running the machine learning pipeline, and then we’ll go into the machine learning pipeline and see how it’s operating. Now it has finished creating the movies table and the ratings table, and we are now running the Spark notebook. This takes me back to the notebook that I was showing you at the beginning of my presentation. This particular notebook is currently being run. If I scroll through this, see, it’s scrolling through the paragraphs it’s running right now. If I step back and start explaining this, what’s happening here is I’m loading the data that is curated into the Hive table. Again, ratings is the primary data here to be able to create that recommendations engine use case. Once I load the data into the ratings data frame, I am able to do some analytics. I did a count on this: there are around 1 million ratings and around 6000 users in my data.
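The load-and-count step can be illustrated without Spark. The sketch below assumes each ratings record is a `user::movie::rating::timestamp` line; that layout, the helper names, and the sample lines are assumptions made for illustration, and the demo itself does this with a Spark data frame over a Hive table.

```python
from collections import namedtuple

Rating = namedtuple("Rating", "user_id movie_id rating")


def parse_rating(line, sep="::"):
    """Parse one 'user::movie::rating::timestamp' line (assumed layout)."""
    user_id, movie_id, rating, _timestamp = line.strip().split(sep)
    return Rating(int(user_id), int(movie_id), float(rating))


def summarize(lines):
    """Return (number of ratings, number of distinct users)."""
    rows = [parse_rating(line) for line in lines if line.strip()]
    return len(rows), len({row.user_id for row in rows})


# Made-up sample lines, not taken from the real data set.
sample = [
    "1::10::5::0",
    "1::20::3::0",
    "2::10::4::0",
]
print(summarize(sample))
```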

Recommendation Model

I’m using that data to be able to recommend movies. I’m also loading the movies data, which is also a critical piece. It’s the metadata associated with the movies. With that, I am ready to do the modeling. Before I start modeling, the first thing I have to do is split my data into an 80/20 split. The ratings data, which is more than a million rows, is actually being split into 80% and 20%. The 80% gets used for training and the remaining 20% will be used for testing or validating the model. We’ll use the training data set here to train the model. Again, we’ll use the ALS (alternating least squares) algorithm. ALS is an algorithm that realizes collaborative filtering using the matrix factorization that we discussed earlier. It really does that by solving a series of least squares regression problems. Here, all we need to specify to the ALS algorithm is the user ID column, the item column, which is the movie, and the rating, the value of the rating that we want to use for this.
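To make "a series of least squares regression problems" concrete, here is a minimal pure-Python sketch of alternating least squares on (user, item, rating) triples. It is not Spark's implementation: the rank, the regularization value, the sample ratings, and all function names are assumptions chosen for illustration.

```python
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares_step(fixed, observed_by_key, rank, reg):
    """Re-fit one side's factors while the other side stays fixed.

    Each row is a ridge regression: (F^T F + reg * I) w = F^T r,
    taken over that row's observed ratings only.
    """
    fitted = {}
    for key, observed in observed_by_key.items():
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, rating in observed:
            f = fixed[other]
            for i in range(rank):
                b[i] += f[i] * rating
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        fitted[key] = solve(a, b)
    return fitted


def als(ratings, rank=2, reg=0.1, iterations=20, seed=0):
    """Factor (user, item, rating) triples into user and item factor dicts."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for user, item, rating in ratings:
        by_user.setdefault(user, []).append((item, rating))
        by_item.setdefault(item, []).append((user, rating))
    items = {item: [rng.random() for _ in range(rank)] for item in by_item}
    users = {}
    for _ in range(iterations):
        users = least_squares_step(items, by_user, rank, reg)  # items fixed
        items = least_squares_step(users, by_item, rank, reg)  # users fixed
    return users, items


def predict(users, items, user, item):
    """Predicted rating is the dot product of the two factor vectors."""
    return sum(u * v for u, v in zip(users[user], items[item]))


def loss(ratings, users, items, reg=0.1):
    """Regularized squared error that each least-squares step reduces."""
    error = sum((r - predict(users, items, u, i)) ** 2 for u, i, r in ratings)
    penalty = sum(w * w for vecs in (users, items) for vec in vecs.values() for w in vec)
    return error + reg * penalty


# Made-up ratings with blank cells (for example, u1 never rated m3).
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.0),
    ("u2", "m1", 4.0), ("u2", "m2", 3.0), ("u2", "m3", 1.0),
    ("u3", "m2", 1.0), ("u3", "m3", 5.0),
]
users, items = als(ratings)
estimate = predict(users, items, "u1", "m3")  # fills in a blank cell
```

In Spark, the equivalent estimator is `pyspark.ml.recommendation.ALS`, configured with `userCol`, `itemCol`, and `ratingCol`, which matches the three columns mentioned above.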

There are two kinds of recommendation models. One is based on explicit ratings and the other is based on implicit factors. For this demo I’m using explicit ratings, because the rating is really explicit; it’s provided by the user. Implicit factors are really views, clicks, and impressions. I’m not going to demonstrate that in this demo. Once we define the model and call ALS fit on the training data set, the model gets trained. This model is trained on the 80% of those million records, and then we are ready to test and validate that model. This particular paragraph is actually testing the model. Note the cold start strategy: there is a problem with collaborative filtering in that a new user cannot be recommended movies because the user is not in the matrix. We set the cold start strategy to drop, so that predictions for users or movies not seen in training are left out of the evaluation, and then we do the validation of this model.
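The split and the cold start handling can be sketched together, since a random split is exactly what can leave a user or movie in the test set with no training rows. This is a plain-Python illustration with hypothetical helper names. Spark's ALS exposes the same idea through its `coldStartStrategy` parameter, where `"drop"` removes rows whose prediction would be undefined; the filter below imitates that effect before any prediction is made.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of `rows` and cut it into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def drop_cold_start(train, test):
    """Keep only test rows whose user and item both appear in training."""
    known_users = {user for user, _item, _rating in train}
    known_items = {item for _user, item, _rating in train}
    return [row for row in test if row[0] in known_users and row[1] in known_items]


# Made-up (user, item, rating) rows.
rows = [(user, item, 3.0) for user in range(10) for item in range(10)]
train, test = train_test_split(rows)
evaluable = drop_cold_start(train, test)
print(len(train), len(test), len(evaluable))
```

The fixed seed only makes the split repeatable; the exact rows that land on each side are arbitrary.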

To validate the model, we’ll use a metric called RMSE, root mean squared error, which actually measures the error. The objective of validating the model is to minimize this error. Once we ensure that the error is in the acceptable range, we move on to scoring the model. Here’s where we actually score the model. Now that we have a model that is validated, we want to recommend to all the users their top ten recommendations, right? If you are a Netflix user, you want to see your recommendations as soon as you log in, right? Here we are actually scoring that for all the users in the database, all those 6000 users. We are writing that back into a Hive table called recommendations. Once we do that, we are able to look at one particular user in context. I have picked user 3000 here and I’m looking at the top ten recommendations.
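Both steps described here, measuring RMSE and picking each user's top recommendations, reduce to short functions. The sketch below is plain Python with made-up scores and hypothetical function names; in the demo, the equivalent work is done by Spark over the full ratings table.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    pairs = list(pairs)
    return sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))


def top_n(predicted, seen, n=10):
    """Return the n highest-scoring movies the user has not already rated."""
    candidates = [(movie, score) for movie, score in predicted.items() if movie not in seen]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [movie for movie, _score in candidates[:n]]


# Made-up predicted scores for one user; "c" was already rated, so it is skipped.
scores = {"a": 4.5, "b": 3.0, "c": 4.9}
print(top_n(scores, seen={"c"}, n=2))
print(rmse([(4.0, 3.0), (2.0, 4.0)]))
```

A lower RMSE means predicted ratings sit closer to the held-out ratings, which is why validation aims to minimize it before scoring every user.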

Again, these are the movies that user 3000 has not watched, and these are the top ten recommendations for that user. Again, this table that we are seeing here is available in Hive. If I go back to explore catalogs, which is part of Qubole, this is a data catalog capability that Qubole offers. This recommendations table contains all the top ten recommendations for all the users in our database. Again, this data set has a million records. I also have a DAG which has 27 million records of data. So this is operational. This DAG has been running, so it speaks to the scalability of this pipeline. I was able to go from 1 million to 27 million records with the same kind of DAG. With that, I’ll stop sharing and hand it over to Jorge.


Thank you, Pradeep. That covers the demonstration portion, and that’s basically the flow that he followed. Pradeep went and grabbed data from a publicly available data set using Airflow DAGs, moved that data set into Hive structures, and then leveraged Spark for machine learning along with other Qubole facilities such as notebooks. We told you we were going to cover the architectures of a couple of our customers. Let’s start with Lyft. In the early days, this is what the Lyft architecture used to look like. They were ingesting from a variety of data sources, moving the data into cloud storage, and then performing ETL into Redshift, Amazon’s proprietary columnar database. They were using visualization tools and machine learning tools at the back end. Now, they realized that they were not scaling. Their data demands were greater than what the infrastructure could support, and they started using Qubole. This is how they use Qubole today.

They still ingest from a variety of data sources into cloud storage. They still use Redshift. In order to help them scale, they also use Hive for scheduled and ad hoc queries and Presto for scheduled and ad hoc queries, leveraging our metadata store. They use the Airflow facility to orchestrate those pipelines, right? Now they’ve added some consumers to those data pipelines. And here’s a quote from their CTO. Essentially, it’s saying that in less than a year, their data grew by a factor of 100. Qubole allows them to continue to scale quickly without necessarily increasing their infrastructure team, which is one of the aspects we discussed earlier. Another important customer for us is Ibotta. Essentially, they help consumers get refunds by analyzing their receipts. Maybe there was a coupon that wasn’t applied. As you can see, they have a large number of data sources and they use all of the Qubole engines that we discussed.

They use Spark, Hive, and Presto for different types of queries. That feeds into a number of data marts from which the information gets consumed. All of this is orchestrated using Airflow. These two are very typical architectures of how our customers leverage Qubole. The big thing for Ibotta is that, obviously, there are cost savings, but they are also able to create and modify their product faster so that they can help their customers better. Here are a few of our customers; a bunch of these names you would recognize readily. Let’s talk about why customers choose Qubole. First, three times faster time to value. Speaking about Ibotta, right? They can build product faster. Second, it’s a single platform that gives you multiple engines, supports you in multiple clouds, and is a shared infrastructure for all users. The third aspect, which is very important, is the ability for one administrator to support multiple users.

This allows you to scale faster and give more users access to the data, and do that more efficiently. Then, of course, there is the ability to lower your cloud costs. Cost overruns are one of the big things that impact machine learning projects.

Batch ETL

The best engine, from my perspective, for batch ETL would be Hive. It’s a very resilient and fairly predictable large-scale distributed engine. Spark is great for machine learning and streaming applications, but if you try to do batch ETL in Spark, you kind of accrue technical debt. Again, it all depends on the skill sets that are available on the team and all those dynamics at play.

Great. Thank you. I’ll pause to give people a chance to submit any other questions. As I do that, on the screen: if you’re interested in getting a test run of Qubole, we do have Test Drive available, which is a 14-day free trial in which we preload all the data sets and give you some sample use cases to run through. I would encourage you to check that out if you’re interested. You can also find that in the attachments and links section. There was one more question that just came in. Jorge, I think you might have covered this earlier, but can you use Presto for building pipelines?

Sure. The short answer is yes, you can use it, but make sure that the queries are not long-running queries, because Presto has limited fault tolerance. I’ve heard of customers doing that, but they need to be short-running queries. Okay, great.

Build Data Pipelines

Yeah. So, as we wrap this up, I want you to walk away with these key messages. Number one, the platform that you use should allow you to run multiple engines. Right? So, again, building data pipelines is a lot like building a house, right? You will need more than one tool; you will need more than one engine to build those pipelines. Then make sure that the platform allows you to scale. The way you scale is by having a high user-to-admin ratio. Right? Our customers experience a ratio of one admin to 50 users, or one admin to 100 users, and in the high-end cases, one admin to 185 users. Make sure that the platform you select is cloud native, takes advantage of the scalability of the cloud, and also supports multiple clouds. And I think that’s it.