Fast and Cost Effective Machine Learning Deployment with S3, Qubole, and Spark


Female Speaker: Ladies and gentlemen, thank you for standing by. During the presentation, all participants will be in a listen-only mode. Should you wish to ask a question during the presentation, please use the chat feature located in the lower left-hand corner of your screen.

As a reminder, this conference is being recorded and will be made available on Qubole’s website shortly after the presentation. I would like to now turn the conference over to Alex Aidun, Training Engineer at Qubole. Alex, please go ahead.

Alex Aidun: Hi, everyone. Thanks for joining today’s webinar. We are delighted to share with you a use case from MarketShare: you’ll hear how they store, access, and process vast volumes of data using Qubole, Spark, and S3. First, we’ll hear briefly from AWS about improvements to S3. As a reminder, please feel free to type any questions into the chat box; we’ll address them at the end of the presentation.

I’d now like to hand things over to Brandon Chavis, Solutions Architect with AWS.

Brandon Chavis: Thanks, Alex. Hi, everyone. I’m Brandon, and I’m here to talk to you about Amazon S3. In case you’re not familiar with it, let’s start with a quick primer. Amazon S3 stands for Amazon Simple Storage Service. S3 provides developers and IT teams with cloud storage that’s secure, durable, and highly scalable. S3 is object storage with a simple web interface that allows you to store and retrieve any amount of data from anywhere on the web.

A critical part about S3 is that you only pay for the storage you actually use; there’s no minimum fee and no set-up cost. That’s important because Amazon S3 eliminates the need for capacity planning, allowing you to scale on demand and only pay for the capacity you actually need at any given time. Also, as your data ages, Amazon S3 takes care of automatically and transparently migrating it to new hardware behind the scenes as hardware fails or reaches end of life. It’s extremely durable and built around redundancy. It’s designed for 11 9s of durability, which means it’s highly unlikely that you will lose any data. S3 also provides several mechanisms to control and monitor who can access your data, as well as how, when, and where they can access it. We’ve made huge investments into making Amazon S3 very durable, very redundant, and very secure.

S3 can be the primary and permanent copy for all of your data because of the durability, scalability, and availability characteristics that it has. S3 commonly serves as a foundational component in AWS architectures and many, many workflows. It offers a variety of storage options to support different workload characteristics without you having to purchase any hardware or sign any contract. You can simply transition your data to different storage classes based on the access patterns for your data as you see fit. It’s definitely worth mentioning that AWS has a very large ecosystem of partners that builds solutions around S3, which supports the notion that it’s enterprise-grade and a good fit for a variety of workloads and applications, whether that’s analytics and big data, media, or high-compliance workloads.

Let’s dig into the idea of storage classes on Amazon S3 just a little bit. When you think about the typical life cycle of data, it’s generally the newly created data that’s accessed very frequently, and that frequency of access pretty much tapers off during the lifespan of the data. We’ve introduced a new storage class recently called Standard-Infrequent Access and this new storage class offers the same great durability and performance of Amazon S3 with slightly lower availability. It’s ideal for workloads that are colder and less frequently accessed, hence the name.

If you don’t want to think about your data access patterns, but you just want high durability, availability, and performance from Amazon S3, you can simply select S3 Standard. If you have less frequently accessed data, you can leverage S3 Standard-Infrequent Access to save on cost while still benefiting from the same durability and performance as S3 Standard.

Then, at some point in time in the future, your data will be ready to be archived because no one’s really interacting with that data actively. You need to just archive it basically for record-keeping or compliance reasons, something like that. That’s where Amazon Glacier comes in. It’s really set up for data you need to keep, but you don’t need to interact with very frequently.

Common use cases for Glacier really include things like logs or anything that you need to just make sure you keep and I think compliance is one of the number one reasons for keeping data long-term in Amazon Glacier. Amazon Glacier can really save you a lot of money if it comes down to keeping things for a long time and not interacting with them frequently.
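The tiering described above, from Standard to Standard-Infrequent Access to Glacier, can be automated with a bucket lifecycle policy rather than moving objects by hand. Here is a minimal sketch of such a policy, expressed as the Python dict shape that the S3 lifecycle configuration API accepts; the rule name, key prefix, and day counts are hypothetical choices for illustration.

```python
# A sketch of an S3 lifecycle configuration. This is the dict you would pass
# to an S3 client's put_bucket_lifecycle_configuration call; the rule ID,
# prefix, and day thresholds below are made-up examples.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-old-logs",     # hypothetical rule name
            "Filter": {"Prefix": "logs/"},  # apply only to keys under logs/
            "Status": "Enabled",
            "Transitions": [
                # After 30 days, move objects to Standard-Infrequent Access.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # After a year, archive objects to Glacier.
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
```

Once applied to a bucket, S3 performs the transitions automatically, so cold data migrates to cheaper storage classes without any application changes.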

That’s a quick primer. Now we want to talk about what we’ve done over the past couple of years with Amazon S3. In general, we’ve had a very busy year across the AWS platform. We launched 722 new features and services in 2015, nearly a 40% improvement year-over-year. As of March 1st, 2016, we’ve released 106 new features and services, including eight significant feature releases for S3 alone. These releases are in addition to our continued focus on core fundamentals like security, durability, availability, and performance.

S3 is really growing continuously; we regularly peak at millions of requests per second and store trillions of objects. Given our scale, one of the things we think about is, how can we help our customers manage the billions and billions of objects they have on S3? How do we really help our customers get more out of the data on Amazon S3? We strive to provide the right tools and capabilities to help you get the most out of your data.

Digging in a little bit into some of the specific releases we’ve had, one of them is event notifications. This is the ability to trigger notifications when new objects are added to an Amazon S3 bucket. You can use this to initiate processing on objects as they arrive, or to capture information about the objects and log it for tracking or security purposes. You can also use this to add logic within your application, or use something like SNS for push notifications, email, or mobile alerts; SQS for triggering workflows that pull from a queue; or AWS Lambda, which lets you run code in the cloud without managing or provisioning servers, where you only pay for the compute time you consume.

S3 event notifications really open up a lot of new workflows that previously didn’t exist, because you can simply react to changes. You don’t have to poll, there are no fleets of workers to manage; notifications make it simple and you can focus on how your application reacts. You can also architect applications that have new functionality driven entirely by these events. It’s also very fast: if you need to process data quickly when new objects arrive, on average these notifications are sent out in less than a second.

Third, we think this really helps integrate S3 with the rest of the AWS platform, and it emphasizes the concept of event-based workflows. Again, you can architect applications in ways where you no longer have to poll for certain things happening; you can just trigger workflows when new data is added to S3.
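To make the event-driven pattern concrete, here is a minimal Python sketch of a Lambda-style handler that reacts to an S3 event notification. The event shape follows S3’s documented notification record format; the bucket and key names are made up, and the processing step is left as a placeholder.

```python
# A minimal sketch of a handler reacting to an S3 event notification.
def handle_s3_event(event):
    """Return (bucket, key) pairs for every new object in the event."""
    new_objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        new_objects.append((bucket, key))
        # Real code would kick off processing here, e.g. enqueue a job
        # or start a Spark workflow on the new object.
    return new_objects

# Example event, trimmed to the fields used above; names are hypothetical.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-data-bucket"},
                "object": {"key": "incoming/ratings.csv"}}}
    ]
}
```

Because the handler is invoked per event, there is no polling loop anywhere in the application: S3 pushes the work to you.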

Next is cross-region replication. This makes it possible for you to keep your data hundreds of miles apart for compliance or regulatory reasons, or to move your data closer to your end users. Cross-region replication automatically replicates every object uploaded to a particular S3 bucket to a designated destination bucket in a different AWS region. Even though Amazon S3 provides 11 9s of durability out of a single region, some of our customers were asking us to help them automate replication of objects between regions, to help them achieve their compliance objectives, lower latency for their customers, or enhance security.

This is also good for low-latency requirements. If you have a use case where the volume of data delivered isn’t high enough to benefit from a CDN like CloudFront, replicating data closer to your end users can improve the experience and give them a lower-latency connection to your data in S3. Again, for compliance workloads, some of our customers have told us that replicating data across different regions helps them meet internal compliance and best-practice guidelines by storing that data a certain distance from the original copy.
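As a sketch of what enabling cross-region replication looks like, here is the configuration shape, as a Python dict, that the S3 replication API accepts; the IAM role ARN and bucket names are hypothetical.

```python
# A sketch of an S3 cross-region replication configuration. This is the dict
# shape passed to put_bucket_replication; the role ARN and bucket ARN below
# are placeholders, not real resources.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/replication-role",
    "Rules": [
        {
            "ID": "replicate-to-eu",
            "Prefix": "",        # empty prefix = replicate every object
            "Status": "Enabled",
            "Destination": {
                # Destination bucket must be in a different region.
                "Bucket": "arn:aws:s3:::my-bucket-eu-west-1",
                # Replicas may optionally use a cheaper storage class.
                "StorageClass": "STANDARD_IA",
            },
        }
    ]
}
```

Note that cross-region replication requires versioning to be enabled on both the source and destination buckets.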

Another new feature is VPC endpoints. In the past, S3 was only available via public endpoints, which meant that your instances, for example, needed internet access so they could reach the public S3 endpoint. Now, with VPC endpoints, you don’t have to use an internet gateway or manage a NAT instance to establish connectivity from within your Amazon VPC to Amazon S3. Your instances can have a direct private connection to your S3 bucket. This is an easy-to-configure, reliable, and secure connection to S3 that allows EC2 instances running in private subnets to have controlled access to S3 buckets, objects, and API functions the same way they would before, but without requiring internet access.

Another key component of managing your data is understanding what data you have and how that data is being used. We’ve introduced new storage metrics for Amazon CloudWatch.

These new metrics help you understand how your usage of S3 changes over time. CloudWatch lets you set up alarms on these metrics so you can get alerts as usage patterns change. We also introduced the ability to track API calls made to your S3 buckets using AWS CloudTrail; you can use CloudTrail logs to demonstrate compliance and improve the security of your S3 buckets.

Then finally, we wanted to talk about Amazon Snowball, which is a 50 TB storage device that allows you to literally ship your data to AWS for import into S3. The Snowball appliance is purpose-built for efficient data storage and transfer. It’s rugged enough to withstand six Gs of impact, and at 50 pounds, it’s light enough for one person to carry. It’s entirely self-contained, with power and ethernet in the back and an E-ink display on the front. It’s weather resistant and serves as its own shipping container. If you don’t have the ability to transfer large amounts of data over your internet connection up to S3, you can simply have a Snowball sent to you, load your data onto it, send it back to AWS, and we’ll upload everything to S3 for you.

Really, we want to give customers options and help them get the most out of their data on S3. Hopefully, we’ve given you a quick overview of some of the features we’ve been introducing to help our customers do this. We encourage you to take a look at S3 as a great option for storing all of your data in the cloud. If you’d like to learn more, you can revisit this slide after you receive the presentation following the webinar. Take a look at some of our architecture guidelines, as well as our content in terms of presentations, other webinars, and blogs.

I’d like to pass it over to Piero with MarketShare. That’s it for me. Thank you very much.

Piero Cinquegrana: Hey. Thank you, Brandon. Very interesting presentation. I learned a lot of things that I myself didn’t know, so thanks very much for that. Today, I’d like to focus on a use case of how to use S3 and Qubole to store, access, and process data in a machine learning pipeline. First of all, my name is Piero. I’m a data scientist at MarketShare, and I’d like to give you a little bit of an overview of what MarketShare does. MarketShare is a marketing analytics firm. It was founded in 2005.

Our core offering is MarketShare DecisionCloud. It’s a platform for attribution, planning, allocation, benchmarking, and pricing. We have a variety of clients across different verticals, and we work with many of the Fortune 50 companies. MarketShare was acquired by Neustar in November 2015. This is how the MarketShare DecisionCloud looks. It’s composed of different apps. The difference between those apps is the level at which the insights are generated and the allocation happens. We go all the way from quarterly and yearly planning for marketing down to very tactical weekly decisions for TV and digital allocation.

Our bread and butter is using machine learning pipelines to help companies make decisions. We ingest vast volumes of data coming from clients or from third parties, build generic or bespoke models in our workflows, and then provide reports, planning, and allocations as the outputs. Our clients log in to our DecisionCloud and are able to either see the reports or do wargaming with the models that we built, using the data that we provide or scrape from the web. That’s MarketShare.

For those who are not that familiar with marketing, marketing has been evolving rapidly since the ’80s. Back then, it used to be much simpler to be a marketer. You only had a few TV channels, plus cable, radio, newspapers, magazines, and maybe cinema: very few outlets on which to promote and advertise your brand. Fast forward 30 years, and you have a plethora of channels and media outlets, and marketers are increasingly having trouble allocating across so many different avenues.

They are increasingly asking us, “How can we optimize and allocate to reach our customers, and understand the customer journey that brings those customers to purchase our products?” With all of this change in marketing comes increasing complexity and volume of data. Our clients and ourselves have come to grapple with many, many more data streams and data points, and our models are becoming increasingly complex.

When we deal with a customer deployment for the DecisionCloud, we have disparate sources of marketing data: event-level weblogs; post-logs for TV buys coming from cable TV, where the timestamp and network of each individual ad buy are listed; ad server logs, where all of the online display ads and paid search ads are listed at the event level; a series of marketing events, like PR events or other live events that the brand promotes; a time series or event-level log of online and offline sales; as well as other inputs like surveys or webscraping of blogs and third-party sources to evaluate the brands and the products being sold.

MarketShare has to deal with this amount of data, process it on a weekly or daily basis, ingest it into our machine learning workflow, refresh the models and insights, and then provide the outputs to our customers. This is a very difficult task, and we’ve come to the realization that we really want to focus on our core expertise, which is building the machine learning pipeline, and outsource the function of managing the storage and the cloud service to a third party.

We store our data very cheaply and reliably, as Brandon mentioned, on S3, and use Qubole to bring up a cloud-based cluster, copying the data onto HDFS on AWS instances as needed, and bringing the cluster up and down as needed for our deployments or for our processing of the data. We’ve been experimenting with Spark over the last couple of years. We found it to be a great solution for both our data processing and our machine learning pipeline, as well as for writing the output back to a database. I’ll go into a little bit of detail later.

Essentially, we ingest the data, process it, put it into a data frame, build the model, and then the model provides some outputs and those outputs then are reported back to our customers by reporting and app layer. Basically, a front-end website where customers can log in and look at the reports that our machine learning outputs produce, and then navigate or slice and dice those reports, and also are able to run simulations and optimizations using the models that we estimated using their data.

This is all done essentially using third-party cloud-based services like AWS S3 and Qubole. If we were to internalize these functions, we would have to manage a large department of cloud engineers to set up our cloud storage and instances. It would probably be much more costly at our scale than using third parties like Amazon and Qubole.

Let me move on to the next slide here. This is what I was mentioning before. Our pipeline is divided into having a Data Lake, where all the data is stored in a single place and we can access it, process it, and build our models on it. S3 is definitely an example of where we can store the data cheaply in a single environment and have fast access to it. Then, we have a learning system. The learning system is composed of an estimation engine; it could be a linear regression, it could be a neural network, whatever model you’re using, this is a generic term. Then there’s attribution. For us, attribution means explaining what happened in the past: if I ran events X, Y, and Z, how much of my sales can I attribute back to those events? That’s basically what attribution is, and that’s our core expertise.

Then, there’s the scoring of the models, the backward- and forward-looking prediction, and the simulation and optimization environment that we provide in the application. Once you build the model, you’re able to move different levers within the model, as well as ask the model for an optimal allocation across all those different marketing channels that I showed before. Say, if you historically spent 30% of your budget on TV and 10% on paid search, the optimal allocation might say, “Actually, you should be spending 10% on TV and 30% on paid search, because this is what the historical return has shown to be optimal.”

Then, the reporting engine is just, again, the way you report the results and the application layer is just the web interface to interact with the model and the insights at, basically, the DecisionCloud that I showed you before.

To go a little bit more into detail, I have several examples here of the different types of inputs that we’ve encountered in our work. It could be a text file, a Hive table, a JSON file, or an HTML page. We do a series of transformations, such as aggregations, joins, or feature engineering, and we build data frames out of these inputs. Then, we combine those data frames into our design matrix and run our estimation, scoring, attribution, and optimizations, which produce the model outputs. The application, the MarketShare DecisionCloud, then reports on these model outputs, like an attribution report, or on the raw data themselves, like I’m showing here, to see what the brand has done historically, what kind of spend they have made, and what the historical allocation is.

In essence, working with Spark has allowed us to handle very disparate sources. It has a very diverse set of connectors: the ability to import text files or JSON directly into Spark, or to work with Hive tables or a Cassandra database. It provides both ETL and machine learning capabilities under a single roof. Before, you might have had to do some processing in Hive or some other SQL engine, then move to R or Python for your machine-learning pipeline, then have a connector in Java or some other scripting language to link the ETL part to the machine-learning part, and then something else writing the output.

Spark really brings everything together under a single roof. You’re able to read data, do machine learning, write your own estimators and custom libraries, and then write back to a database. Along with that, the performance is much better, at least from what we’ve seen, than the Hadoop-based tools we used previously. We’ve seen something like a 2x performance improvement compared to before.

To summarize, marketing has a lot more inputs than before. We’ve been trying to bring all this together under a single roof, storing the data in a single storage unit like S3, using Qubole to bring up instances on the fly, and using Spark as the technology. Using Qubole has allowed us to reduce costs, increase performance, and focus on our core expertise, which is marketing analytics.

Now, I see some questions here in the chat box; I will address them in the Q&A. I will let Alex go through his demo, and then we can move on to the questions.

Alex: Thanks, Piero. Sounds like a plan. My name is Alexander Aidun. I am a training engineer here at Qubole. I am going to speak a little bit about the Spark solution, and then talk about a Qubole feature called Notebooks, which will show you how you can do machine learning examples or analysis directly inside Qubole against data in S3, rounding out the discussion that’s been presented thus far with a little bit of context and some demonstrations.

Spark is really a vast toolkit, providing analysts, engineers, and developers the ability to write queries, commands, or workloads in many different languages or scripting methodologies. There are also many components to the Spark Core, which give users and developers the out-of-the-box ability to leverage sophisticated algorithms or other types of features demanded by more recent data processing use cases.

Another valuable part of Spark is the ability to run Spark on a variety of different clusters. These different advantages that Spark offers, the language support, the components of the Spark Core, such as data streaming, machine learning, interactive analysis, and graph analysis, combined with the ability to interact with different distributed file systems, make Spark a very appealing option. As we go through the example, I will demonstrate how we can very easily connect to S3 and leverage the components of the Spark Core.

The machine learning library that I’m going to discuss, and that is often leveraged in Spark, is the MLlib library. This is an integrated framework for performing advanced analytics. It’s available out-of-the-box and very easy to use. With very minimal configuration, you can produce powerful models and subsequently use those models to do predictive analysis or analytics. In Qubole, we have a feature called Notebooks. Notebooks allows you to step through queries and workloads in an environment that supports visualizations and multiple languages. This is done through a feature called interpreters. These interpreters accept syntactically different inputs from users and process them inside the Spark cluster that’s brought up inside of AWS, pointing to the data inside of S3.

We’re going to use the Notebooks to step through our example. Within the example, I’m going to use both Scala code and SQL code and I’m going to use the machine learning library to create a model and then generate predictions. At this point, I’m going to share my screen. On my screen is the Qubole interface and I have opened a Spark Notebook and I’m going to walk through some of the commands here and talk through what’s happening to provide some context to the discussion that Piero introduced with regards to Spark and machine learning and Qubole.

This first line here, and I put titles on these, you’ll see that I’m creating a ratings variable by pulling data from S3. This is going to create an RDD with the name Ratings Raw inside of my Spark cluster. This command here is Scala, though as a point of context, you could do this in PySpark, for example. In the next step, I do some formatting manipulations, removing my header and creating a new RDD just called Ratings, where the header is removed from the file. This is happening inside of my Spark cluster, since I’ve pulled the data out of S3.

Then, I’m going to perform a map transformation against my RDD called Ratings and I’m going to convert this from a set of strings to a set of integers and doubles. This will also create another RDD. This is all happening within the context of my Spark Notebook. I can step through these one by one, and I’ve already done that prior to our demonstration for time management purposes, but you can see the outputs here indicating that I’m generating new RDDs and of the type MapPartitionsRDD, which indicates that this is partitioned across the nodes in the cluster.
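In pure Python, ignoring Spark’s distribution across the cluster, the two steps just described, removing the header and converting strings to typed values, look roughly like this. The three-column layout (user, movie, rating) is assumed from the MovieLens-style ratings file used in demos like this one.

```python
# A plain-Python sketch of the first Notebook steps: strip the header from
# the raw ratings file and convert each line to typed fields. In the demo
# these are Spark RDD transformations (filter and map).
raw_lines = [
    "userId,movieId,rating",   # header line
    "1,31,2.5",
    "1,1029,3.0",
    "2,10,4.0",
]

header = raw_lines[0]
ratings_lines = [line for line in raw_lines if line != header]  # drop header

def parse_rating(line):
    """Turn 'user,movie,rating' into (int, int, float)."""
    user, movie, rating = line.split(",")
    return (int(user), int(movie), float(rating))

ratings = [parse_rating(line) for line in ratings_lines]
```

In Spark, each of these steps produces a new RDD, and because RDD transformations are lazy, nothing is computed until an action is called; the per-row logic is the same as above.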

The next step here, because I want to do an example with the machine learning library, is to take my movies data and split it 80/20 into two different arrays: 80% of the movies data in one array and 20% in the other. This allows me, still within the cluster, to create two new RDDs, one called Training and one called Test. Now, I’ve taken my movies data and done some formatting to it, all within the context of the Notebook, and I’ve done these steps one by one so I can make sure I’m successful along the way. Then, I create two separate data sets.
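The 80/20 split can be sketched in plain Python as follows; Spark’s RDD.randomSplit does the equivalent across the cluster. The fractions and seed here are illustrative.

```python
import random

# A sketch of an approximate 80/20 random split, standing in for Spark's
# RDD.randomSplit. Each row lands in the training set with probability 0.8;
# the seed keeps the split reproducible across runs.
def random_split(data, train_fraction=0.8, seed=42):
    rng = random.Random(seed)
    training, test = [], []
    for row in data:
        (training if rng.random() < train_fraction else test).append(row)
    return training, test

data = list(range(100))
training, test = random_split(data)
```

Like randomSplit, this gives approximately, not exactly, an 80/20 split, since each row is assigned independently.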

The next step is to import the Spark recommendation library into my Notebook, generate a model using the training data set, and then evaluate the model using the test data set. The next line here, which imports the Spark recommendation library, just takes the available Apache Spark MLlib recommendation library and brings it into my Notebook. ALS here stands for alternating least squares. There is additional documentation on this if you want to search the web; it will bring up a lot more detail about how this is designed and the code that supports it.

I’m not going to go into that right now, but if you are curious and have some time during lunch, I recommend taking a look. The next step here is to generate a model using the ALS recommendation engine available from the Spark MLlib library that I brought into my Notebook. There are some parameters here; again, more detail on these parameters is provided in the documentation online. I’m going to produce a matrix factorization model, which I can now call using dot notation, so model.predict.

I can take my test data and, within that call, structure it such that the model can read it, and then create an RDD which is a set of predictions for different users and movies. As I mentioned previously, we can save data to S3, and I can even register this data as a temporary data set inside of my Spark cluster. Once that’s registered, I can take the movies data from S3 and do the same thing. Now I have movies and I have ratings, and I can also register those as temporary data.

Once I have my predictions registered, my movies registered, and my test data registered, I can start to run SQL queries within my Notebook, producing some analysis of the model’s usability. The query I have here will produce the set of worst predictions, and this little bar graph, which is very easy to create inside the Notebook, gives me a very good idea of how far off some of these actually are.
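The “worst predictions” analysis boils down to ranking rows by the absolute error between the test rating and the predicted rating. A plain-Python sketch, with made-up sample rows:

```python
# Each row is (user, movie, actual_rating, predicted_rating); the values
# below are invented for illustration.
rows = [
    (1, 31, 2.5, 2.7),
    (1, 1029, 3.0, 4.8),
    (2, 10, 4.0, 1.5),
]

def worst_predictions(rows, n=2):
    """Return the n rows with the largest |actual - predicted| error."""
    return sorted(rows, key=lambda r: abs(r[2] - r[3]), reverse=True)[:n]

worst = worst_predictions(rows)
```

In the Notebook this same ranking is expressed as a SQL ORDER BY over the registered predictions and test tables, and the result feeds the bar graph directly.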

Alternatively, I could create a line graph here and just highlight the recommendations for a single user, which is what I’m doing in this query. We can use this line graph to analyze the difference between our test rating and our prediction rating for various movies for this individual user. The Spark Notebook provides the ability to perform step-by-step analysis and model creation using components of the Spark Core, leveraging the features that both Brandon and Piero discussed.

At this point, we are going to move to our Q&A section. As Piero acknowledged, we have some great questions coming in. As a reminder, you can submit questions using the chat box on the left of your screen. If we do not get to your question, we’ll be sure to follow up with you. Our first question, and I believe this is for Piero: what are the reporting and app layers you are referencing in your last bullet point? Is this from within Qubole?

Piero: Yes, the reporting and app layers: we use our own custom reporting and we build our own apps to report on the outputs. Within the analytics team, we use some of those reports in the Spark Notebooks just to do some model validation, like Alex was showing. For our clients, we actually build our own application. I see another question about whether we’re running Spark standalone or on a Hadoop cluster. We use Hadoop for the file system and run Spark on top of that. I believe it’s the default configuration of Qubole. Is that right, Alex?

Alex: Yes, it’s a little bit more involved than that, but we’ve integrated Spark with YARN in our clusters that are spun up inside of our clients’ EC2 spaces. That allows Spark to take advantage of several optimizations and features that we have built in. In short, that is correct.

Piero: Okay. The other question is, in the architecture, which component or capability is from Qubole? Well, S3 is the storage; it’s literally the storage unit. Qubole provides the services to spin up the clusters and AWS instances, buy nodes from AWS, move the files and the configurations on the fly as the nodes are spun up, and then provide the interface to interact with the instances. We don’t have to manage the configuration or the scaling up or down of the nodes as needed. That’s what Qubole provides. Is that a fair representation?

Alex: Yes, really well said. I think the next one is probably for me. How does Qubole support custom configurations for Spark clusters? Through a variety of methods: you can configure different instance types for your clusters, and you can introduce bootstrapping through Qubole for your Spark clusters. We give you the ability to manipulate the size of executors, the number of cores executors use, and how many CPUs are used for a task in Spark. You can manage overhead settings. There is a lot of support within Qubole for custom Spark configuration. Please let me know if that does not answer your question.

Next question: is the processing work being done on the client or on AWS? I believe I can answer that one. The processing work is being done on AWS. That’s one of the major benefits of working with this configuration. The processing takes place in AWS, the storage lives in AWS, and the communication between the two happens in AWS. Qubole is serving as the conductor of the orchestra that we are describing. All of that is retained within our customers’ spaces; all that processing is happening inside of AWS.

How does Qubole differ from Databricks, and what are the advantages? That is a much lengthier discussion, which I will encourage you to follow up on through our blog and with our sales team; they can take that up with you offline. How does Spark perform with regards to hundreds of thousands of features in machine learning? I think I’m reading that right, but I’m not sure I understand the question. Piero or Brandon, if you have thoughts on this one.

Piero: Well, I think they’re referring to very large data frames with hundreds of thousands of columns, and how Spark performs with those. I haven’t worked with hundreds of thousands of columns myself, but I’ve seen presentations on Spark with that kind of feature set, using it for image recognition with a pixel per column, very sparse data frames. I believe it’s been done. I wouldn’t know the performance, but you can look at the AMPLab, I believe it’s the Berkeley AMPLab, which is where the Spark project was spun out of. You may be able to find more detail there.

Alex: There’s a question that’s hanging out in the chat here; I can’t tell if it’s been answered or not, so I just want to address it. Is the data moved from S3 to HDFS for processing, or is it processed in place as it remains on S3? The data does get processed inside of the cluster, but Qubole will write the result set back to S3. In terms of what remains on S3, the original data set and the result set will stay on S3; the data is processed inside of the cluster. Please let me know if that does not answer your question.

It looks like we have no more questions in the chat box. If there are no more coming in, then I just want to say thank you, everybody. We can conclude here. Thank you to MarketShare, our customer and AWS, our partner for joining this webinar. We’ll be posting the recording to the Qubole website and sending an alert to all registrants via email when it is available. You may now disconnect.