Evan Harris – Data Platforms 2017

Evan Harris: Thanks everyone for coming. My name is Evan Harris. I'm a data scientist at Return Path, and I'm going to talk about operating the full data science stack in the cloud and what that means. My big takeaway, the big thesis I hope everyone can go home with, is that operating a data science team fully in the cloud can not only make your team more efficient, more independent, and more powerful, but also empower the individual data scientists on that team to really be technologists, to embrace acting more like software engineers and understanding the technology they're working with.

I'm going to, first, talk about who I am, where I work, and what Return Path does, since it's a lesser known company. Then I'll talk about what I think a data scientist is, what I think a good data science team does, and how the cloud can make that better, and, in the middle, I'll show an example of how working in the cloud makes it really doable. Return Path is an established company that mostly deals with email optimization, but I work on a smaller team within Return Path that focuses mostly on consumer purchase data.

What we ultimately do is acquire lots of consumer purchase data from various sources. This is unstructured data. We package it up, normalize it, structure it, and ultimately resell it. It looks something like this: you get high-resolution, item-level consumer purchase data from a consumer panel that we have, showing the details of what they purchased from hundreds of vendors. This is extremely valuable to financial services and to market research firms that want to understand, in real time, the consumer purchasing landscape, and what people are buying from where and how.

It's a smaller part of what Return Path does, and I work on a data science team within that business unit. I want to talk about what we do as a data science team. We work on external features to this data product; a good example is using models to infer the fulfillment method of some of these purchases. Looking at the unstructured data that we acquire, a person might have bought something, but did they buy it fully in store, did they buy it fully online, or did they buy it online and pick it up in store? We can infer this with models using the unstructured data that we start with.

We also work on internal features; a good example of that is classifying and organizing large amounts of really opaque unstructured data and knowing what to do with it. Things like that really help make our data pipeline more efficient, and it's fun to work with that kind of unstructured data. We also do ad hoc analyses for prospects and clients, showing prospects how to use a really high-resolution data product to better understand their use cases. I've introduced this notion of full stack data science in the title of the talk.

In order to really define that, because it's not a commonly used term, I'm going to start by showing what I think data science is. I was talking to a few people here over the last couple of days, and people were asking me about it. Last night, I sat with a bunch of engineers and we argued about what data science is. No one really seems to have a great definition. Unfortunately, my definition is that I don't really have one. It's definitely a cross-section of different disciplines. I can only really talk about my team's flavor and my team's brand of data science.

If you consider a pretty common Venn diagram like this one, you see data science is a cross-section of things like machine learning and natural language processing, which is basically building models; what we call data engineering, which can also encompass software engineering and DevOps; and, in the bottom right corner, business intelligence and data visualization, which is a lot of what data science is as well. My team, specifically, is right there, on the far edge of data engineering and machine learning.

We build out production data features, and we really serve product, which I think is different from some data science teams who may serve the internal intelligence of a large organization, like building out reporting for executives. We build features that augment a data product. Talking about my team's flavor of data science, that's really where we are; to define data science any further than that gets use case specific. Given that, what do I think a good data science team is doing?

To start, we have to decide what to build. There's this phase of hypothesizing that can come in two ways. One is organic brainstorming: think of data scientists who are really close to product analysts, or maybe are similar to product analysts themselves, coming up with ideas for something to build that might be useful to their clients. The other is partnering with product teams. Product teams have a grand vision, but they may not know much about technology or machine learning.

Data scientists will partner with product to decide the right things to build: not only the best product for the customer, but things that are feasible to build and things that match the skill sets of the data science teams at hand. Then, once you've decided what to build, you think about testing. In some cases, people spend a lot of time at the whiteboard or reading papers or sketching out ideas. Personally, and with my team, I really like to build things.

I think rapid prototyping, sometimes just building a basic version of what you want to build to start and then iterating on that basic version, is really powerful, especially when you pair it with testing at scale. I've heard people in my own organization and other places say, "I have a model, it's really great, it's really accurate, but it doesn't scale." To me, that doesn't make sense. If your use cases for that model require it to scale, then you don't have a good model; it doesn't work. A model that is trivial compared to that model but scales really well is better, because it can scale.

When you pair rapid prototyping and testing at scale, you get really close to production. In this iteration cycle, rapidly iterating on what you've prototyped is really powerful when it's at scale, because there's no massive productionization process. It's ready to roll after you've hypothesized and tested; you've got something that does scale. That's where I introduce the notion of the full stack. Some people talk about the full stack meaning data scientists are more customer-facing. That's true too, and it's good, but it's not really relevant to data platforms; I see the full stack extending a little further towards the back end, into data engineering in the cloud.

In terms of being full stack in your data science operation, what I really mean is that a lot of data science teams will probably focus around the models phase. Some abstraction, a data engineering team or a data pipeline team, is building ETLs for them and offering up pre-processed data, ready for building models. Their output is in turn dished off to other engineering teams that make the data production ready after it's passed through models.

The full stack, as I see it, is really data scientists who do build their own data pipelines, and in addition to that, are familiar with the storage end too. I don’t think a full stack data science team necessarily is building out the storage, but definitely being aware of it. Being aware of your partitioning, being aware of your file types, being aware of where all your data is, why it’s stored the way it is, is really powerful, as opposed to that, again, being an abstraction, because ultimately it allows you to do this rapid prototyping and testing at scale on an independent team all at once.

A lot of times when I'm getting started on a project, I don't know what I really want to build yet. If I had to put in a request to a data pipeline team to build out an ETL and get me pre-processed data before I start to build my model, half the time it's going to be wrong, we're going to start over, and I have to put a request back in a queue. The ability to build out data pipelines for individual projects is really great for me and my team.

You also can go directly to business intelligence and data visualization in this framework. You can build ETL pipelines that go straight there, or build BI tools right off of where your data is stored. When you've built models, you can serve up those models yourselves and dish data back out into features, either out to customers or back into your own storage for internal use, and then come back and visualize the output of your own models if they're used for internal resources. That's really how I view this full stack of data science, and it's really about taking a more engineering approach to building what you build and owning your own data pipelines.

It's extremely useful to pull this out of abstraction and talk about an example, at the risk of spending a lot of time talking about machine learning in a talk about technology. I'm going to do it anyway because I think it's really cool, and I think an example of what my team does and what some of the machine learning is like might be interesting to everyone. Here's a common application for us. Again, we sell this data product, which is a bunch of consumer purchase data: hundreds of millions of products, millions of users, and hundreds, even thousands, of different vendors like Grubhub or Nordstrom.

Here's an example where we have this really high-resolution data, like the exact product title that a user purchased. For a lot of use cases, a tagging exercise is really useful. It can serve as a dimensionality reduction, because when one of our customers is trying to derive insights from this kind of data, grouping by these really high-resolution product titles might not get them very far; the titles aren't really normalized, and that high dimension is hard to deal with.

If we can do a tagging exercise where we take "Euro Delight Pizza" and tag it as 'pizza', or we take this "stripe crewneck t-shirt" and tag it as a 't-shirt', that can be really useful. The reason this is challenging with our data set is that it's very scaled out vertically in terms of industries. We not only have food and clothing, but also airline purchases and shoes and electronics, and the diversity among products that sit in our data set is massive. Again, that dimensionality reduction is really powerful.

So can we do this? Yes, and here's the way it can be done with a kind of full stack data science team. To start, I have all these product titles sitting in S3 as Parquet-formatted files. What I can do is spin up an EC2 cluster and use a Spark machine learning pipeline, which I'm going to leave as an abstraction right now but will elaborate on in a minute.

I can pick up that data, bring it onto the cluster, and build out a machine learning pipeline that does all of my pre-processing, represents my ETL, and gets the data in the exact format I need to start doing some machine learning. I can embed a machine learning model into that pipeline, and after the model transforms the data, I can do my post-processing and stage it, doing all of this in Spark, ultimately staging it for output and tossing it back into S3 in the same format, sitting right next to the data where it started.
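
As a rough sketch of what that round trip might look like in PySpark (the bucket, paths, and the "title" column here are hypothetical, and word2vec stands in for the model stage that gets elaborated on below):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

spark = SparkSession.builder.appName("product-title-tagging").getOrCreate()

# Pick up the Parquet-formatted product titles sitting in S3.
titles = spark.read.parquet("s3://my-data-lake/product_titles/")

# Pre-processing (the ETL step) plus the model, packaged as one ML pipeline.
tokenizer = Tokenizer(inputCol="title", outputCol="words")
word2vec = Word2Vec(vectorSize=100, minCount=5,
                    inputCol="words", outputCol="title_vector")
pipeline = Pipeline(stages=[tokenizer, word2vec])

model = pipeline.fit(titles)
transformed = model.transform(titles)

# Stage the output and toss it back into S3, next to where the data started.
transformed.write.mode("overwrite").parquet("s3://my-data-lake/title_vectors/")
```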

When you sell a data product, that’s really it. I mean your product can be a new data table with a new feature that can map back to the data you’ve already sold to your customers.

What's cool about this, too, is I can bring a visualization layer along with this entire process using Zeppelin. When I'm on my EC2 cluster, I can visualize every step of the pipeline as I go: my data while I'm pre-processing it, my data after my model's transformed it. I can do some aggregations and do some testing on my model, and it's really great to be able to do that visualization, more or less, in stream with this development process.
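
A hedged sketch of what that in-stream inspection can look like inside a Zeppelin %pyspark paragraph, reusing the `transformed` DataFrame from the sketch above (the "vendor" column is a made-up example):

```python
# Inspect an intermediate step of the pipeline as you go.
top_vendors = (transformed.groupBy("vendor")
                          .count()
                          .orderBy("count", ascending=False)
                          .limit(20))

# z is Zeppelin's built-in context; z.show renders a DataFrame as an
# interactive table or chart right in the notebook.
z.show(top_vendors)
```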

I want to elaborate on the abstraction of the Spark machine learning layer that I had there. There's a really cool algorithm called word2vec. It's a shallow artificial neural network, and what I can do is train this massive network on my hundreds of millions of diverse product titles. What the model actually does is build a vocabulary of all the unique terms in this corpus of product titles and project each word into a vector space, where the word's position in that vector space represents its linguistic context, that is, what it looks like next to other words. Again: this giant vocabulary of terms in all of my product titles gets mapped onto a vector space where words that are close together have similar linguistic context.

The word 'pizza' in this vector space is going to exist near words like 'pepperoni', and maybe 'large' or '12-inch' or 'sausage'. Words like 't-shirt' might exist near brands of t-shirts, or maybe even close to 'shoes', because they're both clothing, but they're not exactly the same.
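
Continuing the sketch above, Spark's fitted Word2VecModel exposes that neighborhood structure directly through findSynonyms; the actual neighbors depend entirely on the training corpus:

```python
# The fitted Word2VecModel is the last stage of the pipeline from above.
w2v_model = model.stages[-1]

# Closest vocabulary words to "pizza" by cosine similarity.
w2v_model.findSynonyms("pizza", 5).show()
# On a corpus like this, it might surface words like pepperoni or sausage.
```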

What I can do after I train this word2vec model is transform incoming product titles with it, projecting each product title as a whole, multiple words, onto that vector space. So what I end up with for a new incoming product title is a position in the vector space, based on where all the individual words in that product title sit in the vector space.

One idea in terms of tagging is to write a kind of custom tag extraction, which would say, "Okay, I've got an incoming product title, and it's got a position in the vector space. What are some words near this product title in the vector space?" A simple thing to do is to take maybe a few hundred words that are close to this incoming title in the vector space and see which of them exist in the incoming title.

I kind of force inclusion: in order for a title to get tagged with a word, that word has to be in the title. I don't totally like that, but it's a start. This is a rapid prototype, something done really fast to prove out a concept. The problem, though, is that this doesn't scale. Taking an incoming title, transforming it with word2vec, and finding all the words around it takes too long, especially with the Spark implementation. It's not going to work to bring in hundreds of millions of product titles one at a time, finding the words near each one in the vector space. This doesn't quite work for me.
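
A sketch of that rapid prototype under the same assumptions as above; note the per-title findSynonyms call, which is exactly the part that doesn't scale:

```python
from pyspark.sql import Row

def tag_title(title, w2v_model, spark, n_candidates=200):
    words = title.lower().split()
    # Word2VecModel.transform averages the vectors of the title's words,
    # giving the title a single position in the vector space.
    doc = spark.createDataFrame([Row(words=words)])
    title_vec = w2v_model.transform(doc).first()["title_vector"]
    # A few hundred vocabulary words near the title's position...
    nearby = {r["word"] for r in
              w2v_model.findSynonyms(title_vec, n_candidates).collect()}
    # ...but a word only becomes a tag if it's in the title itself.
    return [w for w in words if w in nearby]

# e.g. tag_title("euro delight pizza", w2v_model, spark) might return ["pizza"]
```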

What I can do is add in a new layer. This is a kind of internal dimensionality reduction within the problem I'm working on. I can take all these incoming product titles that I've transformed with word2vec and cluster them using K-Means, which is a really common unsupervised learning algorithm used for clustering.

I can find about a hundred clusters of all my product titles. The hundred is kind of arbitrary, but it seemed reasonable. Instead of taking every product title that comes in, one at a time, and finding all the words near it in this vector space, I can do that just once for each of the hundred clusters. Each of those clusters has a center somewhere in the vector space, and I can take that cluster center and find all the words really similar to it. Then, when I have an incoming product title, I can say, "Okay, which cluster do you belong to, and of that cluster, what words are near you?" If any word near the cluster exists in that product title, the title gets tagged with it. This is super fast. It scales extremely well because you can pre-compute those cluster centers and the words near those cluster centers in your word2vec vector space.
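
A sketch of that scalable version, again building on the objects defined above; the cluster count of 100 and the 200 candidate words per center mirror the arbitrary-but-reasonable choices described in the talk:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# Cluster all the title vectors into ~100 groups.
kmeans = KMeans(k=100, featuresCol="title_vector", predictionCol="cluster")
kmeans_model = kmeans.fit(transformed)

# One findSynonyms call per cluster center (100 total), done up front,
# instead of one call per incoming title (hundreds of millions).
words_near_cluster = {
    i: {r["word"] for r in
        w2v_model.findSynonyms(Vectors.dense(center), 200).collect()}
    for i, center in enumerate(kmeans_model.clusterCenters())
}

# Tagging an incoming title is now a cluster lookup plus a set check.
clustered = kmeans_model.transform(transformed)  # adds the "cluster" column

def tag(words, cluster_id):
    return [w for w in words if w in words_near_cluster[cluster_id]]
```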

This is getting kind of deep into the machine learning, but it's interesting to me, and it was a cool way to rapidly prototype this idea of using word2vec to tag our product titles.

The examples from this tagging exercise are cool. You take something like a buffalo chicken sandwich, and you realize that 'chicken' and 'sandwich' are the interesting words there, not 'buffalo'. Again, these tags are really trying to get at the essence of what the product is and get rid of the noisy words, the qualifiers, and the adjectives. So you get a word like 'shorts'; you get 'boot' and 'hiking' out of a longer product title; and this last one I really like: you get that it's a case, you get that it's Apple, and you get that it's an iPhone. So you get a brand, a make, and the product type. And a word like 'pink'? 'Pink' is going to be a really common word, but it's really far away from this Otterbox case in our vector space; it's not going to be highly concentrated with these other words like 'Apple' and 'iPhone' and 'case'.

It's a cool example, but moving on from there, I want to start talking about the cloud, because I really haven't much yet. What does this have to do with the cloud? Why does the cloud make this more interesting? Why does the cloud make a project like this better?

Basically, I had another example that I was going to use, something we actually use in production, and I had slides for it and everything, but I actually got rid of it. I woke up one day and thought, "Oh, you know, I actually have this idea. I've been playing with word2vec. I think this project would be really cool for the talk I'm doing." The reasoning was, "Okay, it would be really cool if I could provision all the hardware and all the software to do this project on the spot, with no bottlenecks, without talking to anyone else, entirely as a data scientist with the tools that I have." That's pretty doable given the tooling we've set up for ourselves.

Before I started using Spark, I tried on a single server to use a module called gensim, which is a popular Python library that has a word2vec implementation, and I can provision that on demand.
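
For reference, the single-server gensim version is only a few lines; the input file here is hypothetical (one raw product title per line), and the query assumes "pizza" made it into the vocabulary:

```python
from gensim.models import Word2Vec

# One tokenized product title per element.
with open("product_titles.txt") as f:
    sentences = [line.lower().split() for line in f]

# Note: in gensim releases before 4.0, `vector_size` was called `size`.
model = Word2Vec(sentences, vector_size=100, min_count=5, workers=8)

print(model.wv.most_similar("pizza", topn=5))
```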

That means an EC2 instance with Amazon to do that exploration, using a machine image that we have with all of the common Python data science libraries that everyone uses. We have that set up for our team, so I can say, "I want an EC2 instance on the spot; install all the stuff on it." I got down the road and decided I really wanted to try this in a distributed way, because it's challenging on a single server.

I've got hundreds of millions of product titles to train this massive word2vec model, plus the familiar K-Means clustering algorithm. This is challenging; I'm going to need some more power to do this. I decided, okay, Spark has a word2vec implementation, Spark has a K-Means implementation, I like using Spark, and it's pretty easy to spin up a cluster to do these kinds of things. Again on spot instances, I can provision myself, at a pretty small cost, a decent-size cluster of EC2 instances using Qubole.

It can connect to my data store instantly, because we've built out IAM roles with Qubole to allow that to happen automatically, and I don't have to care about that at all. I can pick exactly the Spark version I want, which is great for me too, because the machine learning libraries in Spark especially are iterating rapidly, and if you're stuck using Spark one-point-something right now, you're missing out.

To complete that visualization layer I was talking about with Zeppelin, Qubole has that built right on top of the cluster that I spin up, in their UI. They have custom notebooks that are their own flavor of the Zeppelin notebook. Also, my data IO is taken care of automatically. Qubole lets me talk to S3, and I don't really have to do anything beyond that; there's no other data store involved in this process.

The reason I'm able to do this is mainly the tooling, but also self-service data and self-service hardware, which I think are two of the big keys to making a data science team or a data scientist independent, not relying on ops teams or data engineers to get hardware or to build out pipelines. Out of the example and back into some abstraction: you come up with a good idea like this, and I wanted to build this project for this example in a couple of sittings.

If we had on-premise hardware, it just wouldn't have been possible. I've been somewhere we had on-premise hardware, and imagine the idea of getting 15 nodes to myself, getting the exact version of Spark that I want, and not interfering with production jobs or ad-hoc workflows. And what if I didn't like that number, the 15? There was an educated reason for picking it when I started, but it's kind of arbitrary, and I could end up needing a lot more.

On premise, I don't really have a way to do that, and a project like this, something I just want to play around with, would be dead on arrival. I'd spend weeks trying to justify it to product managers and ops teams. With the cloud, you can break through that wall, provision your own hardware, install whatever software you want, build out your own data pipelines, and go straight to features, because in this situation, again, we sell a data product.

If I can pick up a few hundred million product titles, tag them, and drop them back off on S3, that's really close to product. In our setup, it's really awesome to be able to get the hardware taken care of without any external help. Doing this involves a pretty big paradigm shift. I was talking to someone the other night, and they were asking me why the compute and storage split is so challenging for data scientists and analysts.

I didn't have an awesome answer, but I thought about it for a while, and I really think it's about where a lot of data scientists and analysts are coming from. Picture someone who has worked most of their life on their laptop or desktop, doing their compute with Excel or with RStudio or something, or maybe using one home-based server that they've had for months and that's never going away.

Then you tell them, "Hey, actually you're going to be using an arbitrary server that you might spin up and spin down at various times. Your data is going to live somewhere else, so you need to talk to an immutable object store to actually use any of it. Also, your data might sit in this data lake, and it's not totally structured. And every time you get a new computer, you're going to need to install some stuff."

It's very, very different from working on a desktop or a laptop or a single remote computer, especially when you don't have a computer science background or you're not a professional software developer. I think it really is a paradigm shift for software developers as well, but definitely for data scientists. The paradigm shift was a learning curve for me, but then you learn to love it.

It's really awesome to be able to get the exact hardware you want, when you want it, and only for as long as you want it, to be able to turn off your hardware when you're not using it and have your data live persistently in a data lake. It's also great to have unstructured or semi-structured data in a data lake where you don't even have to add a schema to it; you can have Spark infer the schema of some unstructured data that's sitting there.
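
For example, a couple of lines of PySpark are enough for schema-on-read against the lake (the path and layout are hypothetical):

```python
# No schema was ever declared for this data; Spark infers one
# by sampling the files at read time.
raw = spark.read.json("s3://my-data-lake/raw_receipts/")
raw.printSchema()  # the inferred structure of the semi-structured data
```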

It's really, really powerful, again, doing all this in an independent way. Looking at our specific ecosystem of services and software, we've got our Amazon S3 data lake sitting in the middle here. What's cool is that, like in my example, when I wanted to use gensim on a single EC2 instance to prototype training a giant word2vec model, I could do that simultaneously with a Qubole cluster running Spark and doing the exact same thing on top of the exact same data, without the two interfering with each other, and without interfering with anyone else on an EC2 instance who's building models or doing ad-hoc data munging or visualization.

On the distributed side, what's really amazing is that we can have BI analysts and data quality analysts using Presto and Hive, Presto especially, to get really quick responses from their data in an interactive way, on a separate cluster from mine, while I'm on Spark writing machine learning applications, again on top of the exact same data.

On that side, you can see also that with Presto, mainly Presto but also with Spark, you can get to the visualization layer in Zeppelin through Qubole as well.

On the single-server side, I talked a little about it before, but we have machine images that we share as a team that give us all of the standard Python libraries you'd use for machine learning, plus Jupyter Notebooks for visualization. You could add R and things like that for some analysts as well.

Again, when you provision your own hardware, there's no big ops barrier to getting the software you want installed. All of this comes really well packaged. There are kind of three prongs to operating efficiently in this environment. One is tooling. If I'm going to be spinning up clusters all the time, having something like Qubole makes it easier. We worked with EMR for a while, and it was nice for a transient cluster, when I'd know exactly how many nodes I need and I want to spin up a cluster, do a bunch of stuff, and pull it down.

It doesn't work as well for the ad-hoc use case, where you don't want to get up in the morning, go onto EMR, and say, "Okay, does anyone else have a cluster up? If they do, how would I share it with them? Am I going to have to go talk to them, or just jump on it and butt heads with their jobs?"

Qubole takes care of that. We've talked about using machine images for single-server computing; they make it so analysts and data scientists who don't really want to know much about installing software don't have to, because you can pre-package a lot of that for them.

The second prong is education. Personally, I have spent a lot of time while migrating to the cloud learning as much as I can about distributed computing, about Spark and Spark internals, and about the landscape of services that are out there on offer. It takes a personal commitment, I think, to really want to learn about distributed computing and operating in the cloud. In the end, nothing really beats more computer science education; data scientists and analysts learning about software development best practices and about ops has been really useful to me and to our team.

The third prong is experience. Our team had a hard cut-off: while we were transitioning to the cloud, at one point we decided, okay, now all of our data is going to be in S3 and no one's using on-premise compute anymore.

That was probably the best thing to happen to any of us in terms of learning. We still had to support a product, which was challenging, but being forced into the deep end was the way to go for us. Sitting around for a year with the option to use some cloud services can be useful for a lot of teams, but sometimes jumping right in and getting hands-on is what works.

Then, looking forward, I have a couple of minutes to talk about some stuff I'm looking forward to. Thinking about everyone talking about serverless computing this morning, it hit me that one of the things I've been working on with my engineering teams gets into that realm a little bit. That's building out an internal API, built and containerized by the data science team, and offering up that container to our engineering teams, which they can distribute how they want, to basically ask us for the model responses that sit behind this API.

Our engineers don't have to care about anything about our models, because it's all containerized. This works really well in a streaming environment, where maybe you have one type of input, tweets or Facebook messages or Yelp reviews, constantly coming at you, a single point of input in a stream. You build an API that has a bunch of models behind it, an arbitrary number of models that are going to tell you a bunch of things about this input, and it dishes the response back to the engineers. It works really well, and it's a cool way to let data scientists build out something containerized and portable that is pretty easy to dish out to engineering teams.
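
A minimal sketch of that shape, with stand-in models and a made-up route and payload; the talk doesn't specify the real service's framework or interface, so Flask here is purely illustrative:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-ins for real fitted models; each takes text and returns a result.
MODELS = {
    "sentiment": lambda text: {"label": "positive", "score": 0.9},
    "topic":     lambda text: {"label": "food"},
}

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]  # e.g. one tweet or review off the stream
    # Fan the single input out to every model behind the API.
    return jsonify({name: fn(text) for name, fn in MODELS.items()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```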

I am working on that, and it's really cool; it's fun to write an API if you haven't done a lot of software development in your life. The last thing is just my point again: I think data science teams can really be empowered by self-service data and self-service hardware to be independent, to be more productive for their organizations, and to really become technologists themselves.

Thanks. Please fill out the survey if you can, that would be nice, and I'll take any questions if you have them.

Speaker 2: Evan, thanks. This was great. I'm just curious how long the word2vec project took you? It sounds like it was pretty new, and I'm not familiar with that library, so it was interesting. Just curious, from the start, how long it took you to do that.

Evan: Yes, for a simple prototype, not really any time, like a couple of sittings. Admittedly, I've worked with it a lot before. I didn't have any code that I'd built out to make it easier to work with, but I know the library really well, and I've spent a lot of time with Spark's machine learning. If you're coming at this from afar, and you don't know anything about the algorithm or its implementation, it's going to take you a long time. In all honesty, I was doing something I'd done before.

That made it a lot quicker, but for that project, it was a couple of sittings, something like 10 hours of total work. The ability to do it quickly is sitting on top of months of studying the algorithm and working with similar libraries. The point being, I didn't wait for any hardware, and I didn't wait for anyone to pre-process my data for me; that makes the cycle really fast. Yes.

Speaker 3: Yes, the various circles, and you were on the graph on the left side, kind of in between, I think machine learning and what was the other one? BI, machine learning.

Evan: Data engineering?

Speaker 3: Data engineering. On the BI side, I was curious who the various users are and how they’re accessing the data that you’re delivering.

Evan: Yes, totally. For BI users, it's mainly a data quality thing, since we are selling a data product: understanding the breadth of our data and various aspects about it, because that better helps us sell it and market it and understand our customer use cases. Again, we store everything in S3. Our BI/data quality team is on dedicated Qubole clusters, reading that data from S3 with Presto. They have access to use Hive too, or Spark, and we store it all in Parquet format, so it's geared towards Spark but fine with Presto too. That's what they're using.

We found that Qubole Notebooks are great in terms of sharing too. Even with no computer science background, you can build a little widget, so that once you've aggregated your distributed data and reduced it to fit on the master node, you can make a line chart, pie graph, table, doesn't matter, and people can take screenshots or pop out a chart and send it around. That's where the BI team is at.

Speaker 4: Have you used Qubole since the beginning?

Evan: No. We started our cloud transition without Qubole. Basically, in that graph I had of our software and services ecosystem, you could put EMR, Amazon Elastic MapReduce, their EC2 cluster service, right where Qubole was. We were using that. With it you can get Presto, you can get Hive, you can get Spark, and you can click a checkbox and get Zeppelin as well, but organizing it all was a next step you had to take. It wasn't all packaged together in a UI where you could say, "Okay, I spun up my cluster, now let me see the UI next to a query history," or something like Qubole has. It was just, okay, this cluster is up.

It's on a new IP, and Amazon's going to give me a link to my Zeppelin Notebook that I need to port forward to get access to because of security things. That's a lot of jumps to make, and we got to where we were building out the tooling ourselves to semi-replicate what Qubole has done. We got to a decent spot, and then we were introduced to Qubole and said, this is really what we were looking for, mainly in integrating Presto, Spark, Hive, and a visualization layer.

Speaker 5: What are the biggest things you noticed after implementing Qubole? The biggest change it has made for your teams?

Evan: Yes, for sharing data in the ad-hoc use case, it's mostly around, one, auto-scaling clusters, which are great. We used to spend a lot of time on that: we would have one of our ops people be an administrator, and we'd spend a lot of time back-and-forthing with them just to get one more node. It was a whole ordeal. We even spent time running things like Presto, Hive, and Spark on the same cluster, which is a big mistake. Then Qubole came in and said, "Hey look, just spin up a cluster for each one. You can have a cluster configuration that's nice and saved, with a name slapped on it that says 'ad hoc'," and those individuals know exactly where to go; they're not dealing with some new IP address every time. The UI layer is really helpful to non-technical or less technical BI users, which was a big help for us, because, again, software engineers and DevOps people can forget that data scientists and data analysts don't always speak that language. The packaged-together UI, along with the auto-scaling, were the biggest benefits.

Speaker 6: I have a question. Can you speak a little bit about how do you deploy the models that you guys build?

Evan: Yes, totally. Some models are now deployed with that API I was talking about at the end; that's really outside of this self-service cluster computing situation. We're doing some of that for in-stream stuff, but for the batch stuff, for deploying right now, I use Qubole's API. I can ultimately take what I was working on in the notebook to build that word2vec model, and once I decide I like it, I've pretty much written all the code for a nice pipeline. I'll compartmentalize all of my code, which is pretty generalizable, and ultimately rip it all out and write one Spark application that I can run from my own machine outside of the Qubole infrastructure. It can talk to Qubole's API and say, "Hey, here's all this code. Go spin up a cluster, put the code on all the different nodes of the cluster, run this machine learning pipeline, and drop the data off somewhere." A lot of what we're doing is batch processing, so that process works really well, especially because our endpoints start in S3 and end in S3. We have a really scalable thing, and there's just this one layer in the middle, a Spark application, that needs to run. That's how we're deploying.
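
A sketch of what that standalone batch application can look like once it's ripped out of the notebook; the paths are hypothetical, and the Qubole API call that ships the code and spins up the cluster is omitted here:

```python
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

def main():
    # S3 in, one Spark ML pipeline in the middle, S3 out.
    spark = SparkSession.builder.appName("title-tagging-batch").getOrCreate()
    titles = spark.read.parquet("s3://my-data-lake/product_titles/")
    pipeline = PipelineModel.load("s3://my-data-lake/models/tagging/")
    tagged = pipeline.transform(titles)
    tagged.write.mode("overwrite").parquet("s3://my-data-lake/title_tags/")
    spark.stop()

if __name__ == "__main__":
    main()
```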

Speaker 2: I have been on YouTube and I think I saw a talk from you on Presto.

Evan: Yes.

Speaker 2: I think it'd be interesting for the people in the room. Everybody's coming from different places with their technology, whether you're on-prem or in the cloud, and I think a tool like Presto is pretty transformative, a totally enlightening experience, when you're coming from an old-world technology, especially around BI stuff. So could you maybe share some experience you've had, what it was like being on-prem with whatever BI tools you were using at that time, and then what it's like now using something like Presto in the cloud?

Evan: Yes, totally. Presto is super fast, and that's why I like it, and it's ANSI SQL; I'm pretty sure it's pretty easy for anyone who writes SQL to adopt the query language. In our on-prem world, we were running an old version of Hive on a single, big cluster where we couldn't scale up or down, and we couldn't pick different versions of software. Picture running Hive on, let's say, one table, where you don't even need a join, you just want to group and aggregate a small single partition of it.

You've got a lot of overhead for your Hive job, and it's going to take a while. Going from that to running those same queries with Presto is orders of magnitude faster, which is great for the interactive use case, where you're not planning on this one query running for three hours and thinking really hard about the query before you write it. You're just saying, "Okay, I've got this one idea. I want an answer, and that answer is going to inform my next query." It's a really interactive way to do things. Presto is great for that, super fast, but ultimately just less fault-tolerant; that's why it doesn't get used as much for production stuff.