Sean Downes – Data Platforms 2017
Sean Downes: I feel like I’ve already coalesced on what some of the really interesting problems are going to be. We can talk all about that later, but for now, I’ve been asked to give you some insight into what we’re doing at Expedia. I know this is a Tech Talk, but I pitched this talk at the level of somebody who maybe has some authority in the company and is interested in moving to the cloud, so my fiducial listener is maybe not quite tech-savvy yet on a lot of these details. But I’m all about the tech stuff.
If you want to ask very specific questions, or you want to drill down deep, feel free, we’ll get to it. Let’s set up the problem, the user story for someone coming into this conference or to a cloud-based system. You’ve been asked to bring your infrastructure from your company up into the cloud, and you’ve been promised all kinds of really great things. For example, let’s get rid of the data silos. Let’s give everyone access to the individual data sets. Let’s democratize the machine learning that goes with it. Machine learning keeps coming up over and over again.
There’s a lot of strong motivation to get out there. It’s understandable to want to get yourself on the cloud for machine learning. Maybe, like me, you were brought in halfway and the process has already started, so there’s another possible use case for those of you out here: the data lake you were promised is actually more of a data swamp. It’s a mess, it’s impenetrable, you have no idea what’s going on. You’re here for help.
I’m not sure that what I have to say will help you. I sure hope it does, but I think in aggregate maybe we can talk about and assemble some strategies for dealing with this thing. One question is, what do you mean, swamp? Let’s get real: what really happens, nuts and bolts, when you start putting everything in one big pot and say, okay, here’s your data, now go and play?
You get a mess, just like a really messy warehouse. What are some abstract things that we typically think about, at least as a tech company, an Internet company? We log people’s logins, we talk about whether or not they purchased something. What was impressed? What did they see on the website? Did they click on something? Did they hover over something? Now you can even see how far they scrolled down the website and where they spent their time. These kinds of data are pretty abstract and there are pretty well-understood means of collecting them, but it’s still a mess because of all the engineers that help you set up a website that has all this stuff.
For Expedia, we’re a travel company. One of the unique things about Expedia is we have not only hotels and planes, but also trains and also boats and also rental cars and we can even tell you things to do. There’s a whole lot of different engineering teams that go into each and every single product or service, so before you get the chance to click on something, let alone purchase it, some untold number of micro-services have gone through.
Each one of those micro-services triggered a log, and each one of those logs is saved. Lo and behold, as has been brought up earlier today, sometimes you go into those logs and you find interesting things. We did this, for example, in the hotel sort logs: we dug into them and realized, wow, we can do some predictive analytics just based on what people did. There’s that, but then every line of business– I guess that’s what I would just point out, every line of business has its own structure. What I mean by that is intellectual structure.
Take a hotel. A hotel is a very specific item in a very specific place, but when you purchase a flight you have one-way, round-trip, and then, here’s some industry lingo if you want to take home something to impress your friends: open jaw, OJ. Say you start off at JFK, you go to LAX, and you come back to LaGuardia; that would be an open jaw. It’s not exactly closed. Or you can talk about multi-city stuff. These data structures are different.
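To make the flight categories concrete, here is a minimal sketch of how an itinerary might be labeled from its legs. The `Leg` class and `classify_itinerary` function are hypothetical names for illustration, not Expedia's data model, and the rule is deliberately simple: any two-leg trip that does not close on itself is called an open jaw.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Leg:
    origin: str       # airport code where the leg departs
    destination: str  # airport code where the leg arrives


def classify_itinerary(legs: List[Leg]) -> str:
    """Label an itinerary with the categories named in the talk."""
    if not legs:
        raise ValueError("an itinerary needs at least one leg")
    if len(legs) == 1:
        return "one-way"
    if len(legs) == 2:
        out, back = legs
        # A round trip closes: back from where you landed, home to where you started.
        if out.destination == back.origin and back.destination == out.origin:
            return "round-trip"
        return "open-jaw"
    return "multi-city"


# The example from the talk: JFK to LAX, then LAX back to LaGuardia.
trip = [Leg("JFK", "LAX"), Leg("LAX", "LGA")]
print(classify_itinerary(trip))  # open-jaw
```

A hotel booking has no equivalent of `legs` at all, which is the sense in which the two lines of business need different structures.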
It doesn’t fit into this nice paradigm, so you have to deal with a slightly more complicated abstract level of infrastructure, which adds to the mess, the complexity, if you like. Then finally, once you have all of this set up, you want to be testing. The data you’re trying to model, the pipelines you’re trying to set up, are constantly in flux because maybe you’re not just testing algorithms. Maybe you want to test: do I want smiley faces, do I want frowny faces, do I want blue smiley faces? For all these things, you want to find some new way to track that, so this is a mess.
Another way to phrase the problem is so now they’ve given you all of this data. Can you please turn this junk, this huge storeroom of mess into cash. They give you some buzzwords, like Spark, Hadoop, which we’ve been using a lot of. This is why we’re here today. To talk about data platforms so that we can do data science to make money. This is by way of a preview of how I’m going to organize the talk.
Let me give you a little bit of context, a little bit of a disclaimer about the things I have to say. I’m going to give you a lightning review of data platforms. Partially because I assume that some of you are not on a data platform. Partially because it tells my own story, and helps give you further context on where I come from and my perspective on things. Partially just to have three points.
You got one of the jokes, awesome. All right cool, so context disclaimer. I’m an academic. I come from a theoretical physics background. I am not a machine-learning PhD. I’m a PhD. in physics. In that sense, the way I think about things is always in terms of spherical cows. Like on the grand scheme of items, from the ways of thinking about things. We have theoretical physicists on one end, and you have practicing economists on the other end.
I’m way over here, so everything I have to say is going to be modeled in that direction. If something sounds really abstract, although I’ve noticed there’s a lot of– Like machine learning is a great abstract word that people have been throwing around today. If something sounds abstract, probably is, and we can dig into it and talk about it.
At Expedia, we still have a lot of work to do. We’re still trying to completely move ourselves into the cloud, so this is not something that by any means we’re done with, so definitely open to suggestions.
I am going to assume that you’re going to take my opinion as one of an ensemble of opinions that you’ve collected over this workshop today. Before we get to the meat of the talk let me just do a quick lightning review of data platforms. Everyone remembers queuing. Has anyone here worked with supercomputers? Ever– Yes, so you remember– I went to graduate school in Texas. There were a couple of really big high-performance computers that we had access to at Texas A&M. Lone Star was one of them.
You had to stay up really late because we were first-year grad students, and we didn’t have the appropriate– What would you call it, queue priority. We would have to send up our code at 3:00 in the morning in order to get it processed, and this was just to do homework. There was a lot of waiting around, a lot of public consumption of the same resources.
It was great to come to industry and discover that, hey, we have our own. We have our own supercomputers and there’s multiple super– Wow, we got two buildings. One of them actually is in Phoenix, so it’s cool to visit.
In some sense that was nice, especially considering how our team, the team that evolved into the data science team at Expedia, really started pushing Hive and Hadoop early on. At first, it was awesome, and then you start to realize that you’ve got a bunch of business intelligence users. It’s fine, it’s good. It’s good to democratize things, but then they started running table scans of the entire data set. They don’t know what a partition is, until somebody finally said, no, you’re not allowed to run this query unless there’s a where clause with a predicate. Oh, boy.
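The guard described above can be sketched as a pre-flight check on the query text. This is an illustrative assumption, not the mechanism Expedia used: the function name, the `searches` table, and the `search_date` partition column are all hypothetical, and a crude text match like this is no substitute for a real SQL parser or the query engine's own strict-mode settings.

```python
import re


def has_partition_filter(sql: str, partition_column: str) -> bool:
    """Return True if the text after WHERE mentions the partition column.

    Crude text check for illustration only; it does not parse SQL.
    """
    match = re.search(r"\bwhere\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return False  # no WHERE clause at all: this would be a full table scan
    where_clause = match.group(1)
    pattern = r"\b" + re.escape(partition_column) + r"\b"
    return re.search(pattern, where_clause, flags=re.IGNORECASE) is not None


# Hypothetical table and partition column.
print(has_partition_filter("SELECT * FROM searches", "search_date"))  # False
print(has_partition_filter(
    "SELECT * FROM searches WHERE search_date = '2017-05-01'", "search_date"
))  # True
```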
Just to give you some further context, other things that we run into. I talked about micro-service logs that tend to be useful for things, especially for debugging. In the micro-service log that we have that deals with sorting, each row is three megabytes. Each row. Each row represents one user search, so if you’re interested in the past day’s worth of searches and you’re trying to query that, maybe you can get a result, but if you’re asking about something on a time scale of weeks or months, forget about it.
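A back-of-the-envelope calculation shows why the time window matters so much at three megabytes per row. The row size comes from the talk; the number of searches per day below is a made-up placeholder, since the talk gives no figure, so only the scaling is meaningful.

```python
ROW_SIZE_MB = 3  # from the talk: each row (one user search) is about 3 MB


def scan_size_tb(searches_per_day: int, days: int) -> float:
    """Approximate data scanned, in terabytes (taking 1 TB = 1,000,000 MB)."""
    return searches_per_day * days * ROW_SIZE_MB / 1_000_000


# Hypothetical traffic figure, chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_SEARCHES_PER_DAY = 1_000_000

for days in (1, 7, 90):
    size = scan_size_tb(HYPOTHETICAL_SEARCHES_PER_DAY, days)
    print(f"{days:>3} day(s): about {size:.0f} TB to scan")
```

Whatever the real traffic is, the scan grows linearly with the window, so a query that is barely feasible for one day is roughly ninety times larger for three months.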
Even that is not good enough. Now we’re in the cloud, where everything’s virtualized. You can call virtual things from virtual things, but the point is that you don’t– This was supposed to be the slide where I sell you on the cloud, but people have done a lot of that. Let me just hit the high points just for the sake of the video camera, if you like.
One is, you can assign tasks their own virtual hardware. For example, the search logs that I just mentioned. We can spin up our own Presto cluster and dedicate that Presto cluster to, say, three months of data. That is the only thing that cluster has to do, and so when we run queries against it, we can run these queries against this massive table to debug things or see how the search results are doing.
We can do it quickly because only one or two people need to do this at a time. That’s pretty awesome. You can expand, sorry, a typo, expand and contract on demand. That’s like the killer app. That’s how Qubole came in, that’s why our engineering team welcomed them with open arms, because not only can they expand on demand but they can contract and save a lot of money.
We’ve observed a lot of that, so that’s awesome. Something that I feel doesn’t really get appreciated very much, is this concept like real-time hot swapping. I can develop a productionalized version and have it running. I can do it in parallel and then I just turn this one off. I don’t have to build up a new infrastructure. I can have as many instances of my production code going as I want. That’s cool. Software updates built and blah blah blah, you know all this.
Where do we go from here? This is the meat of the talk, these are the things that I really wanted to talk about today. Again, in the spirit of this being my opinion, and be careful, this is a theoretical physicist’s opinion. These are my idiosyncratic organizing principles, again with a typo, gosh, sorry. IOPS. I just got really obsessed with this acronym, that’s why I haven’t really paid attention to the spelling.
These are the IOPS, the organizing principles that I think are important. These are things that I worry about, these are things that I yell at people or at least yell with people about. Hopefully, through these individual items, I can talk to you about some more kind of concrete examples of stuff that we ran into. Like I said, feel free to jump in if you have questions or at the end of course. I’ll leave plenty of time for questions and if you don’t take it, I’m going to keep talking just to see you know.
All right, so clarity. Data clarity is key, so this is the whole thing. Let’s break down the silos, let’s put all the data into one big giant lake. We did that and it is a mess. It is a mess. Where is the data? Where’s what data? Car data. Car data specific to pickup location, drop-off location, geography: where is that? Who owns what fields? Let’s get some context going.
Imagine if you will, you have some infrastructure, some setup.
For every page event that happens on your website, you log it in some way, and you probably log it in some way that’s structured so at least computers can understand it. The fancy way to do that these days is with JSON. Again, this is a huge nested representation of all the possible objects that could happen. There are supposed to be logs for each one of those, so for the chunk of that JSON schema that’s tied to, say, car rentals, you’ve got a chunk that specifically represents pickup location geography details.
Who owns that field? The car engineers do. Not the data folk, not the cloud folk, not the machine learning folk, but the car people own that. But who in the car team can I contact when one of those fields goes missing? This is where you see these are problems that you run into. Like, what does this field even mean? There’s also a lot of holdover, I know, from the old Microsoft SQL days, where people had a size limit on how many characters a field name could be.
You get this practice of removing all the vowels and you look at– What does it even mean. If you’re looking for stupid best practices that will actually save you a lot of time and headache in the future, especially if you don’t want to explain the table to somebody, just use human-readable language when you’re naming fields. That’s something you can definitely do. Fields, classes instances, file names, directories, S3 buckets, human-readable. Why not, we have the space for it, it doesn’t matter.
Where did this field go? Every once in a while something will just disappear. One of the things that you might worry about is case sensitivity, CamelCase versus not CamelCase, but what I’m thinking of in particular is– This might be too in the weeds, but I’m just going to give you a good example that seems sufficiently abstract, like hopscotch. Is that one word or two words? Because it matters if you’re using CamelCase. One spelling returns null; the other returns a value. Things to worry about, especially when designing your names. Things to think about.
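The hopscotch problem is easy to reproduce with any case-sensitive lookup. The sketch below uses a plain Python dictionary as a stand-in for a record; the field name and the `get_field` helper are hypothetical, and a case-insensitive lookup is only one possible mitigation (agreeing on a naming convention up front is the other).

```python
record = {"hopscotch": 42}  # hypothetical field, written as one word

# Keys are case-sensitive, so the two-word CamelCase spelling silently misses.
print(record.get("hopScotch"))  # None
print(record.get("hopscotch"))  # 42


def get_field(row: dict, name: str):
    """Look up a field ignoring case; raise if more than one key matches."""
    matches = [key for key in row if key.lower() == name.lower()]
    if len(matches) > 1:
        raise KeyError(f"ambiguous field name {name!r}: {matches}")
    return row[matches[0]] if matches else None


print(get_field(record, "hopScotch"))  # 42
```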
Why is this field null? Who do I talk to? Who do I call on? Again, with the spherical cow, but publish this internally. If you’re going to be the person in charge of, say– You might want to design this, so this is again assuming you’re going to design your infrastructure in some way. If you’re going to build a massive data lake and you’re going to break down all the silos, at least record what silos were where and who’s responsible for inserting things.
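One lightweight way to "publish this internally" is a registry that maps parts of the log schema to the team that writes them. The sketch below is an assumption about what such a registry could look like; the paths, team names, and email addresses are invented for illustration.

```python
from typing import Dict, Optional

# Hypothetical registry: JSON path prefix -> owning team and contact.
FIELD_OWNERS: Dict[str, Dict[str, str]] = {
    "car": {"team": "car-engineering", "contact": "car-eng@example.com"},
    "car.pickup_location": {"team": "car-geo", "contact": "car-geo@example.com"},
    "hotel": {"team": "hotel-engineering", "contact": "hotel-eng@example.com"},
}


def find_owner(field_path: str) -> Optional[Dict[str, str]]:
    """Return the owner registered for the longest matching path prefix."""
    parts = field_path.split(".")
    while parts:
        owner = FIELD_OWNERS.get(".".join(parts))
        if owner is not None:
            return owner
        parts.pop()  # fall back to the parent path
    return None


# Who do I call when the pickup latitude goes null?
print(find_owner("car.pickup_location.geography.latitude"))
```

Falling back to the parent path means a new nested field is covered by its section's owner as soon as it appears, and a `None` result flags data that nobody has claimed.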
If you think about it in terms of hydrology. I’m in Seattle, Pacific Northwest. We have a lot of rivers, we have a lot of glaciers we have a lot of cascades. Think about it as hydrology. What is coming into your lake, your data lake and what is coming out. Don’t let junk in, it’s better to have no data than bad data.
More on data KOI, here’s another thing. This is not a critique, this is not a best-practice thing. This is just what happened. Suppose you’ve got some nested JSON and you’ve got small tiny little files that are gzipped. This has just happened because it’s what the engineers in your system designed, and maybe it has a really good purpose. That’s great, unless you’re using Spark, in which case it’s a mess because Spark doesn’t like those little files. It likes big chunks.
It doesn’t like these weird compression schemes where you can’t split the files up. Just tune the type of files that you have to the task at hand. Charles is in the room. Charles is going to have a really great talk, where he’s going to tell us all about this, so go and see his talk. I’m very excited about this because this is one of those things that came out of when we started working with Qubole, when we started working with people who work on Spark and that sort of thing.
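As a rough illustration of tuning files to the task, the sketch below merges many small gzipped JSON-lines files into fewer, larger uncompressed files, which a reader can split. It is a standard-library stand-in written for this transcript, not the approach Charles or Qubole used; the function name, output naming scheme, and default chunk size are assumptions, and in practice a columnar format or a splittable codec would be a more likely target.

```python
import gzip
from pathlib import Path
from typing import Iterable, List


def compact_gzip_jsonl(
    small_files: Iterable[Path], out_dir: Path, lines_per_file: int = 100_000
) -> List[Path]:
    """Merge small gzipped JSON-lines files into fewer, larger plain files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    out = None
    count = 0
    try:
        for path in small_files:
            with gzip.open(path, "rt", encoding="utf-8") as src:
                for line in src:
                    # Start a new output file on first use or when the current one is full.
                    if out is None or count >= lines_per_file:
                        if out is not None:
                            out.close()
                        target = out_dir / f"part-{len(outputs):05d}.jsonl"
                        out = open(target, "w", encoding="utf-8")
                        outputs.append(target)
                        count = 0
                    out.write(line if line.endswith("\n") else line + "\n")
                    count += 1
    finally:
        if out is not None:
            out.close()
    return outputs
```

Called as `compact_gzip_jsonl(sorted(Path("logs").glob("*.json.gz")), Path("compacted"))`, where both directory names are placeholders, it returns the list of files it wrote.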
When we first started doing our car sort models, we were trying to pull impressions data. Data that we can train a machine learning model on. One day’s worth of data took two days to pull [laughs]. You’re not going to get anywhere. It now pulls in seconds, thanks to a lot of work that Charles did, but also thanks to doing other things like specifying the schema when you’re loading in Parquet files, that sort of thing.
It has a two-orders-of-magnitude influence, at least on our workflow, so be careful with your data and how you organize your data, and think about it. That was the first principle, clarity. Clarity is really important and that’s why I put it first.
Second, engineers are not data scientists. They have different– Are there any data scientists in the room? Yes. Any engineers in the room? Yes, right. To the data scientists: engineers are not data scientists. They have different priorities than you, they have different philosophies than you. They’re like, why would you need to do that? Here’s one thing– Mike is also in the room and I remember distinctly, like, thank you for setting this up for us.
This is one of the things that was like a glaring omission when it came to things that data scientists need. Scratch space, we moved on to the cloud, hey, I’m trying to build up a training set so I can train a model on it, where do I save my data? What do you mean? It took a long time for us to establish just to carve out some space in S3 that we could just create a bucket, put data in.
To a data scientist you’re like, what do you mean, but the engineers just weren’t thinking that way. Scratch space, and some other things like– Sorry in advance, I’m sure a lot of you have heard me moan about this before, but these are other things that you really need to consider when you’re moving to a cloud. This is incidental complexity. So, cluster bootstrap permissions, because as a data scientist I want to try out new algorithms all the time.
I need to cycle through maybe the first– I don’t have time to go and talk to somebody and say, hey, by the way, I’d like to install this package but I don’t know what the parameters are and I don’t know what I would really need. I have it working on my own computer, but I know that you’re doing distributed stuff. I don’t have time for that, so it would be much easier if I could cycle through the permissions myself. That’s something to consider.
Access to S3 buckets. It’s great that you have a system that’s– You have a test account and an integration account and a prod account and all that stuff. If you need to find three separate– If you have to chase email after email just to get access to data, it’s just not going to happen. A lot of the stuff that we did when we first started out, before we moved to the cloud for real– My boss, Dan Friedman, VP of data science at Expedia, likes to say we were scrappy. We would go in and we would arm-wrestle engineers down and say, we need this data.
The Omniture guys, anyone worked with Omniture before? It’s like a site analytics tool; they store the data about what people clicked on. We would wrestle on that, we’re like, we need this list put in Omniture somehow, get it there. We would get it and then we’d pull the data set ourselves and we would force it to happen, we’d carve out space. But with the cloud, with all these great services, we don’t want to worry about any of this anymore, and we don’t have to, so be careful about S3 buckets. I’ll have a little bit more to say about that in a second.
Sandbox clusters, same thing. I need a little cluster to practice on, to fine-tune things before I generate a really big productionalized environment. Maybe from an engineering perspective you think about this in terms of test and int. I’ll get to this in another point later, but it isn’t immediately obvious to us what those words mean. Yes? One thing we need to do is share notebooks across accounts. Expedia is a conglomerate, we own lots of companies, and for good reason. I feel like we have a lot of benefits to add.
One of the companies we’ve recently acquired is HomeAway. HomeAway is a vacation rentals company. HomeAway does a lot of interesting things with their sort as well, or at least they’re trying to. We’re trying to collaborate because we’ve got these years of experience with hotel sort. You clearly need some help with your hotel sort. Sorry guys.
You could probably benefit with the way it is. We could benefit from the exchange, but they’re on a different account. They’re on a completely different Amazon account. How do we exchange notebooks? How can we share things? That’s something that you need to add functionality for. So how did we solve this? Qubole links to GitHub. Awesome, thank you.
Final point about this. We are data scientists. We do not speak IAM. We do not speak roles, IAM roles [laughs]. You cannot simply say, if you want access, fill out this form. What does that even mean? Okay? Careful.
Another thing– I’ll save it for one of my next points. One other thing, and I’m sorry I don’t have the link for you. I can hunt it down, but there’s a really great article (whose author I can’t name right now, apologies) called ‘Engineers Should Not Write ETL’; just look it up. It’s a really great article. Write your own pipeline. The moral of the story is you need to control the data that goes into your model. You should write your own pipeline. Let the engineers do more abstract, fun things.
Otherwise, you kind of have to sit down and explain to the engineers what data science is. In fact you should do this anyway, so never mind the “otherwise”: explain to engineers what data science is. That way, when you need to find that pickup location for a car, the geography information, say you want to get the lat/long of where Enterprise Rent-A-Car is in Ravenna in Seattle, and they say, “Well, who cares?”
You say, “Uh-huh, but I could make $2 million a year if I only had that data.” Now all of a sudden you’re helping the company make lots of money. You need to explain data science to engineers. PMs are not data scientists. Any PMs in the house? By P I mean product or project. Okay, good, sorry.
Once upon a time in a flight data lake (we had a data lake specifically devoted to flights; this is a little bit off the car topic), only 10% of search impressions were recorded. What does that mean? What I mean to say is that the cheapest 10% of all possible things a user could have purchased were recorded. Now imagine my position, I’m a data scientist–