This post is a guest publication written by Sean Downes, Senior Data Scientist at Expedia Group.
Expedia Group is the world’s travel platform, with an extensive brand portfolio that includes some of the world’s most trusted online travel brands. We help millions of people book their flights and cruises, hotels, holiday rentals, and activities. We have over 22,000 employees and over $10 billion in revenue, so it’s a large enterprise, and data science plays a large part in our success.
I’ve been with Expedia Group since 2014, and we’ve come a long way. A few years back, some folks didn’t even know what statistical models were. John Kim, then the Chief Product Officer for Brand Expedia, established a ‘test and learn’ culture that turned the company around. The data collected through that testing was the birth of data science for Expedia Group.
As the tip of the spear, our (proto) data science team started out with collaborative filtering techniques and other similar data-driven models. These models led us to a study of the clickstream, from which we settled on representing our hotel-sort problem as a ‘learning to rank’ machine learning problem. The iterative process of feeding these algorithms caused our data needs to grow rapidly. For example, when seasonality was included in the algorithms, the number of features to compute, store, and serve grew significantly.
Owing to growing complexity, there was considerable pushback when we first started introducing data science. The challenge was to convince the old guard we needed to be data-driven. Even more importantly, we needed to be science-driven: moving from opinions and anecdotal evidence toward models that reflect the complexity of our users’ actions and intent.
First Steps to Success with Machine Learning at Expedia Group
We’ve achieved that, and the key to our success was working with product management to help ease data science into the organization. First, we set up a small ‘united’ team who focused on tackling business problems and not over-complicating data science. We slowly introduced valuable data science use cases in already data-driven areas of the business, so we didn’t have to reinvent the wheel. That meant we could show value with less complex effort, earn more projects, and build interest incrementally.
Here are a few of the primary steps we took:
- Focus on the Business. We knew we had to align data science with the organization. So we moved our DS team to a matrix structure, where lead scientists work with each business unit within Expedia Group based on subject matter expertise. For example, we have data scientists who optimize our hotel recommendations to site users, our recommendations for activities, our supplier-facing activities that help our hotel partners sell better, and our pricing and bidding systems.
- Align with Engineering. We worked with our engineers to ‘productionize’ and ‘operationalize’ data science. We developed this by first providing a particular business with relevant data sets, then setting up a pipeline to provide regular data, and from there we worked with users to get more and more machine learning models into production, especially using dynamic algorithms.
- Embrace New Technology. In terms of infrastructure, our challenge was ‘to build the new plane while flying the old one.’ When things first got started, we were just pulling data from SQL Server-type databases. We recognized we needed to gather and use much more data to provide more customer-driven insights and build more machine learning use cases.
To achieve this, our biggest step was to migrate from an on-premises data center to a cloud infrastructure using the Qubole big data platform. This meant we were able to scale out more and run models much faster. This, in turn, empowered the data science team to prove its value and thus create new projects.
The cloud afforded us vast growth in the data being stored, which opened up many new projects. Unfortunately, this also generated new challenges. When we first moved onto a cloud platform, it took us more than 48 hours to extract 24 hours of data on car searches, which is not sustainable! But as we learned more about Apache Spark, lo and behold, we were able to get it down to eight minutes per day: eight minutes instead of 48 hours. Jobs that historically would have taken days now run in seconds to minutes. This helps a lot, especially when we’re building new business models and figuring out what’s available in the data set.
Even within our team, this pattern has repeated itself a number of times. Recently a relatively new hire managed to reduce a 19-hour daily modeling pipeline to run in just over an hour. Tuning Spark performance still requires some effort and a good understanding of how your data is shaped, stored, and balanced in the cloud. But when the code and settings are optimized appropriately, it is a powerful tool.
Key Breakthrough: Data Science in the Cloud
The bottom line in all this is, well, the bottom line. What really wins over the business is when data science improves performance, increases sales, and promotes innovation.
A breakthrough came about 18 months ago, when we set up a data science initiative focused on optimizing the amount we bid to partner with hotels. The program started out taking losses, paying out more money than it was making. Once data science was applied, the program improved its efficiency and the product became more profitable.
Machine learning models are becoming important to our innovation and business success. Data science is in most of Expedia Group’s lines of business, including hotels, flights, activities, marketing, geography, and fraud monitoring. Recently we had a company-wide product review, and the whole leadership team was there asking what other areas we could help improve. So that’s where we are now.
The data science team used to be five or six people; now it’s over 30, with a planned expansion to 60. But the gold standard is that all of our activities aim to improve sales, and that’s what holds our feet to the fire.
Ensuring Continued Success
So how can we continue to ensure that data science is properly growing to create successful projects and initiatives? Here are seven key takeaways from our own experience that other data science teams can apply to help ensure success:
- Format data sets early for big data use. Persist data used for model design in flat files with human-readable column names. Parquet is a great, Spark-friendly format that should also play nice with Hadoop.
- Build an analytics source of truth. While it’s okay to have a data lake with many ‘test’ data sets, make sure to align heavy-use data sets by using some kind of ‘gold standard’ that is supported, documented, and published internally.
- Balance your data and optimize partitions. Make sure the underlying data files are stored in 50-500 MB chunks. These can be implicit or explicit partitions. If you are explicitly partitioning on a field, say ‘the_date’, make sure that the data is balanced well in that field. (For example, partitioning by country or user would not be a good idea, since the rows of data associated with those partitions are probably Pareto-distributed, which means one partition will be overwhelmingly large compared to the rest.)
- Checkpoint (save) whatever data sets will be re-used, or have multiple logical dependencies downstream. For example, I save all the data associated with activity bookings each day. This data set is then used to generate training data, data to power our model server, and other downstream analytics.
- When using Spark, use Scala for anything that resembles an ETL job. This is particularly true for UDFs and UDAFs. Scala runs in the JVM alongside Spark itself, so that code can be parallelized where Python often cannot. Python is fine for final-stage, nonassociative tasks (like training models).
- Since you should be using Scala, you should consider building your own packages with sbt, the Scala build tool. This will allow a data scientist to prescribe jobs explicitly that engineers can consume. It also helps avoid a lot of duplicate code, and helps with code review and debugging. Additionally, Scala is a compiled language (unlike Python), so compiling locally allows you to find bugs before you draw on your cloud resources.
- Consider using or extending the Transformer classes in Spark’s ML Pipelines API (spark.ml) for rapidly deploying your ETLs. It’s a built-in, monadic-style coding API and can help you plug right into modeling algorithms that are written as spark.ml Estimator classes.
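To make the partition-balance advice concrete, here is a minimal sketch in plain Python (standard library only, not Spark) of the kind of check you might run on a sample before committing to a partition field. The function names `partition_skew` and `target_file_count`, the 128 MB default, and the sample row counts are all illustrative assumptions, not taken from Expedia's pipelines.

```python
from collections import Counter


def partition_skew(keys):
    """Return (largest_share, max_over_mean) for a candidate partition column.

    largest_share: fraction of all rows that land in the biggest partition.
    max_over_mean: size of the biggest partition relative to the average one.
    """
    counts = Counter(keys)
    if not counts:
        raise ValueError("no rows to inspect")
    total = sum(counts.values())
    largest = max(counts.values())
    mean = total / len(counts)
    return largest / total, largest / mean


def target_file_count(total_bytes, target_mb=128):
    """Number of files needed so each file is at most target_mb in size."""
    target_bytes = target_mb * 1024 * 1024
    return max(1, -(-total_bytes // target_bytes))  # ceiling division


# Made-up sample of 300 rows: evenly spread by date, heavily skewed by country.
dates = ["2019-01-01"] * 100 + ["2019-01-02"] * 100 + ["2019-01-03"] * 100
countries = ["US"] * 240 + ["GB"] * 40 + ["FR"] * 20

date_share, date_ratio = partition_skew(dates)
country_share, country_ratio = partition_skew(countries)

print(f"the_date: largest partition holds {date_share:.0%} of rows, {date_ratio:.1f}x the mean")
print(f"country:  largest partition holds {country_share:.0%} of rows, {country_ratio:.1f}x the mean")
print("files for 10 GiB at 128 MB each:", target_file_count(10 * 1024**3))
```

In this toy sample the date field is perfectly balanced, while the largest country partition holds 80% of the rows. A ratio well above 1 is a hint that the field would make a poor explicit partition key. What counts as "too skewed" is a judgment call that depends on your cluster and data volumes.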
By incorporating these ideas, you will set your organization up for agility and scale. Clarity from good data management is the single most important factor in speed. It will keep your scientists focused on building models and pulling meaningful insights, not just on the data itself.