Karthik Breakout Session – Data Platforms 2017
Karthik Subramaniam: My name is Karthik Subramaniam. I head up our data platform team that’s part of our broader data engineering and science organization. Quick agenda over here. I want to first give you an intro to Under Armour Connected Fitness, how we got here, and what the journey looked like for us, which is going to be relevant to how we thought about building out our data platform.
I’ll also give you an overview of the different data assets that we’re working with today, which is pretty interesting and exciting. I’ll talk about the data platform that we built. Finally, the topic of this talk is really about, how do we enable intelligent features on top of our data platform?
A quick history: does anybody here in the room use any of our apps? MyFitnessPal, MapMyFitness, Record, and Endomondo? Great. We all got here through a series of acquisitions on Under Armour’s side.
Back in 2013, Under Armour acquired MapMyFitness. Then in 2015, they acquired us, or rather the company that I came out of, which is MyFitnessPal. At the same time, they also acquired a company based out of Copenhagen called Endomondo, which is a really popular running app in most of Europe. We also got into the space of connected devices. We released a wearable band and a heart rate monitor. Together we have over 200 million registered users and we have a lot of data around health and fitness.
When we think about our data, we usually categorize it into one of the following segments. This also lines up with some of the apps and the capabilities that the apps enable. The big one is really nutrition. This came in from the MyFitnessPal acquisition. With MyFitnessPal, we have a really good understanding of food, users’ behavior around food, and the seasonality when it comes to food.
In fact, one of our busiest times of the year is actually New Year’s. You can imagine everyone makes a New Year’s resolution to eat better. Usually, the Monday following New Year’s we see our largest spike in users, which is great. That means Christmas time sucks for us, because we try to pack in a bunch of features that we want the new set of users to get exposed to. We’re always trying to get a whole bunch of features packed in right before the holidays.
We also have a really good understanding of activity. This is steps, how people are being active throughout the day. All this comes in through various connected devices: the Fitbits of the world, the Garmins of the world. We also have a really good understanding of workouts. People can log different workouts that they’re doing. Finally, we also have a really good understanding of how people are sleeping. This comes in either through them logging it directly into the app or through the wearable device that they have.
This is the slide that really gets me excited and energized about working at Under Armour. We have a really interesting data set, and we actually have a pretty holistic data set around a user’s health and fitness and all other aspects of that. We talked about food. We have over 20 billion, it’s actually closer to 30 billion, food entries. We also have about 7 million food items and that’s growing.
We have a really good understanding of, based on 120 million users, how they’re interacting with food and their behavior around food. I also mentioned we have really good data around workouts. We have 600 different types of indoor and outdoor workouts that are logged. We also get a lot of activity throughout the day. We see sleep patterns. With the Under Armour acquisition, we now also have a good understanding of retail and e-commerce data.
A couple of the questions that I got asked throughout the day were: why Under Armour? Why did Under Armour want to get into the business of health and fitness? The vision that I think we’re all working towards is, if you think about your ultimate wearable, it’s really the clothes that you’re wearing. Under Armour is a very innovative company when it comes to the kind of material that they use in their clothes.
They see this as a great extension: what if you can build sensors into your clothes, or even go down the path of giving a user a really good understanding of what they’re wearing and how they’re reacting? If they are a performance athlete training for a marathon, how can we bake that into the clothes, read the sensors, and then give them insights on demand?
That’s a really futuristic vision, so let’s just keep that within the room [laughs]. We’re not there yet, but what we did do is launch the first of our embeddable shoes. Anybody got a Gemini or heard of the Gemini shoes by any chance? Okay, you should all go out and get one. It integrates really well with our apps. In fact, it’s an embeddable chip that’s put into the new Gemini series of shoes from Under Armour.
What that does is it actually allows us to get a really good understanding of, and track, certain things around running and endurance that are better tracked through a shoe versus a band. We’re getting different kinds of health and fitness data, different kinds of statistics, and we’re able to gather all this information and feed that back to our users.
Some of the characteristics of the data that we deal with. Time series is really important to us, especially when it comes to activity. We get time series data throughout the day. We get data from GPS, geospatial time series data, which is also really interesting. Our volumes are crazy. Just with MyFitnessPal alone, we get between 20 and 30 million food entries in a day. We also have bursts of data throughout the year. We also deal with different kinds of seasonality. I was talking about the New Year’s resolution. That’s the expected burst, but we also have different events that happen and all of a sudden we see bursts in our activity data.
That’s cool, this is great for us [laughs]. I have to throw in some of our brand athletes names in there, so.
The big focus for us is definitely after acquisition. We have all these apps that were startups in their own right, and they have all this siloed data, and the apps are generating data to feed their users and their ecosystems. One of the big focuses for us from early on was: how do we get at each other’s data, and how do we help the data coming from MapMyFitness drive insights for a user about what they should eat and how they should log, and vice versa?
One of the first things that we did was the data team that we have became a horizontal team across all the apps, and one of our first goals was to centralize our data so we can get access to it. Seems like common sense, but it took us a beat to figure that out. It was an easy step to take.
One of the first things that I wanted to evangelize within our team and across our product teams and our engineering teams is the value of the platform that allows us to do the kind of things we’re talking about. We want a platform to really provide the insights to our product managers and our analysts and to other engineers. We really want to allow the community of teams that we have to understand our behavior, user behavior around these different apps. Finally, we want to build our intelligent features.
That was also the order of priorities we wanted to work things in. We wanted to make sure we drive engagement. We wanted to make sure we’re building the right products and then we also want to make sure we’re introducing data products into these intelligent features so that we can really continue this iterative cycle of engaging our users.
How do we do that, or how are we trying to do that? This is a really high-level architecture for us. There are two main components of this. We have Kafka as one of our staple elements for our data infrastructure, and S3. Why Kafka? It’s fast, it’s scalable, it’s fault tolerant, it’s reliable, all the isms. We also made a decision pretty early on to work with our engineering teams to move to more of a microservices architecture.
We wanted them to start emitting events for all kinds of different interactions that are happening. Then we wanted to centralize all this data. We wanted the apps, the MyFitnessPal app, the MapMyFitness apps, the Endomondo apps, to be emitting events to Kafka as our central messaging bus, and then we wanted to centralize all that data into S3 so we can act on it.
You’ll notice a bunch of circles around it. I think you’ve seen something similar to this earlier today from the other Karthik. The important takeaway here is that latency, how we want to deal with our data and how quickly we need to respond to it, was an important component. If we wanted to build something closer to reactive systems, we can move to the left and consume streams off of Kafka. If we wanted to do any aggregation of our data and batch processing, where we can deal with greater latencies, we can move to the right. This gave us the flexibility to do either.
Finally, we would actually take this data and then push it out to Redshift or other business intelligence tools that we have for analytic purposes and so on.
If you take a little bit of a closer look, this is the view into our batch processing flow. The idea is we wanted to have the apps and services emit events. We also have a lot of third-party providers. We do have a premium model, so we have data that’s going to Google, Stripe, Zuora, and we want to pull all the third-party payment data in. We also have data that’s still hanging out in a little bit of our legacy relational data stores.
We wanted to figure out ways to snapshot data in from either third-party APIs or directly from the data sources, but the primary method for us to do data replication was going to be through the Kafka message bus. We wanted to move all our microservices and our apps to emitting events, and then all that gets archived into S3. From there we can decouple compute and storage, so we can use different compute technologies to interact with that data and then do some interesting things with it.
What we were trying to solve for really is automation. We wanted to move away from the manual ETLs and manual pipelines that the teams were building for their individual apps and stacks. We wanted to standardize the way that we store data and how we access data. We wanted to make sure things were consistent in how we were pulling all their data in and then also surfacing it back up for all of our different use cases.
In order to do this, we developed a common workflow. The common workflow is really our approach for how we take raw data that’s coming in from the apps and the events, act on it, and then push it out for the different use cases that we have. There’s a lot of stuff going on in the slide, but it’s actually best to look at it from the bottom up. I should probably redo this slide so it’s top-down, but anyway.
We have a lot of users of this data. We have analytic use cases, we want to build different data products, we also have various data science use cases to do time series analysis and then we also have data exploration that takes place.
In order to enable all these different components we actually land our data in Avro format. We do some transforms on it, we actually take it to a derived state. We attach a schema to it at this time. From there, we can actually export it out into different formats, whether CSV, JSON, Parquet.
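The land, derive, export flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the schema, field names, and helper functions are hypothetical, plain dictionaries stand in for decoded Avro records, and only CSV and JSON export are shown because Avro and Parquet need third-party libraries.

```python
import csv
import io
import json

# Hypothetical schema for a derived food-entry record: (field name, type).
SCHEMA = [("user_id", int), ("food", str), ("calories", float)]


def to_derived(raw_records):
    """Attach the schema: keep only known fields and cast them to their types."""
    return [{name: cast(rec[name]) for name, cast in SCHEMA} for rec in raw_records]


def export_csv(rows):
    """Export derived rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=[name for name, _ in SCHEMA], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def export_json_lines(rows):
    """Export derived rows as newline-delimited JSON."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Raw records as they might land, with every value still a string.
raw = [{"user_id": "42", "food": "banana", "calories": "105", "extra": "ignored"}]
derived = to_derived(raw)
csv_text = export_csv(derived)
json_text = export_json_lines(derived)
```

The point of the derived step is that every downstream export reads from one typed representation, so adding another output format does not touch the raw data.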
We also build out what’s called our source of truth. When we talk about events data, it’s immutable, atomic data, and what we want to do is compact it to give a single source of truth. An example of this: let’s say you as a user are logging a food item, so that shows up as a fact, an immutable event that says food entry created. You then go in and update the food, so that shows up as an update event sometime later. Then you go in and delete the food item, and that shows up as a delete later.
If you want the latest view of that user’s food state, you have to do a compaction of all three of those events to provide that view, and that’s what our source of truth does. We run a multi-stage compaction step, and that’s a whole other presentation we could talk about, but we have that available. You can start running Presto queries against that, but we can also run Presto queries against the CRUD-like events that are coming in from the different apps and services.
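As a minimal sketch of the compaction idea, assuming a single in-memory pass rather than the multi-stage job mentioned above: the event type names, the `entry_id` and `timestamp` fields, and the function name are all hypothetical, chosen only to illustrate how create, update, and delete events fold into a latest-state view.

```python
from typing import Dict, Iterable


def compact_food_entries(events: Iterable[dict]) -> Dict[str, dict]:
    """Fold immutable create/update/delete events into the latest state per entry."""
    state: Dict[str, dict] = {}
    # Apply events in timestamp order so that later events win.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        entry_id = event["entry_id"]
        kind = event["type"]
        if kind == "food_entry_created":
            state[entry_id] = dict(event["payload"])
        elif kind == "food_entry_updated":
            # An update for an entry we never saw created is treated as an upsert.
            state.setdefault(entry_id, {}).update(event["payload"])
        elif kind == "food_entry_deleted":
            state.pop(entry_id, None)
    return state


events = [
    {"entry_id": "e1", "type": "food_entry_created", "timestamp": 1,
     "payload": {"food": "banana", "servings": 1}},
    {"entry_id": "e1", "type": "food_entry_updated", "timestamp": 2,
     "payload": {"servings": 2}},
    {"entry_id": "e2", "type": "food_entry_created", "timestamp": 3,
     "payload": {"food": "cobb salad", "servings": 1}},
    {"entry_id": "e2", "type": "food_entry_deleted", "timestamp": 4,
     "payload": {}},
]
latest = compact_food_entries(events)
```

Here the first entry survives with its updated servings and the second entry disappears, which is the "latest view" a query against the source of truth would see, while the raw events remain available for queries that care about history.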
That was our common workflow and I want to use that as a baseline to talk about how we built our intelligent feature. Show of hands how many folks have heard of restaurant logging within MyFitnessPal? Cool, very few. That’s what I was afraid of. It’s actually one of my favorite features. Just to set the stage for this, we launched this feature and it’s sadly hidden a little bit, but it’s there.
Within MyFitnessPal, when you go to log a food item, there’s a little drop pin icon that shows up, and if you click on that, based on GPS location it’ll tell you all the restaurants that are around you. You can then click on a restaurant. Through third-party providers we go and scrape the menu data for that restaurant, so we bring in menu data and then we can actually match it against our food database and tell you calorie information for that menu item. Something that’s never been done before.
We wanted to make sure when we did that that it was accurate. Our food data is user-generated content; if you search for the word banana you’re going to see hundreds of different entries for bananas. How do we surface the right banana to you during the search? That’s the problem we were trying to solve. We wanted to give the user the food item most relevant to the restaurant’s menu item that’s listed. This is where we took advantage of our platform, and of machine learning on top of our platform, to surface that.
The example here we’re talking about a Cobb salad and a Caesar salad and if you actually click on the item you’ll see all the calorie information for the Cobb salad and we also tell you which Cobb salad we matched it to from our food database. We also give the user the ability to say, “This doesn’t look accurate,” and we take that in as a signal so it’ll allow them to search for the right Cobb salad and pick the right Cobb salad, which I’ll talk about in a second.
In order to do this, we actually instrument the heck out of everything. Everything within our app is instrumented, as one might expect. We look at how many times somebody accesses a menu, how many times people select menu items. That gets scaled out across the millions of users that we have that are interacting with this feature. All of those events are getting sent to our Kafka pipeline and archived to S3, and then we can start interacting with them using different compute engines.
Some of the use cases that we have around how we use the events. The first one is we actually use it to trigger our offline processes. Anytime we want to act on certain event data, we can set up an offline process to do that. In this particular case, we’re doing it for our match quality, so improving the match quality of new data that’s coming in. We also do a lot of analytics and quality assessment, which I’ll walk you through real quick.
First of all, we have a job that wakes up every hour and takes a look at what events are there for that given hour. The menu access events, the restaurant access events, all that stuff. We take a look at how often the restaurant has been accessed, whenever a new match has been created by the user (this is where the user says the match that we provided isn’t good enough, so they’ve created a new match by doing a search against our Food DB), or whenever new menu items have been added. We actually schedule this out of Qubole. We recently moved this to Airflow, but this is just a snapshot.
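A rough sketch of what such an hourly pass might look like, assuming the events for the hour have already been read from the archive into memory. The event type names, the `ts` and `restaurant_id` fields, and the function names are hypothetical; the scheduling itself is left out.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical event types that should trigger re-matching for a restaurant.
TRIGGER_TYPES = {"match_created", "menu_item_added"}


def events_in_hour(events, hour_start):
    """Keep only events whose timestamp falls in [hour_start, hour_start + 1h)."""
    hour_end = hour_start + timedelta(hours=1)
    return [e for e in events if hour_start <= e["ts"] < hour_end]


def summarize_hour(events, hour_start):
    """Count restaurant accesses and list restaurants that need re-matching."""
    window = events_in_hour(events, hour_start)
    access_counts = Counter(
        e["restaurant_id"] for e in window if e["type"] == "restaurant_accessed"
    )
    to_rematch = sorted(
        {e["restaurant_id"] for e in window if e["type"] in TRIGGER_TYPES}
    )
    return access_counts, to_rematch


def at(hour, minute):
    """Helper for building example timestamps on one arbitrary day."""
    return datetime(2017, 1, 2, hour, minute, tzinfo=timezone.utc)


events = [
    {"restaurant_id": "r1", "type": "restaurant_accessed", "ts": at(9, 5)},
    {"restaurant_id": "r1", "type": "menu_accessed", "ts": at(9, 10)},
    {"restaurant_id": "r2", "type": "match_created", "ts": at(9, 30)},
    {"restaurant_id": "r1", "type": "restaurant_accessed", "ts": at(9, 40)},
    {"restaurant_id": "r2", "type": "restaurant_accessed", "ts": at(9, 59)},
    {"restaurant_id": "r3", "type": "menu_item_added", "ts": at(10, 0)},
]
access_counts, to_rematch = summarize_hour(events, at(9, 0))
```

In this example only the second restaurant is queued for re-matching in the 9 o’clock window; the menu addition at exactly 10:00 belongs to the next hour’s run.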
Once we do that, we take that event data. Again, events are coming in through Kafka and we’ve stored them in S3. Now we’re exporting them out into a CSV format so that we can push that into our machine learning framework and act on it. This is the second stage: we’ve triggered an offline job to go gather some data, and we now want to use that data to improve the match quality. Once we have that data, we’ve pushed it out in the CSV format, and then we have a matching algorithm, which is based on Pairwise RankSVM, to do the selection process or the rank process.
Quick overview of this. I don’t want to get too detailed into this, but essentially, we looked at various different models for the best way for us to do a match. We looked at string similarity, and we looked at other approaches, and we came up with Pairwise RankSVM as the best candidate. The basic idea behind this is, you have match A, which is the match that we’ve provided based on some of the training that we’ve done, and then number two is what the user selected.
If the user went in and selected a different food item, that’s candidate match number two. We compare the two matches to figure out what’s the best match for this food item, and then we surface that back up to the user.
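To make the pairwise idea concrete, here is a minimal sketch of a linear pairwise ranker in the RankSVM style: each (preferred, other) pair becomes a difference vector, and a linear model is trained with a hinge loss so the preferred candidate scores higher. Everything specific here is an assumption for illustration: the two features (string similarity and how often users picked the item), the toy numbers, the hyperparameters, and the use of plain subgradient descent instead of a production SVM solver.

```python
def train_pairwise_ranker(pairs, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w so that w . (preferred - other) >= 1 for each training pair.

    Minimizes an L2-regularized hinge loss on difference vectors using
    simple subgradient descent.
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, other in pairs:
            diff = [p - o for p, o in zip(preferred, other)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for i in range(dim):
                grad = reg * w[i]
                if margin < 1.0:
                    # Pair is inside the margin: push w toward the difference.
                    grad -= diff[i]
                w[i] -= lr * grad
    return w


def score(w, features):
    """Linear relevance score for one candidate food item."""
    return sum(wi * xi for wi, xi in zip(w, features))


def best_match(w, candidates):
    """Return the name of the highest-scoring candidate."""
    return max(candidates, key=lambda name: score(w, candidates[name]))


# Hypothetical features: [string similarity to menu item, user selection rate].
# In each pair the first vector is the item the user chose over our match.
training_pairs = [
    ([0.90, 0.8], [0.95, 0.1]),
    ([0.70, 0.9], [0.80, 0.2]),
    ([0.85, 0.6], [0.50, 0.5]),
]
weights = train_pairwise_ranker(training_pairs)

candidates = {
    "system_match": [0.95, 0.1],
    "user_selected": [0.90, 0.8],
}
winner = best_match(weights, candidates)
```

Training on differences rather than on absolute labels is what makes the user’s correction useful as a signal: the model only needs to learn that the chosen item should outrank the one it replaced.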
This is, again, walking through that scenario that I just talked about, where we have a food item. A user searches for Americano, and then they see that we’re providing a match, Americano-Americano. They feel that it’s not accurate, so they go into our food search view here, and they’re searching for the word “Americano”. As you can see, there’s a whole different list. This is, again, user-generated content. We see tons of Americanos.