Getting Healthy with Hadoop: Big Data Analytics in Healthcare
Speaker 1: Hi guys, my name is Caltec [unintelligible 00:00:12], just a quick introduction. I'm with MyFitnessPal, now under Under Armour Connected Fitness. We'll talk about that in a second, and I want to introduce Pujat.
Pujat: Hey guys, I'm Pujat, a software engineer on the data science engineering team at [unintelligible 00:00:28].
Speaker 1: Today we're going to talk to you about our approach to batch processing big data within MyFitnessPal. Just a quick show of hands from everyone here: how many folks have used MyFitnessPal before? Awesome. Keeping your hands raised, how many folks have used MapMyRun, MapMyFitness, any of those products? Good. The point is, today we're going to talk about MyFitnessPal and how we joined Under Armour Connected Fitness. Very quickly, we'll talk about our data assets, our data platform, what we're trying to get to, what we're trying to solve for, and finally our use cases.
Pujat's going to walk us through the common workflow that helps enable our platform. Just a quick introduction to MyFitnessPal, since most of the folks here have used it or seen it. The big takeaway here is we have over 90 million registered users, we have 7 million searchable food items, and about 19 billion logged food entries so far. Back in March we were acquired by Under Armour, and with the acquisition we joined two other companies: MapMyFitness, and another company in Europe called Endomondo. Together we have formed Under Armour Connected Fitness, and we have about 140 million registered users across Under Armour Connected Fitness.
You can read all about it when we post the slides; I won't go too much into the detail here. What does that mean for us today? It means we started with food and nutrition information (I talked about almost 20 billion food entries, 38 million recipes, 90 million users) and we're now combining that with a lot of workout and activity information. With MapMyFitness, we have a large user base that tracks over 600 different types of fitness activities; we have time series GPS data that includes geospatial data as well as real-time GPS data.
That's a really interesting dataset. We get music data with workouts, and we also get a user's activity throughout the day, which is very interesting. We also understand people's sleep patterns, and we get to combine that with calorie information from MyFitnessPal and see what people are burning in terms of calories. Under Armour, as most of you are familiar, has traditionally been a retail company, so we now have really interesting retail data that we can pull into this as well. I forgot to mention earlier: if you guys have any questions, please hold them till the end and we can try to answer them, or I'll just let Pujat answer all the questions.
Some of the characteristics of our health and fitness data: as I mentioned, time series is really important. You have wearable sensors, GPS data, real-time geospatial data, and the list is continually growing. We have a whole host of wearable technologies coming to market, and we have this whole quantified-self movement. Getting that data is really important, time series data becomes very important, and the volume is really crazy. We have about 20 million food logs that take place on a daily basis, and we also get bursts of data, which is kind of interesting.
For example, you can imagine when there's a marathon or a race: scoping it to specific cities or regions, you see a burst of run or workout activity. There's also a predictable pattern where at the beginning of the year everyone makes a New Year's resolution to eat healthier and work out, so we actually see growth in our data. We get three times the volume of data flowing into our system, but that's predictable year after year.
Speaker 2: When is the least fit time of year?
Speaker 1: I will answer that question now: the least fit time of year really is around the holidays, as you can imagine. The period between Thanksgiving and New Year's usually tends to be the least fit time of year. So we have really interesting datasets and a growing data asset, and we're now starting to think about how to build a platform to handle this growing, changing dataset and volume of data. This graph is influenced by Confluent, but there are really two big components of our data infrastructure. On the left side you'll see Kafka and on the right side you'll see S3. Kafka is our real-time data streaming platform, S3 is our central repository, and the key component here really is down at the bottom.
As most of you are aware, we're using Kafka for a lot of our real-time or near-real-time streaming and processing of data. We want to use S3 for a lot of our offline batch processing of data. We're supporting several use cases, one of them being business intelligence capabilities, and various others which I'll get into in a second. A really cool example in our space, at least for real-time data that we're processing: we have this notion of campaigns, campaigns around fitness, where a user signs up for a campaign and then logs or triggers an event.
That event gets fired, some processing happens, and then push notifications go out to folks that have signed up for that campaign. For example, your friends get notified of the fact that you've finished a particular fitness campaign. For us, a use case around batch processing is really the weekly e-mails: you want a user at the end of the week to get an e-mail aggregating and summarizing their workouts and what they've eaten. That's a pretty classic example.
If we dive in a little bit deeper, particularly around storing our data: one of our big focuses is really how do we move all of our data into S3 and centralize it so we can make it accessible? We have Kafka sending data streams with app and service events, and we also have third-party information that we want to store. We certainly want third-party information flowing in as much as possible through Kafka as a data stream, but there are going to be situations where we want to take snapshots of that data for batch processing.
The same is true for our relational databases: we want the streaming of relational database information to go through Kafka, but there might be situations where we need to batch process it or take snapshots of the data.
Now, once we have all the data centralized within S3, we use Hive and Presto, enabled by Qubole, for a lot of our querying and batch processing. We want to enable data science use cases using Spark, and Databricks is helping us with that, and certainly for business intelligence we use Redshift as our analytics data warehouse. Given the acquisition, the various datasets, and the centralized repository that we're moving towards, these are just highlights of the problems we're trying to solve and overcome.
First of all, we really want to build automation: we want to get away from manual data replication and the custom tooling around it. We want standardization across all our data, so common data specs as well as common directory structures across Connected Fitness data, and certainly consistency. We wanted to develop a common workflow to really enable health and fitness data.
The common workflow piece is what we want to dive into a little bit more, along with the use cases. The best way to read this slide is from the bottom up. I talked about some of the use cases, but just to dive into them a little bit more: we have a lot of analytics use cases, certainly around business intelligence; there are different business metrics that we need to provide, like DAU and MAU, for different stakeholders. The other really interesting use case that we're trying to enable is user insights across apps: we now have users that are on MyFitnessPal and the same users that are on the MapMyFitness apps.
Being able to connect those users behind the scenes and provide them better user experiences within the apps is something that we're really thinking about, certainly around data products. One of the examples we've listed here is around search improvements, where, as foods are getting logged, we count the food logging and then promote certain food items based on how often they're logged. That's an example of a search improvement we want to enable with a data product. Certainly around data science, we have personalization and recommendation engines. For example, let's say you ate a banana for breakfast and had a burrito for lunch.
What are some recommendations around food that we can make to help you stay within your goals and calorie budget? Finally, we support data exploration use cases, where you allow ad hoc querying into our data source. With the common workflow, we're really trying to think about it from the perspective of enabling batch processing and serving. If you think about the raw and derived views, those are really supporting our batch processing capability, and our source of truth is really enabling our serving layer. Now I'm going to hand this over to Pujat to dive into the use cases some more and walk us through this workflow.
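As a rough sketch of the search-promotion idea described above: count how often each food is logged and rank popular items first. The sample data and the ranking rule here are illustrative assumptions, not the production logic.

```python
from collections import Counter

# Hypothetical sample of logged food names (illustrative only).
food_logs = ["banana", "burrito", "banana", "oatmeal", "banana", "burrito"]

# Count how often each food is logged; the counts become a ranking
# signal that promotes popular items in search results.
log_counts = Counter(food_logs)

# Most-logged foods first; ties broken alphabetically for determinism.
ranked = sorted(log_counts, key=lambda food: (-log_counts[food], food))
print(ranked)
```

In practice this aggregation would run as a batch Hive job over the derived tables rather than in memory, but the signal is the same: a per-food log count that feeds the search ranker.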
Pujat: I want to talk about an example of a workflow that we have: how we take data from Kafka to S3, and then how we leverage Qubole's capabilities to actually process that data. As you know, data comes in as a data stream into S3. The data that we have in Kafka comes in as [unintelligible 00:11:30], and we want to store that in S3 in a particular directory structure.
The reason that we have this kind of directory structure is that it enables us to have data from multiple different data sources. For example, we have MyFitnessPal, we have MapMyFitness and Endomondo, so providing the namespace is really important, and tying that with the food type, sorry, the event type [unintelligible 00:11:55]. The reason that we have food, the event type, as a separate subdirectory is that we don't want to store all this data in one [unintelligible 00:12:05]; it would be difficult for us to process it. As we partition on [unintelligible 00:12:13], that enables us to have hourly workflows or daily workflows.
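A minimal sketch of that layout in code: one possible way to build the namespace/event-type/date prefix for an event. The segment names and the date partitioning scheme are assumptions for illustration; the talk doesn't spell out the exact paths.

```python
from datetime import datetime

def s3_prefix(namespace: str, event_type: str, event_time: datetime) -> str:
    # Namespace first (myfitnesspal, mapmyfitness, endomondo, ...), then the
    # event type as its own subdirectory, then a date partition so hourly or
    # daily batch jobs can scan only the slice of data they need.
    return (
        f"{namespace}/{event_type}/"
        f"year={event_time.year}/month={event_time.month:02d}/day={event_time.day:02d}/"
    )

print(s3_prefix("myfitnesspal", "food_logged", datetime(2015, 9, 30)))
# myfitnesspal/food_logged/year=2015/month=09/day=30/
```

The `key=value` partition segments match what Hive external tables expect, which is what makes the later "map a schema onto the directory structure" step cheap.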
Here we have 20 million food entries per day and 2 million workouts logged per day; different events have different volumes coming in. Let's discuss how we leverage Qubole's capabilities and use Hive to process data in S3. First of all, as we all know, we create external tables mapping the schema onto the directory structure where the raw data lands. We have time series data coming in via Kafka as streams. Certain events might be batched up together, and so that's why we have events [unintelligible 00:12:56]. It allows us to batch certain events together, and later on we have to explode that into individual events.
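The batch-then-explode step might look something like this sketch; the envelope shape and field names are assumptions, since the actual event schema isn't shown in the talk.

```python
import json

# Hypothetical envelope: several events batched into one Kafka message.
batch = json.dumps({
    "source": "myfitnesspal",
    "events": [
        {"type": "food_logged", "food_id": 1},
        {"type": "food_logged", "food_id": 2},
    ],
})

def explode(envelope_json: str):
    """Yield one record per event, copying the shared metadata onto each."""
    envelope = json.loads(envelope_json)
    for event in envelope["events"]:
        yield {"source": envelope["source"], **event}

records = list(explode(batch))
```

In the pipeline described here, the equivalent of `explode` would run in Hive (e.g. via a lateral view or UDF) as part of producing the derived tables.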
Now what do we have? We have external tables containing data as raw [unintelligible 00:13:10]. We want to analyze that JSON, and since that JSON is nested, we have to flatten it to process it with high performance. For that, we have derived tables which store the data as ORC. In this example use case, like Caltec talked about, we enable data products like search improvements. Let's say we want to count the number of times a food has been logged; in this case we have the food ID, the version, because [unintelligible 00:13:41], and when it was last modified.
We create this ORC table and store the data in this format. We take the data coming in as raw and move it to ORC, and this is where we do all the transformations from JSON to an actual flattened ORC table. The idea was to create a source of truth: you want to enable people to come in and run Presto queries against the source of truth for analytics. There are two types of events here: we have transactional events and we have non-transactional events. Examples of non-transactional events would be a food search, a food log, a session started, a user logged in; these contain all the information right there on the event, as compared to transactional data.
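The JSON-flattening transformation can be sketched like this: nested objects become dotted column names, the flat shape a columnar (ORC) table wants. This is a minimal sketch; a real pipeline would also handle arrays and type coercion, and the field names below are made up.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    # Recursively pull nested dict fields up to the top level, joining
    # key paths with dots so each leaf value becomes one flat column.
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

row = flatten({"food_id": 42, "meta": {"version": 3, "modified_at": "2015-09-30"}})
```

Each flattened key would map to a typed ORC column in the derived table, which is what makes the later Presto queries fast.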
With transactional data, we have: a resource was inserted, a resource was updated, a resource was deleted. Once you have all these cases, how do you combine the inserts, updates, and deletes? The problems are merging these into the base table and how we backfill the historical data. For [unintelligible 00:14:57] small datasets, there's a reference implementation from Hortonworks where we can [unintelligible 00:15:04]. You have these incremental data views and you have a base table; you make a reconciled view, which aggregates the inserts, updates, and deletes, and you put that into a base table. Then you purge the existing base table. That works for relatively small datasets, but it's not scalable for billions of records, and it certainly does not fit our requirements. We do that for certain small datasets.
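A toy version of that reconcile step, to make the idea concrete: fold a batch of incremental inserts, updates, and deletes into a base table keyed by resource ID, keeping the newest version of each record. Record shapes and field names are assumptions for illustration.

```python
def reconcile(base: dict, increments: list) -> dict:
    """Merge incremental changes into a copy of the base table."""
    merged = dict(base)
    for change in increments:
        record_id = change["id"]
        if change["op"] == "delete":
            merged.pop(record_id, None)
        else:
            # Last write wins: apply inserts/updates only if they are
            # at least as new as the record we already hold.
            current = merged.get(record_id)
            if current is None or change["modified"] >= current["modified"]:
                merged[record_id] = {k: v for k, v in change.items() if k != "op"}
    return merged

base = {
    1: {"id": 1, "calories": 100, "modified": 1},
    2: {"id": 2, "calories": 200, "modified": 1},
}
increments = [
    {"op": "update", "id": 1, "calories": 150, "modified": 2},
    {"op": "delete", "id": 2, "modified": 2},
    {"op": "insert", "id": 3, "calories": 300, "modified": 2},
]
result = reconcile(base, increments)
```

This mirrors the reconciled-view pattern: rebuild the base table from base plus increments, then swap it in. Rewriting the whole base table on every merge is exactly why it stops scaling at billions of records.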
For large datasets, we actually export to a [unintelligible 00:15:35] database like [unintelligible 00:15:36] for now. We are looking into better options there, like HBase or Cassandra. Using this workflow, we analyzed the food entries using Presto, and here you can see the pattern of food entries in a month: on weekends people log less, and on weekdays people log more. Now that we have Hive and Presto, when do we decide which to use? Hive is good for batch processing; it has the capability for users to define their own functions, and it's good for large aggregations as compared to Presto. Presto we use for interactive, ad hoc queries and for data exploration.
People first go and use Presto, and if it doesn't work for their use case, they use Hive. This is some query performance on workout data that we get from MapMyFitness. The best performing is Hive, sorry, Presto on ORC tables, compared to Hive on ORC or Hive on CSV. The question arises: why did we choose ORC? ORC is our format going forward because, first of all, it has a columnar structure, and it also enables us to do specific encodings. Strings are compressed differently in [unintelligible 00:16:57].
There is a single SerDe for all ORC files, so we don't have to think about which SerDe to use or how the types get translated. Any structure that we store in ORC has a type associated with it, and it works well in Presto. An important thing here is that Hive 1.2 is going to have ACID support, and that works with the ORC format only right now. What's next? We'll be updating to Hive 1.2 soon, which will have ACID support, so that we can build source-of-truth tables in Hive: we can merge inserts, updates, and deletes, make that the source of truth, and run Presto queries for exploration or Hive queries for large aggregations.
We are now looking at choosing a better workflow manager and scheduler, and we are actively exploring options there, with Pinball from Pinterest or Airflow from Airbnb, plus other options as well. We want to make a self-service data platform where anyone can come in and build their data pipelines; it has to be really easy for them to do that. Let's summarize.
We're the world's largest health and fitness community, and we have a unique collection of data mixing food, nutrition, workout activity, and sleep; our health and fitness data has unique characteristics. We have time series data, fitness and workout data, fitness activities, and real-time geospatial data as well. What can we look forward to in the future? We have to build data infrastructure to support all these current and future needs. As we all know, we're moving towards a lot of sensors and a lot of wearables; how do we collect all that data and process it in the best way? We're just getting started there. Thank you guys.
Speaker 3: What software do you actually use to ingest into S3 from Kafka? Do you use [unintelligible 00:19:02]?
Pujat: We use Pinterest's implementation, Secor, which basically consumes from all the Kafka topics and puts the data into S3.
Speaker 3: What is it called?
Pujat: [unintelligible 00:19:12]. Any other questions?
Speaker 4: Why do you use Kafka and not [unintelligible 00:19:21]?
Pujat: Why do we use Kafka and not [unintelligible 00:19:24]? We have a lot of people coming in from LinkedIn and they’re —
It works well as well; it's really scalable, and there are no hard requirements on the number of topics or the number of partitions you can have. We just keep on adding more nodes; it's elastic, and that works well for our use case.
Speaker 5: We're starting to hear interest from the medical community to cross-correlate with health data, maybe de-personalize it, but —
Pujat: Well, definitely. We'll have to see how the industry shapes up around this. There are certainly legal implications of how you share medical data as well, so we'll have to see how we all evolve together: the health sector, the fitness sector, and how we merge them together.
Speaker 6: That’s the other question, can you figure out if you actually lose weight by doing more exercise?
Speaker 3: Does it? I mean, can you do it?
Speaker 1: Not now, but–
Speaker 1: I think one of the interesting things is we certainly want to enable use cases where you can find correlations in the data and correlate them to user behavior. There are a lot of factors involved, and the answer isn't as easy as yes, as I was kind of joking. I think the dataset gives you the ability to find these patterns and to be able to make those kinds of correlations.
Speaker 5: I was talking to a data scientist yesterday, and he was saying that one of the things he felt people often need to focus on in their work was how they can basically improve things [unintelligible 00:21:18] and how they could keep on building out. Talking about your pipeline right now, what do you feel is the easiest to [unintelligible 00:21:24] change, and what do you feel is actually the hardest? Like, would it be easy for you to switch over to Flume from Kafka? Is that stuff partitioned? How flexible do you feel about that, without getting–?
Pujat: Well, I think at our company we are going to focus on Kafka, and we are not going to change over to Flume. That's one thing that's for sure, but on other things I think we are flexible, depending upon what we think is better–
Speaker 5: I wasn't trying to stick to that specific example; I was just asking in general, as somebody who's working there day-to-day. Do you feel like there are particular analytics or things that you want to run that you have a hard time with right now with the current setup, or do you feel like it's [unintelligible 00:22:04] really easy for you to keep on pushing the current setup?
Speaker 1: I can kind of take a stab at that. One of the things we're always really mindful of is whether our platform and our tools are enabling the use cases that we want to solve for, within our organization as well as the community. Iteration is something that we're always thinking about, so it has to be use-case driven: are we running into any kind of roadblocks, and is the tooling in place able to provide the capability that we need, or do we need to look at something else?
Certainly, we try to follow principles around making sure that we're able to switch things out if we need to; that's part of the consideration of being able to move fast. It's not a strong answer to say we have this one philosophy, but it's certainly something for us to think about as we grow. Our use cases have grown and we've built on top of the system as well.
Speaker 7: All right, thank you very much. We're going to have an unconference session afterwards; it's a great opportunity to ask more questions about what they're doing. I just want to transition over to the next speaker, Asheesh from Qubole. [applause]