Sriranjan Manjunath – Data Platforms 2017


Sriranjan Manjunath: Good afternoon everyone. Thanks for joining us. The title of my talk is how we built a scalable, real-time user targeting system. My name is Sriranjan Manjunath, if you don’t know who I am. I’m the CTO of Saavn. If Saavn is a new term for you, Saavn is India’s music streaming service. We started out serving the expat community in the US and UK, but it turns out we became very popular in India.

We have a desktop site. We have a bunch of mobile apps on every platform that you can think of. You can pretty much go there, search for any song you want and start listening to it for free. The free comes with a disclaimer, but, yes, mostly. Yes, it’s in certain countries there is. In India, it’s still free, and in a few other countries, it’s still free.

Let’s look at some of our numbers, just to give you a little more context on what kind of data we are looking at. We have about 20 million or so global MAUs, close to 19 million of which are from India. The second metric is very important for any music streaming startup. In fact, we pay our labels based on this. We get compensated based on this. It’s what we refer to as streams a month. We do about half a billion streams every month. It keeps growing.

In fact, we’re happy that, for the last three to four years, this number has doubled. We have about 30 million or so tracks live on Saavn. That supports about 90 different languages, depending on the geo where you are. If you’re in India and a few other countries, then you get all of the English content as well. The content that you see is actually dictated by the rights that we have.

Now, a lot of you deal with Indian Samsung, so you might be aware of this, but over the last couple of years, there’s been a huge growth in the mobile phone market. You can actually get a decent Android device for $40. That will let you download any app you want. It’s your foray into the whole smartphone ecosystem. Naturally, we have seen a huge jump in our traffic as well.

Our DAU has grown about 9-10X over the last couple of years, ever since the whole mobile ecosystem started changing. Now, if you’re wondering what it really looks like, this is actually a photo of a random shopping mall in Mumbai. People are actually shopping for phones there. It’s not a sale, it’s nothing. It’s just a regular Sunday. This is the good side of the Indian ecosystem, the stuff that we see in the Indian market.

There are a few challenges, especially for a music streaming startup. The network conditions are pretty bad in India. It’s improving a lot with Reliance Jio and stuff, but it’s still not there. Look at where India is compared to the US. On average, a 3G or 4G network has about half the bandwidth of what you see in the US. These are the challenges that we, as a music streaming service provider, need to address in our apps.

Our apps should work on really low-end devices, and they should work on very choppy connections as well. Other than that, there are some other fun challenges that we find in the music ecosystem as well. One is that it’s very diverse. India’s very diverse. We support about 26 official languages, which means our search queries need to understand these 26 different languages as well.

Your phonetic rules change. It’s not a regular string search anymore. It’s not a string distance algorithm, for those of you who don’t know. It’s a lot different. You’ve got to find other algorithms there. Same thing with radio. If you can see the image on the right, an artist can compose in multiple languages. What does that mean? If you like Rahman in Hindi, does it mean you also like Rahman in Tamil?

Maybe, maybe not. Maybe you’re just a person who doesn’t like Tamil songs. We don’t know. We need to understand the user behavior. That’s actually a challenge for us. On top of the 26 languages, there are a bunch of genres in every single language that you can think of. We have to account for that as well, and the same problem exists here as well. Indian artists tend to experiment across genres.

In fact, a single album will have songs from the same artist but belonging to different genres. We need to understand it, and we have to somehow solve radio, taking into account all of these things. We deal with a bunch of music labels. 800 is actually a very small number. We actually deal with half a million music labels across the world, but then, we don’t interact with all of them.

There are about 800 major ones that we deal with, who in turn go and get us rights from everywhere else. What all of this boils down to is that context is key in music. We need to understand where the user is, what the user is doing, what will he like, what do we play for him. If we can solve that problem, then our engagement rates go up. It’s also a very good experience for the user. That’s one of the key things within our company.

Whenever we try to decide on a project, et cetera, the question is: is it going to impact engagement? If you get the context right, it will make a huge change in the engagement itself. That’s what we try to keep doing. The system that we’ll be talking about later is in that direction.

First, let me tell you where we are and how this is relevant to data. Years back, when we started, about six years back or so, we realized that we had to understand our logs, understand how our systems are laid out, how our employees function, how our users are using our app, et cetera.

Just like any other company, we created a data analyst team whose role was to understand data and then give inputs to every other function within the company, whether it’s engineering, marketing, any of that. For lack of a better word, that ended up becoming a data silo. Right now, if you wanted to understand what’s happening with the data, guess what? Go to the analyst.

Now the analyst layer has formed a bottleneck. There’s only so much that they can actually do. We tried to ramp up product managers using Hive in Qubole. They’re pretty heavy users of Qubole for that matter. It works in some cases, but not always. Some product managers are okay with learning Hive, et cetera, but not all of them.

What we did here was identify some common patterns across all the queries that our analysts as well as product managers want. We built some Hive tables that we keep and that get updated every day, et cetera, and a bunch of settings. For the other side of the system, which is the engineering systems that want to use this data, we indexed a bunch of data and stored it in MongoDB.

To give an example, we send a lot of push notifications. Any mobile app service partner will actually send you push notifications. In a lot of ways, it’s free growth. You get the user’s attention. If it’s catchy enough, the user is going to open up the app. If it’s targeted well, we have seen our open rates go up by three or four X compared to just random popular content that we see.

For that particular use case, so that we can send millions of push notifications every day, we actually indexed a bunch of attributes. We use Mongo for that purpose; I’m not going to cover why we used it. We essentially built a dashboard, which would let you choose filters. What do you want? For example, users who listened to an artist or listened to a genre in the last such-and-such time.
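As a rough illustration of how a dashboard filter like "listened to an artist in the last N days" could map onto an indexed store, here is a minimal sketch. The document layout (a `listens` array with `artist`, `genre`, and `ts` fields) and the function name `build_listen_filter` are assumptions made for this example; the talk does not describe Saavn's actual schema. Only the `$elemMatch` and `$gte` operators are standard MongoDB query syntax.

```python
from datetime import datetime, timedelta, timezone


def build_listen_filter(field, value, days, now=None):
    """Build a MongoDB-style filter for users who listened to `value`
    (matched on `field`, e.g. "artist" or "genre") within the last `days` days.

    Assumes a hypothetical user document with a `listens` array whose
    entries look like {"artist": ..., "genre": ..., "ts": <datetime>}.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return {
        "listens": {
            # $elemMatch requires both conditions to hold on the same array entry.
            "$elemMatch": {field: value, "ts": {"$gte": cutoff}}
        }
    }


if __name__ == "__main__":
    print(build_listen_filter("artist", "A. R. Rahman", days=7))
```

A dictionary of this shape is what a driver call such as PyMongo's `collection.find(filter)` accepts, so a dashboard could translate each selected filter into one such clause.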

You can send push notifications. You could also schedule these based on user behavior and stuff. The good news was that this was very quick to get to market. Analysts already had queries, and they could just carry on with whatever kind of queries they wanted to run. That was there and, on the other side of things, other systems could still use the indexed data.

The bad thing that we started noticing after we launched this was that this entire model is very restricted. Notifications are good, but what if you want to use it for ad systems? How can you monetize based on this platform? Can we use some of these systems to talk to the Hive data that we have created? Maybe, but it still has its challenges. Also, we started noticing that there were too many repetitive queries being run by the analysts themselves.

A lot of product managers within the team want to use the same set of data, and they want the most up-to-date data. They would just re-run the same queries over and over. It’s not the most efficient way to function. The system that we’ll be looking at, Sniper, tries to solve some of these problems. In short, at a very generic level, it’s a system to target cohorts of users. A cohort is as you define it: it can be based on a date range, it can be based on a location, it can be based on some sort of user behavior. It should be generic in its definition.

What do we mean by features? What are we trying to target here? Some examples here. It can be geo, artists that you listen to, genres that you listen to, time zones, operating system, devices, the regular stuff. The two important ones in here are international travelers and fitness enthusiasts. These, as you might have already guessed, are pretty abstract concepts. What do they really mean, right? We need to define these things. It’s not an easy attribute to fetch.

Again, just continuing this whole thing. What kind of queries are you going to run? Users who have subscribed but haven’t downloaded yet. These are the kind of queries that the analyst would want to run in such a system. Users who have listened to an artist, that’s kind of a simple one. Users who went pro last week, that’s sure. You want to measure the conversion rates for the login screen, for that matter, or all users who have subscribed but haven’t downloaded songs yet.

That section is going to contribute to your churn, so it’s very important to figure out who these people are and why they are not using it and such. These are some of the queries that we want to address as part of the system. At a high level, there are a few things that we want to do. First, democratize data. I know it’s being said everywhere, but we realize the need for this within our organization.
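The "subscribed but haven't downloaded songs yet" cohort mentioned above is, at its core, a set difference. The sketch below shows that idea on an in-memory list of events; the event shape and the type names `subscribe` and `download` are hypothetical and chosen only for illustration, not taken from Sniper.

```python
def subscribed_not_downloaded(events):
    """Return the user ids that have a subscribe event but no download event.

    `events` is an iterable of dicts with hypothetical keys
    "user_id" and "type".
    """
    subscribed = set()
    downloaded = set()
    for event in events:
        if event["type"] == "subscribe":
            subscribed.add(event["user_id"])
        elif event["type"] == "download":
            downloaded.add(event["user_id"])
    # Users at risk of churn: paying, but not using the download feature.
    return subscribed - downloaded


if __name__ == "__main__":
    sample = [
        {"user_id": "u1", "type": "subscribe"},
        {"user_id": "u1", "type": "download"},
        {"user_id": "u2", "type": "subscribe"},
        {"user_id": "u3", "type": "download"},
    ]
    print(subscribed_not_downloaded(sample))
```

Expressing cohorts as set operations over per-attribute user sets is one common way to keep such queries composable; whether the production system works this way is not stated in the talk.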

How do we make sure that the product managers or marketing folks don’t even go through analysts to get data and insights, but instead do it themselves? Analysts are required for deeper analysis of the data; they don’t have to do the same exact thing every day. That was one of our primary goals.

The next thing is obviously we want to generate cohorts for targeted notifications, ads, and other things. One of the things that we really need here is, we want the data to be updated in real time. The queries need not be in real time, it’s okay if it’s in the order of minutes, but the data should be updated in real time.

The other thing you want to do is you want to create transitions, events, and sessions. You saw the login screen, you converted, you signed up. That’s a transition for us. You’re a free user, then you pay up; that’s again a transition. That’s important for us to capture because a lot of our analytical queries are based around this.
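To make the idea of a transition concrete, here is a minimal sketch that turns a stream of per-user state observations into transition records. The tuple layout `(user_id, timestamp, state)` and the state names are assumptions for the example; the talk does not specify how transitions are actually represented.

```python
def extract_transitions(observations):
    """Yield (user_id, from_state, to_state, timestamp) whenever a user's
    state changes between consecutive observations.

    `observations` is an iterable of (user_id, timestamp, state) tuples,
    where the states ("free", "pro", ...) are hypothetical labels.
    """
    last_state = {}
    transitions = []
    # Process each user's observations in time order.
    for user_id, ts, state in sorted(observations, key=lambda o: (o[0], o[1])):
        previous = last_state.get(user_id)
        if previous is not None and previous != state:
            transitions.append((user_id, previous, state, ts))
        last_state[user_id] = state
    return transitions


if __name__ == "__main__":
    obs = [
        ("u1", 1, "free"),
        ("u1", 5, "free"),
        ("u1", 9, "pro"),
        ("u2", 2, "login_screen"),
        ("u2", 3, "signed_up"),
    ]
    for t in extract_transitions(obs):
        print(t)
```

Storing the transition itself, rather than only the latest state, is what lets a later query ask things like "users who went pro last week".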

Event streams. What did you listen to within a session? What was the artist that you listened to within a session? This lets us train our ML as well as AM models, right? You listened to Rahman in the evening between six and seven PM. Maybe there’s some correlation. We want to be able to mine such data and we want to index all of that.
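One common way to answer "what did you listen to within a session" is to split a user's event stream wherever there is a long gap of inactivity. The sketch below does this for a single user; the 30-minute gap and the `(timestamp_seconds, artist)` event shape are assumptions for illustration, not values given in the talk.

```python
def sessionize(events, gap_seconds=1800):
    """Group one user's (timestamp_seconds, artist) events into sessions.

    A new session starts whenever the gap since the previous event
    exceeds `gap_seconds` (30 minutes here, an assumed threshold).
    """
    sessions = []
    for ts, artist in sorted(events):
        if sessions and ts - sessions[-1][-1][0] <= gap_seconds:
            sessions[-1].append((ts, artist))
        else:
            sessions.append([(ts, artist)])
    return sessions


def artists_per_session(events, gap_seconds=1800):
    """Return the sorted distinct artists heard in each session."""
    return [
        sorted({artist for _, artist in session})
        for session in sessionize(events, gap_seconds)
    ]


if __name__ == "__main__":
    stream = [(0, "Rahman"), (600, "Rahman"), (900, "Other Artist"), (10000, "Rahman")]
    print(artists_per_session(stream))
```

Per-session artist sets like these are the kind of feature that could feed a model looking for correlations such as time of day and artist.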

Lastly, just repeating the whole thing, we have about 20 million MAUs and a 40-minute average session time, which means it’s a lot of data, which means it needs to scale. Let’s look at some of the pillars that we have as part of the system. First is our real-time pipeline; we use a Kafka and Storm setup for it. All of the front end box–