Scaling Beyond a Data Warehouse to Meet Customer Demands
Nathan: Welcome. Thanks for coming out. My name’s Nathan McIntyre, for those of you that don’t know, and I’m a data engineer here at Ibotta. Tonight we’re here to talk to you about how Ibotta has scaled from using a traditional data warehouse to using a data lake. Okay, so for the agenda: I'll tell you a little bit about Ibotta so you can get an idea of the kind of data that we have and the kind of work that we’re doing.
Then I'll take you through the history of our growth and some of the challenges we’ve met along the way, those we've solved and those we haven't. Then the data vision: what we’re actually trying to do here at Ibotta with our data team and data lake. After that, I’m going to turn it over to Charlie and Heather to give you guys an idea of what Ibottalytics, our data analytics team, is doing.
We’ll wrap up and have some time for Q&A, but feel free to interrupt me if you have any questions during the talk. If it gets out of control, we'll take it offline. Cool. About Ibotta, probably to give you an idea of the data that Ibotta has: you might know us as that app that gets you cash back on everyday purchases.
It's a little bit more than that: in the past five years we've had over 22 million downloads. We’ve also established ourselves as the third-largest standalone shopping app. That means when it comes to people using their smartphones and shopping online, the only places they stay longer are eBay and Amazon. Our users spend over $5 billion annually, and out of that, we process a little over 12 million redemptions per month.
That’s really just to give you an idea of the kind of data we’re processing and how much we have. We’ve also established ourselves as the number one leader in the purchase-attribution space. What that means is that with the app, we can track what a consumer actually does from the time they see an ad to whether they actually end up buying something, which puts us in a unique position that most advertisers can’t match.
Finally, we’re not just a company that collects purchase data on people; we actually give back. Up until this point we’ve given back over $200 million, and we’re closing in on $250 million. What about the data that we have? As you might imagine, purchase data is very rich and there’s a lot you can do with it. When our users use the app, they scan their receipts.
We actually have item-level data on the entire basket. It’s not just information on the products that we advertise or have redemptions for; it’s the full basket. On top of that, because it is a mobile app, we have access to geolocation data: where a person is when they actually buy something, or when they walk past a store where they had the opportunity to buy and didn't.
Also, we have demographic data: when people sign up, they give us some basic information as part of the onboarding process. We offer things like questionnaires or surveys that you have to answer to unlock offers, and we collect all of that data. Now, some growth and challenges from the history of Ibotta. We began about five years ago. I think the company was probably what you might expect from a startup, pretty basic.
Back then it was a transactional database, which kind of falls out of the structure of the app, pretty traditional. The analytics at the time was really a minimal process. In order to give our customers information on their campaigns, it was really a custom process.
People were building custom Excel spreadsheets. It wasn’t real time; it was a batch process. It took a lot of effort, but it did work for the time. As the company grew, we automated things a little bit, but obviously this isn't something that you can sustain for long. Our first attempt at scaling was via the AWS route.
We’ve been an AWS shop since day one, so this was kind of a natural solution for us. We started out with Redshift, which is kind of Amazon’s de facto big data solution for a traditional type of database. We started automating things with Data Pipeline and started using Looker. It was no longer a manual process; we could give people dashboards and reports on how their campaigns were doing in real time.
Again, for the time this did work, but it soon didn't fit what we were trying to do. Since we’d taken Redshift as our data warehouse, it quickly became the place where everyone would just dump data. While it was probably first supposed to simply do the reporting, it became: anytime anyone needs to do some analysis, let's just go dump the data into Redshift. One of the big problems with that is that compute and storage are coupled in Redshift.
That means if you’re using a managed data warehouse and you keep sticking more and more data in it, you may still be processing with the same amount of servers: you have the same compute needs, but now you have to pay for more storage. There's also a cap on how much you can scale a Redshift cluster. I think Redshift does have a use case if you use it wisely, but it can’t be your end-all, be-all data solution, as we quickly found.
To help solve that scaling problem, we decided we needed to move away from this and establish a data lake. Why a data lake? Well, because you can’t use Redshift to stick everything in there; it doesn’t work. As I alluded to before, a big piece for us was the decoupling of storage and compute that you get with the data lake design. Since we were, and always have been, an AWS shop, it made total sense for us to start using S3 as our backing storage.
With that too, we’ve always been a cloud business, so the idea of using things like ephemeral clusters also made sense to us. Especially with some of the people that started here in data science and data engineering, we had some experience with this, so it was a natural solution for us. There are also plenty of other reasons to use a data lake depending on your use case. We can talk about that later, but it really comes down to that graph.
Some growing pains. This wasn’t just something we did overnight; we ran into a lot of problems. A lot of those problems I think we've done a good job of solving; some we're still working on, and others we still don't know how to solve. One of the big issues we ran into was building a new plane while trying to fix the old one that keeps breaking.
We couldn't just transition to a data lake overnight, stuff all of our data into S3, and continue business as usual; too many things are coupled to Redshift. In addition to trying to scale out this new infrastructure and these new teams, we still have to support Redshift. How we handle that has been an iterative process, and it's been about setting the expectation that that kind of migration is going to take time.
It’s still a work in progress. Another challenge to be aware of when trying to do this yourself is training for a new tool-set. Not everyone has experience with Hadoop, Presto, Spark, and Hive. Making the transition is not something that's necessarily easy for people who have been working for years with a traditional database.
When it comes to Presto and Hive, there are things you have to deal with that you just don't with something managed like Redshift, and because of that, there can be some internal resistance from people not wanting to move onto a new system: if I'm working with Redshift and it works for me, why should I move?
To get around this, we’ve done a lot of internal training and support. For instance, one of the things that we found really useful was setting up a Slack channel specifically to handle query questions for people trying to move from Redshift over to Presto. They would take their queries, which were running fine on Redshift, try to run them on Presto, and they wouldn't get the same results. Those of us with some experience could go to that Slack channel and help them out, and people can search the history, so that's made a huge difference for us.
Another problem with growing this out is scaling the organization to meet supply and demand. As a data engineer here at Ibotta, data engineering is kind of overwhelmed by our Ibottalytics department. You have a ton of people– Yes, you guys.
You've got a ton of people saying, “Give me this data. I've got stuff to do,” and you only have so many people producing it. It could be the other way around, too: a huge data engineering department producing all this stuff with no one around to use it. That is a bit of a challenge, balancing those out, and it's not an easy thing to solve. So how do you solve it? With hiring, and that's not something you can just do overnight or automatically.
Yes, it’s still something that we’re working on today, but I think we’ve done a pretty good job. Lastly, a point to bring up about moving to a new infrastructure like this, a data lake infrastructure, away from something managed like Redshift, is the need for an Ops team. At Ibotta, we don’t have one of your standard, dedicated operations teams, and our DevOps resources are pretty small.
In growing out our data lake solution, we went with Qubole, which really solved this problem for us. For those of you that don’t know about Qubole, it’s kind of a big-data-platform-as-a-service company. Instead of using EMR, we use Qubole. They manage spinning up the clusters in AWS, scaling them, and spinning them down when we’re not using them. It gives us the ability to launch many clusters at a time, of different sizes.
Running Spark, running Hive, so we’re not confined to having an on-prem cluster that just runs Hive at a certain version and is a huge pain to get upgraded. We can spin up as many as we want; as many as we’re willing to pay for.
I want to introduce Qubole since we’re going to be talking about it later on in this talk. The really nice thing about Qubole, on top of managing all of this and providing the support that normally would have been handled by a DevOps team, is the UI that they have.
Everything is integrated in one spot, so you go to one place to manage your clusters. I think "manage" is actually a bit of an exaggeration, because there’s not much to it. It’s also the same place that you go to access all your data. Another benefit: because we let Qubole manage the scaling of our clusters, we also have the ability to specify using spot instances rather than everything on demand, and we can tune that. That saves a lot of money. As you can see, we're going pretty close– What is that? I’ll skip it.

Okay, so the data vision at Ibotta. As we started 2017, we realized that we had a rich set of data, and we realized there are a lot of interesting products we can build out of it. We have the data and we have a vision of what we can do with it: more machine learning, more predictive analytics, less of just reporting on things after they happen, and putting value back into the app, like recommendation systems and personalization.
In order to do that, though, we first needed to scale the organization. We realized that, one, we needed a data engineering team, simply because we didn't have one. We had regular engineering people producing data and then a small analytics department, but in order to take things to the next level, we realized we needed dedicated resources, and we needed these teams to work together.
Now that we have these teams together, they need something to work with, right? That’s where our data lake comes in. Basically, what our data lake looks like is S3 for storage, and our catalog is an external Hive metastore. We use ephemeral clusters that we can spin up and spin down, but we don't lose any data as you would on HDFS. That saves a lot of money. We just keep an RDS instance as the metastore that's running all the time.
We don’t really have any limits. S3 scales pretty much indefinitely, and with Qubole we can scale out our clusters indefinitely, which we simply could not do with Redshift. That serves as the basis for our data pipeline, from ingest from the app or from third-party sources and partners, all the way to the end analytics products that we produce. Data engineering [coughs] handles the piece of getting all the data into one spot. We call that our raw storage. Basically, what that means is we try to keep things as raw as possible, without any kind of transformation, and we make sure everything is in JSON. A lot of the data that we have to begin with is in JSON, so that was a natural choice for us. We also compress the data just to save space, so we're not paying so much for it on S3.
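As a rough sketch of what that raw layer amounts to (the file name and record fields here are made up for illustration, and the real files live on S3, not local disk), compressed JSON lines can be written and read back with nothing but the standard library:

```python
import gzip
import json

# Hypothetical raw-layer records, kept exactly as received (no transformation).
records = [
    {"receipt_id": 1, "retailer": "Walmart", "total": 42.17},
    {"receipt_id": 2, "retailer": "Target", "total": 13.50},
]

path = "raw_receipts.json.gz"

# Write one JSON document per line, gzip-compressed to save storage cost.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Anything downstream (e.g. an external Hive table defined over the same
# layout) can read the files back without engineering involvement.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)  # True
```

The one-record-per-line convention is what lets a Hive table be "slapped on top" of the files directly, as described next.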
Once we have it all on S3, in JSON, it’s really easy to just slap an external Hive table on top of it. Given that, now anybody has access to the raw data themselves, and they don't need an engineering team to get it for them. We just need to set up the process and define the table, and then our work is done, as far as the raw table is concerned. But working with plain JSON files of compressed data doesn't really perform well, as you might know.
To help things out, one of the processes we built re-formats that data into something more optimal for use, like ORC. Also, in the process of optimizing it, we can run data-quality checks to make sure anything that we're releasing to production passes the sniff test.
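A "sniff test" can be as simple as a predicate applied to every record before promotion. This is only an illustrative sketch with hypothetical field names, not Ibotta's actual checks:

```python
def passes_sniff_test(record):
    """Reject records that are structurally broken or obviously wrong."""
    required = {"receipt_id", "retailer", "total"}
    if not required.issubset(record):
        return False  # missing required fields
    if not isinstance(record["total"], (int, float)) or record["total"] < 0:
        return False  # non-numeric or negative totals never pass
    return True

batch = [
    {"receipt_id": 1, "retailer": "Walmart", "total": 42.17},
    {"receipt_id": 2, "retailer": "Target"},               # missing total
    {"receipt_id": 3, "retailer": "Kroger", "total": -5},  # bad value
]

clean = [r for r in batch if passes_sniff_test(r)]
print(len(clean))  # 1
```

In practice checks like these run inside the optimization job, so bad records never reach the production tables.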
At the end of that, we make it available through Qubole, using Hive or Presto, to the entire company, with also the option of loading it into data marts. We use Qubole as kind of a processing engine and then just load the results, the aggregated tables and things that will be used in tools like Looker, back into Redshift.
All of this orchestration takes place in a tool we use called Airflow. It's also a service provided by Qubole, and we have found it, in both data engineering and Ibottalytics, to be a great alternative to [AWS] Data Pipeline. Cool. Where are we now? Take our first four years at Ibotta, with all the data that we'd collected. Now, remember, everything has always been stored on S3.
I think the highest that we got to was about 20 terabytes. Earlier this year, we started using Qubole. As you can see, since then our data processing has risen exponentially. I think in the past three months, our processing, the amount of data that we produce, has gone up 4x. Now we're pushing over 700 terabytes in just a few months, and we totally expect to hit a petabyte this year.
All of this, really, is due to the fact that we have a data lake to process all this data. Now that all that data has been processed, I'm going to let Charlie and Heather talk to you about what we're doing with it.
Charlie: All right, can you all hear me?
Charlie: Good. All right, thanks, Nate. Next up is the analytics section of this platform. I’m going to talk a little bit about how we’re generating all that data that Nate previously displayed.
My name is Charlie, and I'm on the data science team, specifically focusing on feature engineering: essentially, trying to take the raw data in the data lake, extract generally useful information out of it, and turn that into data products that can be used by downstream teams and downstream products. As a data science team, our mission is to build large-scale machine learning models that extract insights out of this raw data and kind of activate the data.
That could be the mission for many different teams at many different companies, so I'm going to talk a little bit about some of the work we're doing at Ibotta and how we do it. Nate spoke a little bit about all the data that we collect. One example of that would be the tens if not hundreds of thousands of receipt images that we get on a daily basis; we can use computer vision to turn those receipt images into item text.
Sometimes our users don't take great pictures of their receipts. Maybe the receipt has gone through the washing machine twice and it's all wrinkled and they're trying to straighten it out, so the computer vision can sometimes mis-categorize or miss many fields on that receipt data. In this instance, we can use things like natural language processing and machine learning to take maybe a string that says "B-A-N", where OCR couldn't tell that it was actually banana, and use machine learning to say, "Oh, this is most likely a banana." Maybe it also missed the price on that receipt, and we can say, "Based on this retailer, this time of year, and this geographic location, what is the average price of bananas?"
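The real fix uses trained NLP models, but the flavor of the problem can be shown with simple fuzzy matching from the standard library; the mini-catalog here is made up:

```python
from difflib import get_close_matches

# Hypothetical mini-catalog of known item names.
catalog = ["banana", "bagel", "orange juice", "milk"]

def best_guess(ocr_text, cutoff=0.4):
    """Map a mangled OCR fragment to the closest known item, if any."""
    # Normalize the fragment: "B-A-N" -> "ban".
    cleaned = ocr_text.replace("-", "").lower()
    matches = get_close_matches(cleaned, catalog, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_guess("B-A-N"))  # banana
```

A production version would score against the whole product catalog and fold in context like the retailer and neighboring line items, rather than edit distance alone.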
Another simple project we work on is imputing demographics. When people register for the app, maybe only 30% of them actually fill in their whole demographic profile, but that data is actually super useful in downstream products, for things like recommendations or in-app customer segmentation.
We can also train models to say, "Based on the information that we get during registration, maybe some of your prior purchase information, and maybe some geolocation information, we can make predictions on the probability that user A is male versus female, or user B is Caucasian versus African-American or Asian."
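As a toy sketch of that kind of imputation, here is a tiny Naive Bayes over one invented feature; the actual models use many more signals and run in Spark:

```python
from collections import Counter, defaultdict

# Made-up training rows: (top purchase category, self-reported gender).
train = [
    ("beer", "male"), ("beer", "male"), ("razors", "male"),
    ("makeup", "female"), ("makeup", "female"), ("produce", "female"),
]

labels = Counter(lbl for _, lbl in train)
feat_counts = defaultdict(Counter)
for feat, lbl in train:
    feat_counts[lbl][feat] += 1
vocab = {feat for feat, _ in train}

def predict(feature):
    """Return the most likely label, with Laplace smoothing."""
    scores = {}
    for lbl, n in labels.items():
        prior = n / len(train)
        likelihood = (feat_counts[lbl][feature] + 1) / (n + len(vocab))
        scores[lbl] = prior * likelihood
    return max(scores, key=scores.get)

print(predict("makeup"))  # female
```

The output is a probability per label, so downstream products can use the score itself (say, P(female) = 0.75) rather than a hard assignment.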
Then finally, we do things like category predictions. Obviously, a big piece of Ibotta is trying to serve the right content to the right users. This means taking the hundreds of categories that we have inside of our app, whether that's milk, produce, canned vegetables, et cetera, and making predictions on the likelihood that every user will go out and purchase those categories in, say, the next 90 days, and making those predictions on a daily basis. How are we actually doing that? We're leveraging Qubole really, really hard. Some of our folks are using mostly Python, a little bit of R, but we're using that strictly deployed in Spark, using PySpark or SparkR, which is great. Basically, as our data grows, we can scale our models and predictions with it.
Similarly to what Nate said, for data retrieval and storage we use Hive. Once the data lake is built, we can read in that data in an optimized format, train a model, make predictions on top of that model, and store those back in the Hive metastore.
Then finally, we additionally use Airflow, which we're finding to be a great tool: A) because it's got an integration with Qubole, B) it's got a nice UI, but C) it really allows us to handle dependencies well. Maybe there are three tasks, task A, task B, task C, and task B depends on task A. This allows us to say, "Okay, don't launch task B until task A is completed." The previous way of doing that was, "Okay, maybe we can use Data Pipeline," or maybe we were just running a single EC2 machine in the cloud, stacking crons on top of each other, and crossing our fingers that job A finishes before job B, so Airflow has been really nice.
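Airflow models this as a DAG; the ordering guarantee it gives can be sketched with the standard library's topological sort (the task names here are made up):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# task -> set of tasks it depends on: don't launch a task until its
# dependencies have completed, just as Airflow enforces.
deps = {
    "load_raw": set(),
    "build_rollup": {"load_raw"},
    "train_model": {"build_rollup"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['load_raw', 'build_rollup', 'train_model']
```

Airflow adds the rest on top of this ordering: scheduling, retries, sensors, and the UI for watching runs.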
Then one last benefit of this is that the data science team is actually able to productionize all of our models ourselves. At many other places, perhaps a data scientist is building a Python library or Python module and throwing that over to engineering, saying, "Hey, optimize this, turn it into C, maybe rewrite it in Java," and hoping it gets to production someday. That's not happening here.
We're building these models, and as soon as we put in a pull request and it gets approved, we can stick that exact Python function into Airflow, and we'll start deploying it on a daily basis moving forward. What does this look like in practice? Like I said, we're trying to build more of a data platform than just a data lake. Our job is to really provide this enhancement layer in between.
On the feature engineering team, our goal is to take that raw data, extract generally useful information from it, and then provide that useful information to downstream products and give access to downstream teams as well. I'll just walk through an example of what a project or pipeline may look like. Maybe at stage one we have some preprocessing. This is what I alluded to earlier: maybe we have some messy data coming out of computer vision, and we want to impute the correct prices, quantities, and categories using models that we've built.
The next stage would be to roll all that data up to the customer level. Maybe for user A, we'll look at their previous 90 days' or year's worth of data and calculate things like, "How many retailers have they been to? What's their total amount of spend?" But more importantly, we start to capture more predictive fields in a sparser format, like, "What percent of their time did they spend at Walmart?" versus "What percent of their money did they spend at Target?" versus "What is their share of wallet in the produce category versus the canned goods category?" We can make thousands and thousands, or tens of thousands, of these different attributes and store those in complex data types inside of Hive.
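A stripped-down sketch of that roll-up step in plain Python; the field names and numbers are invented, and the real job runs in Spark over Hive tables:

```python
from collections import defaultdict

transactions = [
    {"user": "a", "retailer": "Walmart", "amount": 30.0},
    {"user": "a", "retailer": "Target", "amount": 10.0},
    {"user": "b", "retailer": "Walmart", "amount": 5.0},
]

def rollup(txns):
    """One record per user: total spend plus share-of-spend by retailer."""
    spend = defaultdict(float)
    by_retailer = defaultdict(lambda: defaultdict(float))
    for t in txns:
        spend[t["user"]] += t["amount"]
        by_retailer[t["user"]][t["retailer"]] += t["amount"]
    out = {}
    for user, total in spend.items():
        shares = {r: amt / total for r, amt in by_retailer[user].items()}
        out[user] = {"total_spend": total, "share_of_spend": shares}
    return out

rolled = rollup(transactions)
print(rolled["a"]["share_of_spend"]["Walmart"])  # 0.75
```

Scaled out, the same shape of aggregation produces the thousands of sparse attributes per user mentioned above, stored in Hive's complex types (maps, structs).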
Once we have these roll-ups ready for machine learning algorithms, we can do something like combine clean transactions, build the transaction roll-up, and use some other data, maybe the imputed demographics, to make these category predictions: for every user, what's their likelihood to go out and buy every single category that we have in our database, on a daily basis?
If a user goes out and we see a new receipt from them, we'll categorize all that information, it will flow into their transaction roll-up, and we'll make a prediction for them the next day. That can be a stand-alone product that other teams can use, but it can also flow into downstream products. On the other side of the data science team, we have our recommendation engine team.
They're building large-scale collaborative filtering engines to basically say, "For the rebates we have in the app, how relevant are those to every single user we have in the app?" You can combine those user-and-offer relevancy scores with these user-and-category prediction scores to produce really powerful in-app recommendations.
That would be something like what you might see on Amazon as "You might like," or on Netflix as "You might be interested in this film next."
In order to build these types of products, there's obviously some infrastructure and technology along the way, the first of which is Qubole, and this enabled us to work on these types of projects. Many of us on the team were kind of used to working with Python, maybe a single EC2 machine, and Redshift.
It was kind of a steep learning curve moving from those systems up into distributed computing, not only learning the new languages, but more of a mind-shift in how we actually build products. Still, we were up to speed within the first month; we were pulling data from Hive and building products on top of it in Spark. Then there's this utility library that makes our day-to-day work really easy. The thought behind it is, there are things that we interact with on a daily basis, and we don't want to be writing redundant functions to do that.
So having a utility library that, say, interacts with the Qubole API or helps us do input and output with S3 is really helpful for us, and we make sure that we have some testing behind it. Again, as I alluded to, we use Airflow for our batch deployment process. One additional benefit to this is we can have dependencies within our team, but dependencies across teams too. Maybe Nate and his team are pumping data from MySQL into raw format and then eventually into Hive format; we can have a sensor so that once we see the data sitting in the Hive metastore, ready for analysis, that actually launches the jobs on our team to kick off the entire pipeline, which is really nice.
We do have some of these larger batch processes that may take a couple of hours or even half a day. Over the next quarters, we're going to be doing some research and maybe starting some development on streaming technology for that type of data.
Finally, as we hoped, we have transformed this data lake into more of a data platform. When you look at the life-cycle of that platform, it serves as a foundation for creating not only a data-driven business but a data-driven culture. You can see here, Nate's team is working at taking all these different data sources, ingesting them, and putting them inside the Hive metastore, and then it's our team's responsibility to build products on top of that raw data and add an additional layer of value.
So, with that in S3, we can leverage Qubole really hard, use Spark, Hive, and Airflow to build the products on top of it, and then expose those to different end points that are eventually consumed by our different users or customers. Maybe we put those back into S3 and Hive, but we can also ship those out to different data marts, push those down into our Redshift cluster so Looker can speak to them. This gives us the ability to have a platform that you can do really nice analytics on top of, and Heather is going to speak to how we get that information out to our end-users of the data.
Heather: You guys can hear me okay?
Heather: Okay, great. Nate talked a bit about the data lake and how we got all this mass of data, and Charlie talked about some of the analysis that we do on top of the data, and I'm going to talk about how we basically deploy that information to the masses. Here at Ibotta, the way we do that is through Looker. Looker is business intelligence software that allows you to explore, analyze, and then deploy data and business metrics, basically, to the masses.
This here is just a quick example of what we call a dashboard. It's just a collection of different metrics, and it's what our end-users see. Each of those metrics we basically call a tile, or data tile. At Ibotta, we have a pretty well-developed Looker environment: we have about 1,500 users, about 580 monthly active users, about 21,000 monthly visits, and we have 355 tables, or what Looker calls tables.
Tables can be of a couple of types. The first type is your traditional database table that we layer some Looker code on top of, turning the columns into dimensions and measures. The second type is what we call a persistent derived table, or PDT. Basically, you can think of it as a traditional database view: writing code on top of existing tables to create a different table with different measures and dimensions. Those tables power about 145 dashboards, and around 25 of those dashboards are for our external clients. At this point, Looker talks to Redshift in our environment.
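The PDT idea maps closely onto an ordinary SQL view, with the difference that a PDT's result set is materialized as a real table on a schedule. A sketch using sqlite3, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE redemptions (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO redemptions VALUES (?, ?)",
    [("a", 2.0), ("a", 3.0), ("b", 1.0)],
)

# A PDT is essentially this view's result set written back as a concrete
# table on a schedule; Looker then exposes its columns as dimensions
# and measures.
conn.execute("""
    CREATE VIEW user_redemptions AS
    SELECT user_id, COUNT(*) AS n_redemptions, SUM(amount) AS total_amount
    FROM redemptions
    GROUP BY user_id
""")

rows = conn.execute(
    "SELECT * FROM user_redemptions ORDER BY user_id"
).fetchall()
print(rows)  # [('a', 2, 5.0), ('b', 1, 1.0)]
```

The materialization is what makes dashboards fast, and also what makes an explosion of PDTs expensive, which comes up again below.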
That gives you an idea of our Looker landscape, but I want to talk a little bit about the benefits that we get from using Looker, both internal as well as external benefits. First, the external. The dashboards we provide our clients allow them to ensure that they are reaching the right customer at the right time, but also allow them to hone in on the customer whom they are trying to reach.
As Nate talked about earlier, it used to be that there were some Ibotta employees that spent hours a week manually collecting metrics, putting them in Excel, and sending them to clients. That's just not possible anymore with the sheer volume of partners that we have, and it took a ton of time even back then. Now, with our Looker dashboards, when we get a new partner, we can put some inputs in for their particular case, and then they have their own dashboard that they can go look at.
They can refresh it anytime they want, and it has a standard set of metrics for them to see how their campaign is running, for example. These dashboards are standard: we developed the dashboards once, and then for each partner we can just input the information that is relevant for them, ship it off, and they have it available.
It takes minimal time to deploy that, and in fact, John over there in the back has, in the last couple of weeks, completely automated this process, so when we get new partners, it's just the click of a button and it automatically sends it out. It was a ton of work for him, but it's saved hours and hours of time.
Internally at Ibotta, one of our core values is transparency. What this means, in one way, is that employees at all levels of the organization have insight into what is going on at the company as a whole. They know what the company's goals are, but more importantly, they know how we are performing against those goals. With Looker, we are able to provide dashboards that show these KPIs, so anybody at any time can go see how the company is performing.
There's no, "Let's wait till the metric is good and then share it with the company." It's, "Anytime we want to go see it, it's there for everybody." It also means that everybody is looking at the same metrics. There's no, "Well, our department wants to see it this way, and our department wants to see it that way." No, this is how we're doing, and everybody is going to see it this way.
The dashboards also enable us to not just ship our internal customers numbers and say, "Well, go figure it out and have fun with that." We actually present, and you saw it in the dashboard example, really great graphics, so they can quickly see whatever metric they're trying to measure and how it's doing. It provides them with actionable knowledge, something they can take and go do something with. Then, once they do something with it, they can come back and see, well, how did our action affect these same metrics?
Not only does Looker provide the functionality of these kinds of pre-created dashboards that are done by my team, it also allows any individual user to go and basically create their own view of the information. We've provided tables with dimensions and measures; they can go into the Looker UI and actually pull the dimensions and measures that are interesting to them.
Our team is not really bogged down with requests of, "Oh, I think you'd be interested to look at this." Teams can just go do it, and that allows for very self-sufficient teams. Now, these three points combined allow analytics to be a foundation in every department here. We're a data-driven organization, and everybody has access to the raw data; it's not the domain of just the engineering or just the analytics department, but of every employee at Ibotta.
Which means that we all have the ability, and thus the responsibility, to make data-driven decisions. There's no, "Well, I feel like this would be a good thing," or, "I have a gut feeling; I think this will work." You have data to back it up, and then you have data to back up whether it did or didn't work.
Up to this point, we've had Looker and Redshift, and it has served us really well, but as our client teams continue to grow, and their questions continue to grow, and even more importantly the sophistication of their questions continues to grow, we are starting to outgrow this implementation. One of the issues we're running into is that the dashboards are just taking too long to load.
There's a lot of data and a lot of dashboards, and all that just bogs down the system. Think about the PDTs that I mentioned earlier: since the questions are more sophisticated, we actually have to create more of these PDTs, because the data just isn't in the tables in the shape we need. With that, we're having a huge explosion of these PDTs, and every time one of them runs it puts a particular load on our resources, which is obviously increasing the need for more resources and really just becoming unsustainable.
Finally, the sheer number of dashboards has been getting difficult to manage. You can imagine that with 145 dashboards, there is definitely duplication of data out there. Also, it's hard to know what data is where, what metric is in what dashboard; even on our team, as the stewards of these dashboards, we don't always know what data is where. We have to go do a manual search, and that all takes time.
Finally, as we've included more and more data, we maybe haven't necessarily provided any more actionable knowledge, just more data for the sake of data, and we've actually maybe decreased clarity in some cases. So we're going to be doing a complete Looker redesign, from end to end. This is a huge project for our team; we're really in the planning stages of it, but the hope is that at the end we'll have a more streamlined Looker, and at the same time our internal customers will actually have better access to more data.
The way we are going to do this is to evolve our ETL jobs so that they create very wide tables. For a given measure we are going to have maybe 100 different dimensions, or for a customer row, we’ll have all the information about that particular customer. Then, we are going to use Presto under Looker to query that data. Since those expensive joins will already have been done in these wide tables, queries will return the data much quicker, decreasing the time to load the dashboards. Over on the UI side, we are going to decrease the number of dashboards. We are obviously going to continue to do our KPI dashboards, as I mentioned before; we want to keep that transparency and consistency.
We will probably still have some dashboards for SLT reporting. Then, because of the wide tables, our internal customers are going to have access to a lot more data. It’s going to be more raw data, so they’ll have more ability to slice and dice that data as appropriate for them. If they want to see data a certain way, it’s not going to have to be a request to our team to go do the pull; they’ll be able to do it themselves. Then finally, we are going to upgrade to Looker 5 and be able to use all of the new features that come with that version.
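To make the wide-table idea concrete, here is a minimal sketch in Python using the standard library’s sqlite3 as a stand-in for the real Presto/Hive stack; the table and column names are hypothetical, not Ibotta’s actual schema. The expensive join is done once, up front, in ETL, so dashboard-time queries become plain scans with no joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical narrow source tables, standing in for the ETL inputs.
cur.execute("CREATE TABLE customers (customer_id INTEGER, state TEXT)")
cur.execute("CREATE TABLE redemptions (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "CO"), (2, "CA")])
cur.executemany("INSERT INTO redemptions VALUES (?, ?)",
                [(1, 2.50), (1, 1.25), (2, 4.00)])

# The join happens once, in ETL: one wide row per customer carrying
# every dimension the dashboards might need.
cur.execute("""
    CREATE TABLE customer_wide AS
    SELECT c.customer_id,
           c.state,
           COUNT(r.amount) AS redemption_count,
           SUM(r.amount)   AS redemption_total
    FROM customers c
    LEFT JOIN redemptions r USING (customer_id)
    GROUP BY c.customer_id, c.state
""")

# Dashboard-time queries are then simple, join-free lookups.
row = cur.execute(
    "SELECT redemption_total FROM customer_wide WHERE customer_id = 1"
).fetchone()
print(row[0])  # 3.75
```

The design trade-off is classic denormalization: storage and ETL time are spent once so that the many interactive dashboard queries stay cheap.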
Nathan: Thanks, Heather and Charlie. I just want to wrap things up and point out some things I hope you can take away from this talk. Since we rolled out the data lake and the data team, we’ve been able to prove a 7% lift in revenue inside the app using our A/B testing framework. Guess where that framework lives? Nowadays the data team doesn’t have any scaling issues. We are not constantly fighting with Redshift in order to do more. We’re still supporting it, but a lot of the burden has gone away when it comes to scaling out.
We have also had a chance to tackle some really challenging, really interesting problems, like Charlie was talking about, and to build real cross-team collaboration. A lot of these changes weren’t just about a shift in technology or infrastructure; they were about setting up a real data team. It’s not an us versus them. It’s something that we work on together every day, and that’s been a benefit for me personally.
Then there’s the huge AWS bill issue; you probably saw what happened as our data was growing. With that though, we’re scaling smarter with Qubole. In scaling we’re going to need more storage and we are going to need more compute, but by going through Qubole we get a lot of the automation from them. We get a lot of the support, and that’s worth it. That’s about it for the presentation piece; we have some time for some Q&A. Who’ll ask our first question?
Male Participant: Are you hiring?
Nathan: Yes, We are hiring.
Nathan: We are hiring data scientists, platform engineers, data engineers. If you want to work on cool problems, come talk to one of us, absolutely.
Male Participant: Hey guys, quick question. From inception to completion how long would it take to completely transfer from a data warehouse or whatever?
Nathan: Maybe about 8 months. I can say that it’s never really done, but version 1 I think we are pretty much there. Still, some things we need to do but, yes.
Male Participant: Less than a year?
Nathan: Less than a year, yes. Totally, right around that.
Charlie: Even within the first month of getting onboarded with Qubole, we didn’t have a full infrastructure in place, but we were able to actually start building products on the data science team. Maybe not with the most optimal process, but just pushing data into S3 and running models with Hive and Spark. That was within the first month that we onboarded with Qubole.
Male Participant: Which process are you following to automate your current dashboards in Looker?
Heather: It’s a huge Python script.
John: Sorry, can you repeat the question?
Heather: What process are you following to automate the dashboard creation?
John: A lot of the data lives in MySQL, and we’re using an engine that we built in Python to grab that data out of MySQL, hit the Looker API, build all the access filters, and define what users have access to. All of that’s done through Python. Does that answer your question?
Male Participant: Yes.
Heather: Thanks, John.
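A hedged sketch of what an engine like John’s might look like: a pure function that builds a per-client dashboard body with access filters, keeping the actual API call separate. The field names are illustrative assumptions, not Ibotta’s real schema, and the endpoint in the comment is a placeholder rather than the exact Looker API shape.

```python
def build_dashboard_payload(client_name, client_id):
    """Build a per-client dashboard body. The access filter restricts
    the client to rows matching their own ID. Field names here are
    hypothetical, not Ibotta's actual schema."""
    return {
        "title": f"{client_name} Campaign Performance",
        "access_filters": [
            {"field": "campaigns.client_id", "value": str(client_id)},
        ],
    }

payload = build_dashboard_payload("Acme Brands", 42)

# In the real engine, each payload would be sent to the Looker API
# with an authenticated HTTP client, conceptually something like:
#   session.post(f"{looker_host}/api/dashboards", json=payload)
print(payload["access_filters"][0]["value"])  # prints 42
```

Keeping payload construction as a pure function makes the generator easy to unit-test without touching the live API.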
Male Participant: Regarding PySpark. I know that there are some limitations to the Spark API that come along with using PySpark as opposed to Scala, and I assume that you are just using spark-submit to submit to a cluster that Qubole runs. Have you run into any of those limitations? Are there any particular pain points, or is this something you haven’t considered yet?
Charlie: We do have a couple of people on our team that are using the Scala implementation, especially more on the platform engineering side. From the data science perspective, we haven’t really run into any major issues with PySpark. It suits us really well, since Python has a ton of good machine learning libraries, and we are able to do some non-distributed training and then deploy those models with Python. Many people will create a scikit-learn pipeline, train that in a non-distributed fashion, and then distribute that model object across every single one of the executors to make predictions at a much larger scale.
In terms of issues with PySpark, no, that’s not a problem for us, and we are not really looking to move away from Python to Scala for the majority of our use cases.
Male Participant: Why wouldn’t you want to distribute the learning?
Charlie: Yes, the question is why would you not want to distribute the machine learning algorithm. We found that if you use things like MLlib as compared to scikit-learn, you’ll actually get much higher– Typically we don’t need to train models on more than maybe a quarter or a half million observations, and we’ve seen much better results when training models in a non-distributed fashion.
The benefit of that is you can have one single object stored in S3. Let’s say you train on half a million observations but you want to make predictions on a billion rows; you can just load that object that’s sitting in S3, distribute it to all the executors that you have, and pipe data through it. That’s just from past experience; we’ve seen quite a significant dip in our evaluation metrics when we are using Spark ML.
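Charlie’s train-small, predict-big pattern can be sketched with nothing but the standard library. The model below is a toy threshold classifier standing in for a scikit-learn pipeline fit on ~500k observations, and the PySpark pieces are indicated only in comments, since they assume a live cluster.

```python
import pickle

# Stand-in for a model trained locally (e.g. a scikit-learn pipeline):
# a trivial classifier that thresholds at the training mean.
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, xs):
        return [1 if x >= self.threshold else 0 for x in xs]

train = [1.0, 2.0, 3.0, 4.0]
model = ThresholdModel(threshold=sum(train) / len(train))  # 2.5

# "Upload to S3": serialize the single trained object once.
blob = pickle.dumps(model)

def score_partition(rows):
    # Each executor deserializes the same blob once per partition,
    # then pipes its rows through the model. In PySpark this would be
    # the body of rdd.mapPartitions(score_partition).
    m = pickle.loads(blob)
    return m.predict(list(rows))

# Simulate two partitions of a much larger prediction set.
partitions = [[0.5, 3.1], [2.6, 1.9]]
preds = [p for part in partitions for p in score_partition(part)]
print(preds)  # [0, 1, 1, 0]
```

The key point is that training stays non-distributed while scoring parallelizes trivially, because the fitted model is just one small serialized object shipped to every worker.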
Male Participant: Hey guys, great job. You mentioned earlier that you have, on your roadmap, some use of real-time streaming analytics. Without disclosing too much of your private plans, so to speak, give me a little picture, if you can, of what a use case at Ibotta might look like for real-time data.
Charlie: Yes, I think I’d be the best person to describe this. Let’s say that we have a ton of data in a data store, maybe more like a key-value data store like DynamoDB, and we want that data accessible for us in Hive. Rather than unloading that entire data table from DynamoDB into Hive, maybe we’d have a Spark Streaming job that is able to stream just the diffs into that table rather than making a snapshot on a daily basis. I don’t know if that answers your question, but I think that’s one case.
Male Participant: A lot of our clients are running ad campaigns, do you want to wait till tomorrow to see results or do you want to start seeing them right now?
Charlie: A big case for us on the data science team could be real-time recommendations. Let’s say you go into the app at the beginning of the day and you want to see what’s in the app, and maybe unlock a couple of rebates, but you actually don’t go shopping until later in the day. Rather than making recommendations on a daily basis, we would like to take the most up-to-date information that we can and personalize the app based on your latest information.
That is very, very important especially if you are a new user. Let’s say you just opened the app; there are different onboarding flows depending on what segment you fall into, and being able to personalize that in real time could be a big advantage. We think that there could be some fruitful beginnings there.
Male Participant: I guess piggybacking on that a little bit: are you mostly doing batch learning right now, so you don’t really have an emphasis on online learning models as it stands?
Charlie: Yes, the nature of the stuff that we’re doing right now works really well for batch [jobs]. Typically we gather data on a daily basis, and then we can make the most up-to-date predictions the next day. That works really well for the data science team currently. Again, we only really started building these types of processes six to eight months ago. I think there most likely is a use case for streaming in the future. That’s what I alluded to; it will probably start with more of a research phase over the next quarter or two, and then we’ll see how that might fit into production in 2018.
Nate: What else?
Charlie: Cool, well it sounds like we have the fireside chat up next, so If there’s any questions that you guys think of, we can maybe discuss during that.
Andy: We’re going to use this as a launching-off point for more questions if you have anything, so please try to make this interactive and we’ll just– You’re looking for something.
Charlie: No. It’s good, I was seeing if there was another microphone. Now we have three.
Andy: Yes, what do you need? Come on man.
Andy: If you have questions, just jump on it. Let’s get started. I think we’re going to start with just talking, just going around the list of everybody, please introduce yourself and what do we want to do? What do we want to say? Quickly introduce yourself, who do you work for, and how you use big data, any highlights on what you learned along the way.
You’re making a face like that’s a brand new question, which is great because now we’re going to get a great answer.
Ron: That’s a broad question. My name is Ron White. I am the VP of engineering here at our host ibotta. I’ve been with the company about four years and I was the original newb who set up Redshift many years ago, and now I oversee a bunch of this crew here. I do wonder where is the fireside? Where is the fire? [crosstalk].
Andy: Do you really want me to start a fire? I’ll do it if you want me to.
Heather: I’m Heather Trujillo, and I’m on the business analytics team here at ibotta, specifically on the business intelligence team working in Looker. I’ve been here for five months, and, boy, what have I learned? Gosh, so much. Here at ibotta, I think what I’ve learned the most is what a great partnership we have between our customers and our data teams. We all work together, there’s no us-them environment, and that’s helped us to get to where we are now.
Nate: Hi, I’m Nate McIntyre. I’m a data engineering lead here at ibotta. I started back in March. What have I learned? Scaling this stuff out, it’s about more than technology. I’m an engineer, but I tend to think that the technological piece is really simple. It’s bringing people along with that technology which is the difficult piece, one that I overlooked at first, but yes, it’s something I’ve learned.
Ben: It’s a great story, right? These ibotta guys, they’re doing such great stuff. My name is Ben Roubicek, I’m a solutions architect for Qubole, which we’ve been talking about quite a bit here. I just joined the company about a month ago. Prior to that, I was actually working for a different customer company of Qubole’s as an engineering manager and an engineering architect. I did that for 15 years before this.
I really have enjoyed working for Qubole for the last month. One of the things that I’ve learned, going back over my career, is that this data lake stuff is actually really important. It’s important because without being enabled, without being empowered with data, it’s very hard to actually act; it’s hard to be future-focused. People get really bogged down with operations and with projects tied to operationalizing the data. The big thing that I’ve taken away has really been around flipping the model, so that teams can be focused on features, like these guys are here today, instead of on so much operations.
Charlie: Thanks. My name is Charlie Frazier. I sit on the data science team, specifically the feature engineering team. I was one of the first analysts here at ibotta, so I’ve learned everything that I do know here at ibotta, from Python to SQL, and now to the new big data tools. I think the biggest learning for me is that a lot of companies have a lot of data, but they have trouble finding the value inside of that data. Over the last eight months, we’ve really seen things come together. As Nate alluded to, it’s the combination of the work that we do on the feature engineering team leveraging the work on the recommendations team. We launched our first A/B test and saw a 7% lift in revenue inside of our app, and that’s huge value, well worth the AWS bills that we’re paying. That was a little about me.
Lucas: Thanks. I am Lucas Thelosen, I’m the VP of professional services at Looker. We’re out of Santa Cruz, but I’m in Boulder, CO, so close by. Five years ago, I was across the street here at Craftsy, and we were celebrating our 4-node vSAN cluster, and I was super excited.
Lucas: We celebrated our 8-node cluster, and with 12 nodes, we started to get worried. I have since helped a couple of companies like Uber or Snapchat or Signa, a couple of the companies that have some massive data opportunities. Looker is actually what I put into place at Craftsy for us to be able to switch databases without having to switch out the front end, the visualizations and the portals that we have for our customers or vendors to access and see how things are performing. I got really passionate about that data model, with Looker in between, making it possible to have flexible back ends. I’ve since joined Looker, so I’m excited to be here.
Andy: Cool, I have a couple of questions we need to get started. If anybody has anything they want to interrupt with, please do. Actually, this one was originally for John, but he’s not here, so Nathan. Nope? Nathan gets it, sorry. We were talking about this earlier anyway, but separating compute and storage. I think you touched a little bit on it in your presentation as well, but why has it been a big deal to you, and what has it enabled you to do?
Nate: It’s almost something I take for granted now. It makes a big difference in scaling. That is the main reason that we wanted to move away from Redshift as the data warehouse. We have found the need, at least, to scale storage much quicker than we have compute. Why pay for both when we only need to increase one? The other reason for separating the two, and this is something else I take for granted, is the ephemeral clusters that we get from Qubole. We’re not processing data 24 hours a day, seven days a week, so there’s no real reason to keep those compute resources up and pay for them. By separating those, we can power down the cluster and we don’t lose anything.
Andy: Cool. Ben, do you have anything to add to that?
Female Participant: How much did you save by switching Redshift?
Andy: Switching what?
Female Participant: How much did you save by switching from Redshift to Qubole dollar-wise?
Nate: I’m just an engineer.
Ron: It’s been a while since I did this calculation, but when we were initially looking to roll out the data lake, I calculated it based upon the storage in S3 versus the equivalent size we would need in Redshift, and it’s about one-fiftieth of the cost. Of course, we’ve still got a Redshift cluster running, so we’re still paying for that, but realistically, there are big savings.
Andy: If I remember right, you can’t spin Redshift up and down like you can Qubole, right?
Ron: No, not at all.
Andy: I might not know, I’m sorry.
Ron: It’s there, it’s not there. The bigger it gets the longer it takes to resize, the longer amount of time you’re in a read-only degraded state. It gets tougher to work with the bigger you get.
Andy: So, you’re committed to it there, where with Qubole or all of the AWS services you can kind of spin up right now. You had a question.
Female Participant: Does Qubole spin up Kafka clusters for streaming?
Nate: Can you repeat your question on the mic?
Female Participant: Does Qubole spin up a Kafka cluster for streaming?
Ben: Qubole currently does not support Kafka directly. The streaming support that we have today is with Spark Streaming, where you can attach to a Kafka cluster, stream data through your application, and perform analytics or ETL on that data.
Female Participant: You have to do your own Kafka clusters?
Ben: Yes. Can I add? On the storage versus compute thing, there’s the cost side of things, which is really compelling, but there’s also another way to look at this, which is performance. People often think, “Wow, I’m putting my data into the cloud, and it’s not node-local as it is on my Hadoop cluster,” and a lot of customers, rightly so, are concerned about performance. What I will say, though, is that in the last five years or so, columnar formats have come a long way. Formats like Parquet and ORC, combined with Snappy compression, have made tremendous improvements to what you can do in pulling your data out of S3.
Real quick, for people who may not understand what that means: when you store your data columnar, you’re storing all of the values of each column sequentially. That means that if you want to, let’s say, access the gender field, you can seek to a particular point of a file and read just a section of it. This is where Heather was mentioning having really, really wide tables; the reason you do that is because most queries don’t need 100 dimensions, they only need four, five, or six, and that’s only on smaller portions of the file. By doing so, you dramatically cut down on the amount of I/O that you’re pulling across the wire. Of course, if you’re using CSV or JSON, you’re not getting any of those benefits, but if you’re using columnar, you are, and it’s tremendous.
So, using S3 as a storage back end really does pay off when you’re using columnar formats beyond the initial ETL. It really makes this whole story pay off for customers that do the separation of storage and compute, and you can still put Presto and even Looker on top of data that’s sitting in S3, without any caching along the way, and still make it useful for customers.
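Ben’s point about columnar seeks can be made concrete with a toy sketch: two columns serialized contiguously with an offset index, so reading one column never touches the other column’s bytes. This is a loose illustration of the idea behind Parquet and ORC, not their actual file formats.

```python
import io
import struct

# Toy columnar file: all values of each column stored contiguously,
# with an offset index (Parquet/ORC keep similar metadata in a footer).
ages = [34, 27, 45]
incomes = [52000, 61000, 48000]

buf = io.BytesIO()
age_off = buf.tell()
for v in ages:
    buf.write(struct.pack("<i", v))
income_off = buf.tell()
for v in incomes:
    buf.write(struct.pack("<i", v))
offsets = {"age": (age_off, len(ages)), "income": (income_off, len(incomes))}

def read_column(name):
    # Seek straight to the one column we need; the other column's
    # bytes are never read, which is the I/O saving columnar gives
    # when pulling data out of S3 over the wire.
    off, n = offsets[name]
    buf.seek(off)
    return list(struct.unpack(f"<{n}i", buf.read(4 * n)))

print(read_column("income"))  # [52000, 61000, 48000]
```

In a row-oriented layout (like CSV or JSON lines), answering the same query would force a scan over every field of every row.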
Andy: Thanks. Lucas, I have you down as somebody to make a comment. What do you got? I thought you might be a little surprised, but I saw your name here, so, whatever.
Lucas: Okay. My comments on separating storage and compute?
Andy: You’re here, so–
Andy: You’re welcome to pass. You’re welcome to say, “Pass.”
Lucas: No. No, we have seen that problem a lot of times, where the data just grows with the company and, all of a sudden, you’re paying for compute that you may not need, or vice versa, you need more compute, and you don’t need all that storage. So, I think separating it just totally makes sense, and I think that’s where the majority of the market is going; a lot of the newer offerings are going with that.
Andy: You totally had a good answer.
Lucas: Yes, I wasn’t prepared, but–
Moderator: We got a question out here.
Male Participant: I was actually just curious, does ibotta use Parquet? Okay, yes. Thank you.
Charlie: The data science team uses Parquet. The data engineering team uses ORC.
Andy: I will tell you that I don’t know of any team that uses all the same data format, which is very strange, but–
Nate: They’re both good.
Andy: Yes, they’re both very good.
Charlie: [whispers] Parquet.
Andy: Yes, I think it’s a bit of a religious battle, so, whatever. This is sort of an open-ended question to whoever wants to answer it, but there’s this term data lake that comes out a lot, and I don’t know if a lot of people know what it means, including myself, but what does it mean to you, and how is it helping with what you’re doing?
Nate: I guess I’ll take the first whack at this one. It means a lot of different things to me. One of the key points is having a centralized place for all your data, so you’re not accessing this database or another one, using different types of SQL, whether it’s Hive versus standard SQL, or having to write custom jobs to access the data. It’s all in one place. Another key component is that the raw data is always available to you. You’re not seeing some end product that maybe has been rolled up or munged in some way; you always have access to that raw data. There’s a lot that comes with that: the data piece, the governance piece, the cataloging of data. All of that I think of as part of a data lake, but yes, who else?
Andy: Ron, you’re nodding your head crazily at the whole idea of having all the raw data.
Ron: Yes, I concur with what Nathan said. It really is all about having raw data, but very accessible, so you can get to whatever answer you need to find.
Andy: Is that mostly if you’re looking up new answers, or recovering in the cases of an issue, or all of the above?
Ron: Both, all of the above, but really the option and opportunity to answer questions you may not know you need to ask in the future.
Male Participant: If you don’t mind, I’ll chime in from the peanut gallery. One of the nice things about having a data lake is that your data is not only in one place, but it’s accessible by all the teams almost immediately. You’re not writing your data to Redshift and then, via SQL, having to transport data around. It really helps unify the teams and get them working in concert even faster.
Heather: From a reporting standpoint, it’s great that we can have data and go back as far as we need to go. We sometimes get questions of, “What was happening two years ago?”, and with the data lake we don’t have to worry about, “Oh, well, that data is archived somewhere, and it’s impossible to get it back.” It’s there, basically, as far back as we need to be able to see, and it comes back quickly.
Charlie: One big thing for us is the idea of write once, read anywhere. Once the data is sitting in S3 or Hive, we can use Spark on top of it, Hive on top of it, Presto on top of it, whatever we see fit. On our team, that’s especially powerful because we need to utilize all three of those technologies.
Andy: Well, Lucas, sort of paraphrasing the question a bit, but how do you deal with access controls for accessing the data in the data lake?
Lucas: Yes, that’s a little bit paraphrasing it. The question I had in mind was-
Andy: I’m sorry. I can ask the question the way you want me to.
Lucas: No, actually you had the question-
Andy: I’m going to go back and ask the question. The question is, we talked about the value of the data lake; how do you prevent it being something that only people in the know can use? I take that as access control, but whatever. Go ahead.
Lucas: No, access control is a good one too, along those lines. What I was thinking of more is governance. Ibotta here is a very transparent company, but some companies really are not. Let’s say a certain company wants to go public; there are things that you need to have in place in order to do that, so there’s no insider trading happening. There are reasons in healthcare where governance will help, and you need to lock down data lake access.
That’s really important, and along the lines of that, how do you get the value out of it really quickly for your end users? The data science guys, they love the data lake, and most people in the analytics world do, but then there’s also the true end user, like the head of marketing. They may not really know why they should care about the data lake. I think that’s where, as I mentioned earlier, it’s important to have this quick data model that you can put in between. On the dashboard you have this little red tile, and you want to know why it’s red; the dashboard shows you what is going on, but you want to see the why. You want to be able to drill down and go to the raw data. That’s why I’m so excited about having that data model in Looker, where you can drill down, you can go to the raw data, you can explore, you can figure out why it is happening. Sometimes your data engineering team is already on top of it and has it fully modeled out, and sometimes you can use a prototype at first, and you have that choice.
Andy: I totally did not paraphrase that, right? That’s a much better answer.
Lucas: Yes, and then you can, at the same time, with that, you can put access filters and whatnot and all kinds of layers, if you need to, but luckily there’s tons of cool companies out there that want their people to explore everything.
Andy: Fair enough. Withdrawn. Actually, this next one is interesting. I think Nathan and I talked about this a little bit too, but how do you get people that have been using things like Redshift, when it kind of feels good enough to them, to start using some of the big data technologies? It’s not going to be an immediate benefit, but there’s going to be benefit pretty quickly, and they’re going to leapfrog where they’re at right now. How do you get them to buy into that?
Ben: I’ll go first. I’ve only been on a month, but I have worked with quite a few customers already, and most of the companies that we see are feeling pain in some way. They’re having some problems with access, or with slowness of projects getting done, or just general confusion about what they have access to and what they don’t. The approach that we take, and we think works really well, is really to do a use-case focus. When you look at on-prem, or you look at bigger installations like Redshift, you’re really trying to build the biggest hammer and put that in place so you can whack all the nails.
In the cloud, it’s just not that way. The data lake is the most efficient use of storage that we know about, and your clusters, your compute, can be tuned to your use case. A lot of customers start with the pain, and then the solution is, “Well, let’s not just re-architect everything. Let’s pick a use case, because we can be efficient at that.” The data lake architecture and the separation of storage and compute enable you to pick and choose use cases without necessarily just torching the farm. That’s a really powerful way to start to leverage data lake architectures and Qubole, or just ephemeral clusters in general, and to start to extract value out of your data.
Andy: Heather, you’re nodding like crazy.
Heather: Yes. I can speak from the perspective of a person that did not want to shift from Redshift.
Heather: One of the things that I think made a huge difference here, was having folks like Charlie and Nate who basically evangelized using these different tools. They’re passionate about it, they know it well, but also they’re willing to teach the rest of us. They’ve spent a lot of time answering questions in the Slack channel, meeting with us individually, and I think that is a huge benefit. We have people that are willing to help. It just seems less scary knowing that if you don’t have all the answers right away, there’s somebody that will have many of them. I know the folks at Qubole also have been super helpful in making the transition.
Ron: From my perspective, at my level, it was actually incredibly easy. We’ve got good leadership. Our CTO went to re:Invent last year, and there was a lot of talk about the data lake. He came back, and he knew we had a lot of data that we just didn’t really know what to do with, that we were just dropping on the floor, literally just setting it aside, not making any use of it. He said to me, “We need this thing, a data lake. We need a data lake; we need to move forward with that,” and I went and built out a team to go build that. I had that mandate: “We know we need this.” We made great use of it with Charlie’s team, Heather’s work, and Nate putting it together, and there was no doubt afterwards.
Andy: The next question I have is around going from a prototype to a production system, this actually came up when we were talking about this before. It seems like it’s super easy on things like Qubole and all, to play with an idea and come up with something good. How do you transition that into a production system? Nathan’s nodding his head, what can you say about that?
Nathan: I’m nodding my head thinking, “Charlie is great for answering this”.
Andy: Yes, I meant Charlie.
Charlie: Yes, that’s a really good question. I think the biggest thing for us is Airflow; that is the biggest piece for us here. We’re able to do our scratch work all on Qubole. Since you can spin up clusters very easily in Qubole, we can have a dev cluster and a prod cluster. We can do all of our scratch work in our own database that’s inside of Qubole, not quite using our own clusters yet, but sharing clusters among the team. We’re able to basically do a bunch of scratch work, test the whole pipeline in staging, and then, when we’re comfortable with a process and ready to move it to production, for us it’s as easy as wrapping that entire pipeline inside of a Python function, and then wrapping that pipeline function inside of a PythonOperator. Now that is a task inside of Airflow.
One challenge that we’re working through right now is how to come up with a staging Airflow instance, rather than just going from, “Here’s this thing,” to putting it right into production. We would like to get to a world where we have a staging Airflow instance, so that when we put something into production, it’s not running alongside the hundred or couple hundred jobs that we have. It would be great to build a DAG that just has the process that we’re looking to get into production, and maybe what that depends on, stage it for a few days, make sure everything works, and then put it into the actual prod. We’d have true dev, dev-prod, and hopefully eventually real prod. [laughs] If that makes sense?
Andy: It does, but it sounds like for you largely the whole Airflow component has been very helpful for moving from development to production.
Charlie: Yes, Airflow is great, and it’s really flexible too. You don’t have to go all in on Airflow and say, “We’re only doing things with Airflow.” The way our team uses it, we basically develop everything in Python. We can turn a Python function directly into a PythonOperator, and Airflow will deploy it that way. Maybe a new technology comes out that’s better than Airflow, or we decide for some reason Airflow doesn’t have a specific functionality that we need. We don’t have a huge amount of tech debt; we still have all of these Python modules that we’re able to easily transfer over to a different job scheduling platform.
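A minimal sketch of the wrapping step Charlie describes: the pipeline is a plain Python function with no Airflow dependency, and the Airflow layer (shown only in comments, since it assumes an Airflow installation) is thin. The function body, task ID, and DAG name are hypothetical.

```python
def nightly_feature_pipeline():
    """Hypothetical pipeline body: extract, transform, load."""
    raw = [3, 1, 2]            # stand-in for an extract step
    transformed = sorted(raw)  # stand-in for a transform step
    return transformed         # stand-in for a load step

# Wrapping it for Airflow is then roughly one operator definition:
#
#   from airflow import DAG
#   from airflow.operators.python import PythonOperator
#
#   with DAG("nightly_features", schedule="@daily") as dag:
#       PythonOperator(
#           task_id="feature_pipeline",
#           python_callable=nightly_feature_pipeline,
#       )
#
# Because the callable is plain Python, swapping schedulers later
# means rewriting only this wrapper, not the pipeline modules.
print(nightly_feature_pipeline())  # [1, 2, 3]
```

This is what keeps the tech debt low: the pipeline logic is testable on its own, and Airflow is just the deployment shell around it.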
Andy: I have three more questions here. Are we still doing good on time?
Ben: I was going to have one more thing to that one.
Andy: I’m sorry. Yes, I mean we have three and a half questions left, because he’s going to have to finish answering this one.
Ben: What I would say is that the dev-test-prod cycle for every company is extremely different. For most people, though, it’s a function of mitigating risk. The more you’re able to consider things like A/B testing or blue-green deployment, things where you have a lot more control around the exposure of data and features and learnings, the faster you’re able to get things turned around. By the nature of running an A/B test, you’re able to get really early warning feedback about what’s going on. Charlie explained how he’s using Qubole and Airflow for that, but I would say, from my perspective, and just from an architecture perspective, the most successful teams that I’ve seen so far that are using data and machine learning are really taking A/B testing to heart.
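As a concrete instance of the A/B readout Ben mentions, here is a minimal lift calculation in Python. The per-user revenue numbers are invented for illustration and are not Ibotta’s data; the 7% result here only coincidentally matches the figure quoted earlier in the talk.

```python
def revenue_lift(control_revenue, treatment_revenue):
    """Relative lift of the treatment arm over the control arm."""
    return (treatment_revenue - control_revenue) / control_revenue

# Hypothetical per-user revenue totals for each arm of a test.
control = [2.0, 3.0, 2.5, 2.5]
treatment = [2.5, 3.2, 2.7, 2.3]

lift = revenue_lift(sum(control) / len(control),
                    sum(treatment) / len(treatment))
print(f"{lift:.1%}")  # prints 7.0%
```

A production framework would add randomized assignment and a significance test on top of this, but the headline metric is just this ratio of arm means.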
Andy: Cool, thank you. Lucas, I was going to ask, have you seen companies succeed in getting business users to use big data?
Lucas: Yes, that comes back to the idea of starting with the dashboard that’s performant, and then being able to drill in and not be lost. I think the challenge with the data lake, potentially, is where does the business user who doesn’t know all about it go? Having a really great experience where they can go and explore and ask questions themselves, questions that the engineering team may not have thought of yet, or the analyst team may not have thought of. I think that’s really fantastic, where they really see the value and are willing to invest more into data infrastructure, if they really get that.
I think Ibotta is in a really lucky space compared to some other companies that I’m seeing, where you really have leadership bought in, like, “We are investing in this. We want to do this.” You have the goals literally on the wall, which is amazing. I don’t see that very often. Lots of leaders at companies say they want to be data-driven, but to really get there, they have to push. I think getting executives, getting people who aren’t naturally data people, very quick access to the questions they want to answer, that’s really key.
Andy: Heather, you’re responsible for the dashboards and a lot of the data. Do you find this to be helpful, the idea of getting it in front of the business users?
Heather: Absolutely, yes. People are naturally curious, and here especially, people are super curious. For them to be able to just go and try to answer some of their questions, or come up with questions and then try to answer them, and only lean on our team when really more in-depth information is needed, or when they want to disseminate that information. It allows for a small business intelligence team, because we don’t have analysts going off on wild goose chases of, “Oh, this person wants to know what this is,” just for their own knowledge. We don’t have to do that.
Charlie: Yes, just one thing to add there. Just because you have a data lake doesn’t mean that’s your only source of data. For instance, especially when it comes to reporting, we may be able to give some users access to the data lake in Looker, but maybe on the back end we’re able to roll up these aggregate tables in batch using Hive or Spark, and then put those into something like Redshift, or even something more transactional like MySQL, that can be very performant in that setting. Just because you’ve got a data lake doesn’t mean you don’t have any other data marts. At Ibotta, we think about it as the right tool, or the right database, for the right job. We can have the data lake and batch processes doing the heavy lifting, but put the result in a more performant or user-friendly database before it’s delivered to the end consumer.
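A toy version of the rollup pattern Charlie describes: aggregate in batch, then load the much smaller result into a transactional store for fast dashboard queries. Here sqlite stands in for Redshift or MySQL, and the events are invented; in practice the aggregation itself would run in Hive or Spark over the data lake:

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw purchase events as they might sit in the data lake.
events = [
    ("2017-09-01", "milk", 2.50),
    ("2017-09-01", "milk", 2.50),
    ("2017-09-01", "bread", 1.75),
]

# Batch rollup: total spend per product per day.
rollup = defaultdict(float)
for day, product, amount in events:
    rollup[(day, product)] += amount

# Load the aggregate into a small, fast store for reporting queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_spend (day TEXT, product TEXT, total REAL)")
conn.executemany(
    "INSERT INTO daily_spend VALUES (?, ?, ?)",
    [(d, p, t) for (d, p), t in rollup.items()],
)

# Dashboards now hit the tiny aggregate table, not the raw events.
milk_total = conn.execute(
    "SELECT total FROM daily_spend WHERE product = 'milk'"
).fetchone()[0]
```

The serving store only ever sees the aggregate, so query latency stays flat no matter how large the raw event history grows.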
Andy: Do you consider your data lake more the system of record for everything then? Just like the base of it, and then you spin other things off of it?
Charlie: Nate can probably answer this better, but a large reason why we built the data lake was to have a centralized source of truth for data. Previously we had many different data sources living in many different places, and if you have dashboards built by different people or different teams, you might have different metrics driving different decisions across the company, which could be very detrimental.
Lucas: To that point about performance, or schema-on-write, whatever you said, that was really good.
Lucas: But, yes, having super performant dashboards maybe running on Postgres, and then being able to drill down and jump over to Presto, I think that’s really cool. For us, 20 seconds on a query is fine. For some business users, that’s probably not okay. So I think that’s what I’ve seen too, where you want to be able to jump between data sources and databases.
Andy: Cool, thanks. I think we touched on this a little bit already but, machine learning at ibotta. What are you doing, and what’s been the interesting areas for products for machine learning?
Charlie: Yes, it’s kind of enabled us to build products that completely transform the company. I think we officially started the data science team at the beginning of this year, and we’re already up past 12 people on that team, so obviously a heavy, heavy emphasis on machine learning and data science, and it’s almost paramount to reaching some of the goals of the company. It’s involved in almost every single product that we build these days. I think the biggest mindset shift was going from one-to-one deterministic answers to a more probabilistic nature: what is the probability that this may happen in the future? The essence of data science and machine learning problems is that they’re never fully solvable. How do you get that 80% solution in the quickest amount of time, and then move on to the next product where you might be able to add more value?
I talked a bit about it before, but having predicted values for everything that we might find useful. What are a user’s demographics? How likely are they to purchase every single brand on every single day? The same thing for every single product category. Building that data enhancement layer, and then building products on top of it, like the recommendation system. A huge push for Ibotta is one-to-one personalization inside of the app, and that’s not possible without machine learning. One of the challenges with that is, you could see the data growth at Ibotta, and that’s largely due to some of the work that we’re doing. How do you take the output of some of these machine learning models? Maybe you have billions and billions of recommendation scores across different A/B tests; how do you actually serve those inside of the app? Those are some of the scaling challenges that we’re working with now.
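One standard way to tame the serving problem Charlie raises, billions of score rows, is to keep only the top few recommendations per user before loading them into a serving store. A sketch under assumed data shapes, not Ibotta’s actual pipeline:

```python
import heapq
from collections import defaultdict

def top_k_per_user(scores, k=3):
    """Reduce (user, item, score) rows to each user's k best items.

    Uses a size-k min-heap per user, so memory stays proportional to
    users * k rather than to the full score table.
    """
    best = defaultdict(list)  # user -> min-heap of (score, item)
    for user, item, score in scores:
        heap = best[user]
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            # New score beats the worst retained item; swap it in.
            heapq.heapreplace(heap, (score, item))
    # Emit each user's items, best score first.
    return {
        user: [item for _, item in sorted(heap, reverse=True)]
        for user, heap in best.items()
    }
```

The truncated table is small enough to bulk-load into a key-value or relational store that the app can query at request time.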
Andy: Cool, thank you. Just to round it up, can we go around the folks up here: what’s one big challenge or one big obstacle you’ve run into, please be honest, around big data that you’ve incorporated into what you’re doing? What’s one thing somebody can take away from what you’ve done, so they hopefully don’t make the same mistake?
Lucas: So this one, I’ve seen this now a bunch of times, and I was surprised by it. I’m not sure if it’s shared by people here. I think one of the challenges I’ve seen in the data space is that we, and I don’t feel like we all need to be included in that ‘we’, don’t think of ourselves as product managers, per se. Especially on the analytics front, we’re not presenting a roadmap of how we get there. We have a vision, we want to be data-driven, but how are we going to get there? What will be achieved when? Make sure all the stakeholders are informed of when things will happen, and of delays, and so on. All these classic product things, I think, are sometimes missing in the data space, and I think it’s an important mindset shift, because nobody will have the perfect data platform stack tomorrow. It’s a journey to get there, and it’s probably never-ending, as we talked about earlier. So I think it’s just super important that if you are in charge, if you are part of a team, you become a product owner to some extent, and own data that way.
Charlie: Yes, I think for our team, I’m going to go back to a point that Nate made: it’s not scaling the technologies or the tools that we’re using, but really scaling the teams. How do you scale the different teams at the right pace at the right time? For instance, at the beginning of the year there was no data science team, there was no data lake, there was no data engineering team, and we’ve probably grown the data science team much more aggressively than the data engineering team. Their backlog is pretty big, so you have to figure out how those two teams interact. On the hiring front, before we had any of these processes in place, it was hard to find data scientists and data engineers that wanted to come work here, but now that we have the ball rolling, people are really excited to come work on really challenging problems at scale, and build from the bottom up. They don’t have to come into a company working with a bunch of legacy tech and working around that. We’re building everything from start to finish the way that we think is best, and getting to use the tools that we think are best for the job. We’re not really restricted in how we approach a problem or what tools we need to succeed on a project.
Andy: PS, you’re hiring.
Charlie: PS, we are hiring platform engineers, data engineers, data scientists, data analysts-
Moderator: Mobile engineers.
Charlie: -mobile engineers, Android engineers.
Andy: If you’re looking for a job, come talk to these people.
Charlie: Yes, if you like data, or apps, come talk to us afterwards.
Andy: Just go talk to them.
Ben: Two years ago, I was a pretty cocky engineer who ran a 10-person team that was chartered with building a data lake and making it available. I built it myself, pretty much, with the help of my team. About a month later, one of my team members comes and says, “Hey, there’s these Qubole guys. They do everything we do, and they do it really, really well.” I really had to eat crow on that in front of a lot of people, and that part wasn’t really fun. That’s a sad story. What I would say is that this big data stuff is really hard. There’s no doubt about it. Most people kind of understand what S3 does, and it’s pretty well documented. But when it comes to the clusters and the compute side of the equation, when you include machine learning, and ETL, and Spark, and Hive, and distributed processing and all these different things, it’s freaking hard. There’s a lot to consider.
We signed up for Qubole right away. It wasn’t bad, but honestly, we probably would have signed up for any company that was doing this kind of stuff, because it was so much better than what we had built internally. That’s not to say you should marginalize your own internal operations, but if you can really focus your operations on what’s core to you, it’s a better use of your time and effort. It was very humbling for me, and I’m sure it’s humbling for a lot of people that end up spending a lot of time re-architecting systems. I know a lot of senior engineers, there’s no getting away from it, you are re-architecting systems. I think there is certainly room for partners to help with that, especially on the operations front. I know it fits in with what Qubole is doing, but that’s why I joined the company, because I love solving this operations stuff. Anyway, that’s it. Thanks.
Nate: Yes, I think you can’t leave people behind. I think with big data, the technical challenge, it’s solvable these days. People know what the solution is, but not everyone fully embraces change and you can’t just migrate a system from underneath people and expect them to just jump on board. I think that’s something I’ve learned, you have to account for that, you have to make yourself available. Yes, and it takes work and it’s not just a technical migration, it’s an organizational one.
Heather: For me, I think it’s not letting perfect be the enemy of good. This is really tough stuff, like Ben said, and you’re probably not going to get it right at first, but that doesn’t mean you don’t move forward. You move forward with the pieces that you know, do the best you can, iterate off of that, and eventually you have the solution you need. Just keep pushing forward on that journey; something is better than nothing in a lot of cases.
Ron: I would say, quoting Field of Dreams, which is not necessarily one of my favorite movies-
Ron: -actually, I’ve only seen it once, I think, but, “Build it and they will come”. We put some data together, made it relatively accessible in Redshift, exposed it through Looker, and I had no idea how much people would access it, dig into it, dive in and find great information, and give us great guidance on where to go as a company. We went forward from there and started building this data lake, put things into Hive, made data accessible through Qubole, and the demand was insatiable. Did I mention we’re hiring? Data engineers.
Andy: I’m not sure anybody has [crosstalk].
Ron: Yes, well we are. We are hiring data engineers.
Andy: That’s a surprise, but anyway, it’s your thing.
Ron: So yes, you have the data, you make it accessible, and there will be great demand, but at the same time I don’t want to discredit what the analytics side of the house has done. I’ve heard mention that there are companies out there with big data, lots of data, and they don’t know what to do with it. We have a great, phenomenal team that does know what to do with it, and they are doing great things with it. So yes, get the data, put it in an accessible place, and build a team that knows how to use it.
Andy: Great, thank you all, thank you for your time. If anybody has questions, we can turn it over there, but if not, thank you all.