Data Lake and Data Warehouse- Collision or Synergies

August 26, 2020 | Updated March 21st, 2024

As the volume, velocity, and variety of data increase, the choice of the right data platform to manage data has never felt more important. Should it be the venerable data warehouse that has served our needs until now, or should it be the data lake that promises support for any kind of data for any kind of workload?

My guest in this episode of the Open Data Lake Talks is John Riewerts, VP of Engineering at Acxiom, a giant in consumer marketing. He is a strategic technologist who has successfully built Acxiom’s data platform, now in use at thousands of organizations across multiple industries, to deliver personalization, audience management, and engagement. Throughout an extensive technology and leadership career, John has been committed to architecting and delivering innovative platforms for analytics and machine learning. He believes that technology solutions should aim to meet the needs of customers in the best possible way while balancing TCO and longevity.

Listen to this episode as John shares his experience building the data platform at Acxiom and tips on how to choose between a data warehouse and a data lake for your data workloads.

Let the use case drive the data platform. Invest in a best-of-breed solution

John explains that use cases should drive the data platform architecture. He believes in a best-of-breed solution that relies on multiple technologies, including a data warehouse and a data lake. Ultimately, his choice balances the complexity and TCO of managing multiple technologies against the ability to run a larger variety of workloads in a performant and cost-effective manner.

Choosing between a data warehouse and a data lake for a use case

John says that you should let the use case determine the platform rather than the reverse. If your use case needs the speed, has a known data model, and is fully structured or pretty close to it, then a SQL data warehouse will suffice. However, if you need just-in-time flexibility to model your data and use it for multiple workloads, you should use a data lake.

Data processing best practices

John distills these key points from his vast experience. Try to minimize the impact of the three slowest things in your data platform: people, network, and disk operations. While people can never be as fast as computers, he describes the other two as physics problems. To reduce their impact, avoid duplicating data all over the place, and invest in the platform’s ability to read and process data from different locations, including transactional systems, pub/sub systems, and data warehouse systems, without having to move that data. Finally, John recommends the use of data processing engines such as Apache Spark to build data pipelines (see the sketch after the list below). John’s principles for building a modern data platform are:

  • Keep it simple. Don’t over architect or over-engineer it.
  • Use the right tool for the right job.
  • Let the use case determine what you should be using.
  • Use the cloud to scale.
  • Separate data from context. This will enable the use of data for multiple use cases.
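
To make the “get the data in first, model it later” approach concrete, here is a minimal PySpark sketch of that pattern: land raw feeds in the lake without committing to a use case, then derive a modeled view only when a use case arrives. The bucket paths and column names are illustrative assumptions, not Acxiom’s actual pipeline.

```python
# A minimal sketch of the "land the data first, model it later" pattern.
# All paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-then-model").getOrCreate()

# Step 1: land raw feeds in the lake without committing to a schema or use case.
raw_clicks = spark.read.json("s3://example-bucket/landing/clickstream/")
raw_orders = spark.read.csv("s3://example-bucket/landing/orders/", header=True)

raw_clicks.write.mode("append").parquet("s3://example-bucket/lake/raw/clickstream/")
raw_orders.write.mode("append").parquet("s3://example-bucket/lake/raw/orders/")

# Step 2: only when a use case arrives, derive a modeled view for it.
clicks = spark.read.parquet("s3://example-bucket/lake/raw/clickstream/")
orders = spark.read.parquet("s3://example-bucket/lake/raw/orders/")

conversions = (clicks.join(orders, "customer_id")
               .groupBy("campaign_id")
               .agg(F.countDistinct("order_id").alias("orders"),
                    F.count("*").alias("clicks")))

# The modeled aggregate can then be pushed to a warehouse or BI tool as needed.
conversions.write.mode("overwrite").parquet(
    "s3://example-bucket/lake/marts/campaign_conversions/")
```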

Predicting the future

And finally, John shares his predictions for the future. He believes that we are at a point now where we will be able to use data to not only review the past but understand the present and even predict the future. The data and tools will continuously evolve to help us get there in almost real-time.

Listen to my complete conversation with John as he shares some real-life stories on how he built a modern data platform for workloads ranging from Analytics to Machine Learning.


DATA LAKE AND DATA WAREHOUSE- COLLISION OR SYNERGIES

A conversation with John Riewerts, VP Engineering, Acxiom

Utpal Bhatt (00:00):

Good morning everyone, good afternoon, depending on where you are in the time zone. My name is Utpal Bhatt, and I’m the SVP of marketing at Qubole, and we welcome you to our panel today, Data Lake and Data Warehouse: Collision or Synergy. First of all, a little bit about Qubole. Qubole provides an open data lake platform to accelerate your data engineering, analytics, and machine learning initiatives.

Utpal Bhatt (00:25):

Qubole’s end-to-end platform radically reduces the administrative effort, and deployment time for various types of primarily data lake workloads, while our patented technology automatically cuts down the cloud costs for data processing by almost 50%, compared to the alternatives. Qubole is used by over 200 customers, including Adobe, and Acxiom, and today I’m joined here by John Riewerts, VP of Engineering at Acxiom. Welcome, John. We are very happy to count him as one of our customers, and we are out here today, to debate a very exciting topic. Before that, John, can you tell us a little bit about Acxiom, and yourself?

John Riewerts (01:18):

Yeah, absolutely. First, I want to say I appreciate the opportunity from you, and Qubole, to engage in what I agree is a very interesting topic in our industry. I’ll start out by introducing myself: I’m the VP of Engineering at Acxiom, focusing on some very, very cool problems that we aim to solve at Acxiom. So, a quick little bit about Acxiom… Let me share my screen just real fast. Can you see the screen okay?

Utpal Bhatt (02:00):

Yep, we can see the screen.

John Riewerts (02:02):

Okay. So, I’ve got a couple of pretty marketing slides, you know that marketing would do this, no engineer would put such cool icons, and colors on such a screen.

John Riewerts (02:23):

So who is Acxiom? We help the world’s largest brands understand consumers, unify marketing, and enable experiences that matter, using data for good. We are a consumer-driven company. We focus on engagement with our consumers, alongside our clients, and ultimately how to use data in a good, ethical way, across a couple of major talk tracks, if you will.

John Riewerts (03:01):

First and foremost personalization. Actually, in no particular order, focusing on personalization, the ability to recognize all consumers, in all different stages in the marketing ecosystem, and how we can have a really personalized engagement with any consumer along their journey, in real-time. Activation and engagement, as I mentioned, we’re talking about reaching real people with the appropriate messages, at the appropriate time.

And third, data-driven decision making. The combination of those three is what I see as some unique challenges that actually get right into the very discussion point that we’re talking about today. It goes without saying that we live in a data-driven world, and we live in a world where data is not going away anytime soon. It’s exponentially getting bigger, and bigger, and bigger, and bigger. And so, how best do we use that for really the benefit of our clients, and the benefit of the consumers that they’re trying to reach?

John Riewerts (04:18):

A quick little bit about… I think it’s worth it, I put in this slide just for a quick little unique value add that I personally get the pleasure of doing, which is seeing use cases across lots of different industries. So personally, this is pretty fun for me, because I get to see what happens in financial services. I get to see how we might leverage different use cases within auto, telecom, CPG, or airlines, you name it. And so, insurance, you name it. So it’s a fun, unique position that I personally sit in, and that Acxiom as a whole sits in, that we really get to play, and help our clients with a lot of different data use cases. We’re a part of a larger global IPG family, focusing on our clients’ brand messages to those consumers.

Utpal Bhatt (05:31):

Hi, great, John. Thanks for that introduction. And this is exactly why this is such a great topic for you to talk to our listeners about, because not only is Acxiom supplying the data platform to a number of industries, and a number of your customers, but this architecture pattern that you have at Acxiom is getting used across a broad spectrum of industries, which makes this a very rich set of experience that you can bring to bear.

Utpal Bhatt (06:06):

Thank you once again. So, before we get started, just a quick logistics. We have, in order to make this session interactive, a number of polls that we will be bringing up during the course of this 45-minute conversation. And, I also encourage everyone to use the chat option. We have folks on the Qubole team ready to answer any questions you may have. We will also curate some of those questions, and then save some time for John to address those questions directly. A quick test on the poll, can we… The first question is, now yeah, let’s see, which continent are you watching this webinar from? If you can take a few seconds to select the option that matches where you are physically located, that would be great.

Utpal Bhatt (07:03):

All right, I’m going to close the poll in 3, 2, 1. Okay, there you go. Okay, we have a predominantly North American audience, no surprise. This is primarily because of the time zone here. Anyways, so let’s introduce the topic. This is near and dear to my heart. Over the years we have learned how to collect all sorts of data, and today we are collecting structured, semi-structured, and even unstructured data.

Utpal Bhatt (07:32):

We have learned how to store the data that we are collecting in real-time. Now, the next frontier is how do we start using this data, and getting the most out of it? Whether we are going to use this data for historical analysis, for discovering things in real-time, or even for predicting the future. And, this is where John and I will talk a little bit about what are the right data platform choices to help you make the most out of the data that you’re collecting?

Utpal Bhatt (08:11):

Once again, we’ll make this very interactive. So John, let’s jump right into this topic. But first, a little bit of your journey at Acxiom. Tell us a little about what architecture you started with? And, at different forks, what were the decision points that led you to the architecture that you currently have at Acxiom?

John Riewerts (08:37):

Yeah, for sure. Good question. Acxiom started with its roots in what I would imagine a lot of companies started their roots in, starting with SQL-based data warehouses, and those really solved a good problem, and they had a really good, independent life for some time. I’ve got an interesting slide, for those who are familiar with the ad tech and martech spaces, you will be aware of this diagram.

John Riewerts (09:16):

And so, it’s an interesting slide, because it shows the marketing technology landscape. Now, what’s interesting about this, to me, is exactly what you just said. There’s an ever-growing list of tools that we continue to see, and our industry is no different. Earlier on, a SQL database might have sufficed for the vast majority of use cases. But as we’ve continued to evolve, the reality is this was actually the marketing technology landscape in 2011.

John Riewerts (10:02):

From there, we’ve grown from 150 to 350, to 1,000, to 2,000, to 3,500, to 7,000 different companies really vying for this large market. Because, at the end of the day, this is what companies use to help get their message out, and ultimately have the return on any kind of investment made. And really, where we are sitting at today, is over 8,000 different engagement points from a technology perspective, that any one CMO could look at, and go, “Okay, how am I going to solve my use cases, for the particular needs that I have?”

John Riewerts (10:55):

What that creates is a fun, interesting little technology challenge that we have, which is to say, a SQL database that probably a lot of us have started with, is ultimately not the answer, and not the one tool that solves all of this large landscape. And in reality, we really need multiple tools. And so, I personally believe that there is no one tool to solve the entire world of problems. And we also have to weigh that against the… We don’t want to bring in every tool for every little unique use case.

John Riewerts (11:43):

And so, it’s an interesting dichotomy to live in, because innovation in this realm is outpacing consolidation. We see this across, whether it’s advertising and promotion, social relationship, commerce, you name it. The management platforms, the DMPs, DSPs, it’s a massive ecosystem that we’re dealing with. And so, how do our clients ultimately solve the right problems, with the right data, at the right time, across multiple channels becomes a challenge. And, a challenge that supersedes just a SQL database.

John Riewerts (12:33):

And so, this is where you asked how the architecture evolved? I mean, just the TLDR for Acxiom’s story is that, in order to enable and activate our clients’ use cases, we recognized that we needed to build a best of breed set of technologies, to ultimately solve independent problems, that I know we’ll get more deeply into. But, whether that could be that, “Hey, we need a place to start with. Just how do I get data into a consolidated view?”

John Riewerts (13:09):

Forget the actual structure of the data, forget the intent, and purpose of the data. How do I deal with the physics problem of getting the data into some kind of central, logical location, so that I can begin to data model off of that, and activate different use cases that may then go into different technologies, such as document stores, pub/sub systems, API-driven systems, warehouses, you name it? So, at the end of the day, what we have found is that it is not one technology to solve all. When you bring in multiple technologies, that does increase some complexity, but done right, you can get a lot of gains, and a lot of dividends off of what we’ve architected out.

John Riewerts (14:08):

And so, I think, to really finish that out too, what we have seen, and I’m sure the listeners can relate to this, is that there also is no magical tool out there, whether it be at the app layer, the data layer, or the compute layer, that just magically solves all this. At the end of the day, there’s still a data engineering effort to think through how we approach these data use cases. Luckily, this is where Acxiom gets the pleasure of helping our clients ultimately achieve, and activate those use cases, through those data engineering exercises.

Utpal Bhatt (14:50):

Got it. Got it. So, John, I think we talked earlier about the three categories of benefits, we talked about personalization, activation, and then decision-making processes. So, can you help us with a couple of examples of use cases, where, and how this manifests in the Acxiom platform? And, what is the underlying data architecture that supports that particular use case?

John Riewerts (15:19):

Yeah, yeah. Great question. So at a high level, more pretty architecture slides with cool icons. But, yeah, I mean, you can break down those different use cases we mentioned. I mentioned the first three. We, from a technical perspective, see them through a different lens, whether it’s analytics, audience engagement, or audience management, and then finally, real-time engagement. Ultimately, from a technical perspective, most of the use cases fall within those realms, collaboration being horizontal across those.

John Riewerts (15:56):

And so, here are a couple of quick, one-word-based use cases, or groups of use cases, that we often deal with, with our clients. A couple, let’s get down into the weeds type examples, and we could talk through the different technologies. It might be something, we could start with very traditional use cases. I have 20 different feeds of data. How do I get that 20 different feeds of data into a single data model, that I can ultimately action, and campaign against?

John Riewerts (16:36):

We’ll start with the most basic of use cases, right? So this gets into, “Okay, how do we link this data together? How do we build a brand identity graph that we can use to recognize these 20, or 50, or 100 different data types, or categories of data that we can bring together to ultimately activate.” So, it might be simple campaigning across multiple channels, which gets into the multi-channel measurement.

John Riewerts (17:07):

So, my wife is a wonderful example. She’d probably not like knowing that I’m using her as an example in this, but I’m going to do it anyway. So, my wife’s a great example. She often gets advertisements via email. She will often look at those emails, and click on that email, probably brought to an online store, which is now developing a clickstream of her behavior. Then, she’s actually very prone to going and looking at the reviews, reviewing a particular product, maybe even potentially adding it to a cart, and then the cart is abandoned. Leaving the cart, turning around, all because she could probably drive five miles to the store that’s the center of the advertisement, and ultimately, finally, convert offline.

John Riewerts (18:04):

So, one of the fun challenges there, is that little use case alone probably requires nine to 10 different feeds of data to ultimately recognize the ROI of the initial email campaign, or social campaign, wherever she got the ad first. And so, how can we stitch all that together, so that we can truly see the return on that initial campaign that was done, that ended up as a cross channel conversion, that ultimately jumped across multiple channels.

John Riewerts (18:38):

So, in that type of example, at Acxiom, we often start with a data lake. Quite frankly, just full transparency, this is where we leverage Qubole heavily, from a compute platform perspective, on top of shared storage services like S3. And so, we might bring that data in, how do we then ultimately tie all that data together, so that we can recognize that consumer journey across the different channels, to finally recognize that conversion?

John Riewerts (19:21):

Now, that might have occurred by developing particular models around that data, to infer that conversion, where we may be using some kind of modeling platform, or we may be tying this into ultimately reporting that we’re going to be showing to a CMO, to say, “Hey, here’s your final conversion.” Where we’re pulling in different third-party BI tools. These different use cases, pulling in these different third-party partners, requires different underlying technology sometimes, because we all know that you can’t just necessarily go to Tool A and say, “Hey, I need you to work with this underlying data technology.” So the reality is, that we ultimately have to have a couple different best of breed toolsets, to be able to ultimately enact that use case.

John Riewerts (20:18):

Another one might be a financial services client who sends out an ad to sign up for credit cards, something along those lines, and we might want to have a real-time engagement with that consumer, ultimately strengthening their journey, as they’re trying to identify the appropriate credit cards. Well, now this gets into, whether it be 200 millisecond response times, or often smaller, which requires, again, a different data technology such as a document store, like MongoDB or something like that.

John Riewerts (20:59):

So, whether you’re building initial campaigns, and a client may have heavy data analysts who understand SQL, or you’re working with different data scientists, who understand a plethora of different technologies and languages, like Python and R, pulling in different frameworks, like TensorFlow, you name it, tying all that together in a cohesive manner is the fun challenge. But at the root of that, in my personal opinion, this is where the data lake and the data warehouse have a really nice relationship with each other. And, if you only had one use case, then you may not need this, right?

John Riewerts (21:57):

I mean, if you’re just quite literally trying to solve a very simple use case, with a couple of very simple data inputs, or data feeds, you may not need this complexity. But the reality is, that the vast majority of us have more than one use case. And, we have to, as engineers, consider what use cases we may have in the future. And so, that plays into, let’s take a next best offer use case, where a particular event comes in, we’ve got to make a decision right at that point in time, how we want to treat that event, to ultimately help the consumer on their journey in an asynchronous manner, to make a decision on the next best approach that we’re going to work with that consumer.

John Riewerts (22:50):

That, while a transactional use case, then also has a… We build up data, whether it be at the event itself, or exhaust from the subsequent offers. And we can absolutely apply analytics around it. So, even a transactional use case ends up having a large volume of data, that we could potentially leverage for the next use case or the next challenge that we want to take.

Utpal Bhatt (23:28):

Yeah, I love the fact that it’s the use case that determines the data platform. And if you have just one use of the data, you could engineer it differently but if you have multiple uses of the data, then you may want to look at the future, and how you’re going to use it in a different way. Great. So, given that Acxiom has so many different ways of solving the data, the use case [inaudible 00:23:59], how do you know you’ve got it right? So how do you measure the success of the initiative? What are some of the ways to determine the ROI, and feel good about mapping the use case to the underlying technology?

John Riewerts (24:16):

Yeah. Good question. At the end of the day, we know we’ve got it right when our clients know they got it right. I mean, that’s kind of cliché to say, but it’s the truth. I mean, we live for our clients to be able to say, “Okay,” Let me help you take that next best offer use case, and go, “Okay. All right, let’s break this down. We’re going to have this amount of spend on this marketing effort. Are we translating to this particular goal of revenue that the marketing officer is ultimately attempting to achieve?”

John Riewerts (24:58):

So, at the end of the day, not to make it overly simple, but it is that simple. We measure our success based on our consumers, and our own customers and their success. Now, underlying that, there’s potential… There’s down in the weeds types of things that we can look at, take a look at. Does it make sense to have the overhead of 30 different data technologies, to solve 30 different use cases? Probably not, because now you’re reducing your ROI, based on how much you’re investing.

John Riewerts (25:42):

So, okay, great. Let’s find a couple of best of breed, if you will, let’s find a really performant data warehouse out there. Let’s find a really performant document store out there, and we may be able to leverage those for different types of use cases so that we can ultimately have the leanest footprint of a platform but be able to scale appropriately, to not just the volume of the use cases, but the variance in the use cases themselves.

Utpal Bhatt (26:21):

Got it. Got it. That’s great, in terms of a quick summary for our listeners, then, it really ends up being, at the end of the day, you need to deliver your stakeholder satisfaction, in this case, their end customers. If they’re happy, I think you feel like you’re getting it right. And then, of course, the economics of getting that right, in terms of the number of people, number of platforms you’re managing, and so on.

Utpal Bhatt (26:51):

So that is a great segue for us to get into our next segment, because this is a great setup for us as we hear about your own experiences, working on a platform at Acxiom. What I would like to do in the next segment is lift it up a level. I know a lot of our attendees at the webinar are also grappling with the same types of questions. When do I use a data lake? When do I use a data warehouse? How do I know I’ve got it right, or not?

Utpal Bhatt (27:26):

But before we do that, we have a couple of poll questions that we have. So, let’s go through a couple of poll questions to see where everyone is in their journey today. If everyone can take 10 seconds to pick your choice? The question is, which platform are you using? Are you using a data lake, a data warehouse, or are you using both or none of the above? All right, we’ll close the question in three seconds. All right [inaudible 00:28:00]. Alright, so almost 50% are using both, which is fantastic. And then we have another 26 using the warehouse, and 13.

Utpal Bhatt (28:13):

Let’s go with the next question as well, Ashley. How large is your current data footprint? All right, we can close in three seconds. All right, let’s go ahead and close it. God, lots and lots of data. We have almost two-thirds of our audience using upwards of [inaudible 00:28:52] terabytes. That’s great.

Utpal Bhatt (28:56):

So, in the next section, the next section really deals with the industry as a whole. I mean, there’s a raging debate going on right now, on the use of a data lake, versus a data warehouse. What is the footprint of a data warehouse, in terms of use cases? And alternatively, what is the footprint of a data lake in terms of use cases? So the first question for you, John, is [inaudible 00:29:28] are the differences between the two architectural approaches for us? What is a data warehouse? What is it good for? Versus, what is the data lake? What is the data lake good for? And as a starting [inaudible 00:29:43].

John Riewerts (29:43):

Yeah, I think if you were to probably search on the internet, you’ll get several different variances of answers, but I think what most of those largely agree on, is the idea that a data warehouse has helped us get to where we’re at today. I don’t think we can say that we’ve not used the other at some point in time. And so, at the very crux of it, if you need the speed, and the optimization and you know your data model, as it relates to a very specific data model, it’s a fully structured, or pretty close to it, some semi-structured, but usually pretty fully structured data model, and you can narrow it down to that. That’s essentially where you’re at with a data warehouse.

John Riewerts (30:52):

Most often, I think you’ll see a lot of papers associate this with SQL-based warehouses as well. So, Snowflake, if you will, where you have a SQL-based interface, heavily rooted in a really nice, well-defined data model. Where you see data lakes, this comes into the idea, and the notion that, get the data first, then declare the intent. And so, let’s just focus on getting data first. And that data may be unstructured. It itself may be structured, but we don’t know the structure of it. And, we don’t necessarily know the particular context that we want to use it in yet.

John Riewerts (31:45):

So, it’s the recognition that data is ugly, and let’s embrace it, instead of hide from it. And so, I think that’s where data lakes tend to live, is this idea that I don’t want to say Wild Wild West, but it is the idea of democratizing different data sets for use. Some of which may be structured, some of which may have a little bit more of a data model, and some of which may have a little bit less of a data model. I think also, too, that you see is the flexibility in the use cases.

John Riewerts (32:23):

So, whereas, if you were to put up against a high-performance SQL engine, with appropriate indexing, and things of that nature, yeah, that’s going to outperform for the particular data model, and requests that that model was set up to support. Now, if you need a just in time, you don’t necessarily know what that use case is, and you need a lot of flexibility around potentially bringing in modeling technologies, or you’re just trying to do data discovery, you need to understand what you’re dealing with before you can actually activate it, and use it.

John Riewerts (33:12):

This is where the true flexibility of data lakes comes into play. One of the things I like about this is that it’s the best… The nice thing about this is, and why I think both of them have a place, often, in sometimes a single solution, is that competition is a good thing. We live in a wonderful world where there are more potential offers of how we can structure data, compute against that data, organize that data, and serve that data, than ever before.

John Riewerts (33:55):

And so, whether we can… At Acxiom, we often start with a lake, because it’s all about the idea of getting the data in first. But then we often end up either leveraging the lake itself, for particular use cases, or funneling some kind of aggregate of that data into potential ODSs, or potential OLAP or OLTP stores, if you will, to solve very particular use cases. So, ultimately, there’s a nice tie-in on the very theoretical definition of the two.

Utpal Bhatt (34:39):

Great, I liked the articulation of the two in terms of a structured SQL based approach, which naturally lends itself to a warehouse, and that’s where you’re going to get the most efficiency and performance, whereas newer kinds of use cases that require the flexibility of data, and more interactive data engineering, will lend themselves to the data lake.

Utpal Bhatt (35:06):

So, I saw a question, somebody posted in our chat, as well, which is a really good question. I know a lot of our listeners are grappling with this one. As the market gets louder, and louder in terms of a drumbeat for one or the other approach, and the overlap. What are some of the anti-patterns with data lakes, and data warehouses? In other words, at what point, let’s say if you were to try using one for a use case that it’s not meant for, or suited for, what are some of the perils of doing that? And how does it manifest from a practitioner’s point of view?

John Riewerts (35:55):

Yeah, that’s a good question. You’re not going to like this very vague answer. But, I’m going to give it anyway. Ultimately, I think it depends on the use case. Right? So, a good example of an anti-pattern. If you know your model, if you know the model of data, and you know that, in all likelihood, that’s not going to change, and that use case pattern is not going to change, maybe a data lake isn’t right for you. Because you know your model, you know the data that you’re getting, and you know there’s not going to be variance.

John Riewerts (36:35):

On the flip side, if any of those are false, I personally think then you may consider starting with a lake. The good thing is, especially if you take this in a public cloud approach, storage is cheap, right? I mean, at the end of the day, storage is cheap, and can also be managed. The reality is, in my experience with petabyte-size lakes, the vast majority of that data is for some deep, archival purposes, and there’s really only a small sliver of that data that gets actively worked upon.

John Riewerts (37:20):

But, at the same time, you know you may need that data for some kind of historical trend use case. And so, that’s a great example of where to have something like a lake as a catch-all storage manner. And, as I said, we base ours, at the very base of it is on S3, or object storage. And the reason we do is that we’ve recognized that there are a bunch of compute platforms that can do a good enough job working directly with S3.

John Riewerts (38:03):

So we have a physics problem, right? We don’t want to ship data any more than we absolutely have to, which is another anti-pattern, right? Because we don’t want to be duplicating data all over the place, for every single unique use case. If there’s only a 5% gain on a particular query pattern by switching to new technology, is a 5% gain worth it? So, really, the data on the performance of activating those use cases can, and should, drive whether or not it’s good to use a different technology.

John Riewerts (38:38):

A great example. As I said, a high-performance SQL warehouse. If your use cases are very specific, almost always tied to SQL, well, heck yeah, use that. Now if you also have needs where you may want to pull in TensorFlow to do discovery or some kind of propensity model against that data, then okay, great, there are alternatives that can give us SQL, and can give us those capabilities.

John Riewerts (39:19):

But, if the SQL has a particular SLA that needs to be met, because it’s more operational in nature, and it needs to be under 200 milliseconds, maybe SQL on top of a data lake may not make sense. And so, these are those back-and-forths that ultimately we, or I personally, have experienced, where you have to weigh what is the right tool for the right job, to ultimately get there.

John Riewerts (39:48):

I think as it pertains to, is it time to switch, personally I think, like I said, if it’s a very known use case, maybe. But if you foresee additional use cases coming into play, that are not just batch-oriented as well, that could be transactional or streaming-based oriented, then maybe not. And maybe actually, as opposed to switching, it may be more appropriate to add, to complement the data lake with a data warehouse.

John Riewerts (40:25):

And so, again, a good example that we often deal with is, again, a next best offer approach. So, I’ve got an event coming in. That’s a single record, some kind of event on a Kinesis, or Kafka or pub/sub topic that’s coming in. We need to act upon it. We need something highly performant that’s distributed to act upon that. But at the same time, we may want to keep that for later. Whether that’s for training purposes for a particular model, or we may contribute that to a data model that we ultimately expose into a warehouse. So, just the idea that, what I’ve experienced at least, is if you have more than one use case, and you have a feeling you’re going to get more use cases, this is where the additive approach of a data lake, plus a data warehouse comes into play. Within reason.
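
To illustrate the event-driven pattern John describes, here is a minimal Spark Structured Streaming sketch of a next-best-offer style flow: act on each micro-batch of events as it arrives, and also archive the raw exhaust in the lake for later model training and analytics. The topic name, broker address, decision logic, and paths are hypothetical placeholders, not Acxiom’s implementation.

```python
# A minimal Structured Streaming sketch; broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("next-best-offer-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "offer_events")
          .load()
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp")))

def act_and_archive(batch_df, batch_id):
    # Low-latency path: attach a (placeholder) next-best-offer decision to each event.
    decisions = batch_df.withColumn("offer", F.lit("placeholder_offer"))
    decisions.write.mode("append").parquet("s3://example-bucket/lake/serving/offers/")
    # Archive the raw exhaust in the lake for later training and analytics use cases.
    batch_df.write.mode("append").parquet("s3://example-bucket/lake/raw/offer_events/")

query = (events.writeStream
         .foreachBatch(act_and_archive)
         .option("checkpointLocation", "s3://example-bucket/checkpoints/offer_events/")
         .start())
```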

Utpal Bhatt (41:30):

[crosstalk 00:41:30] Great. Yeah, that’s a really good way to summarize it, is that, yeah, if you have one use case, and if it’s SQL only, high-performance SQL, then a warehouse is there to store that data. But if you have multiple use cases, or usages for your data, some unknown, then it might not be such a straightforward answer, even though the first use case might be SQL.

John Riewerts (41:55):

If I could add one more thing, just thinking about-

Utpal Bhatt (41:58):

Sure.

John Riewerts (42:00):

Again, we as consumers of technology, are getting to enjoy the fruits of everybody’s labor, because there’s high competition for you to have your data in somebody’s technology. And so, what we’re seeing is, we see a lot of technologies, who are really making some great advancements over this idea that it can take on more than one simple use case. So, you see these data warehouses that can now expand into more and more use cases. You see data lakes that can say, “Hey, I can get even more high performant.” Whether I’m bringing in ACID Hive, Delta Lake, or something like that, to help optimize data storage, to eke out more performance.

John Riewerts (42:55):

Or on the warehouse side, you see companies who are coming out, going, “I don’t have to just use SQL, I could do other things too, against this data.” So, you see this convergence coming from both sides. The cool thing about this is that we win. The consumer wins because now we can start to leverage these different technologies, where appropriate.
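
As one concrete example of the lake-side convergence John mentions, here is a minimal sketch of using Delta Lake to get transactional upserts on top of object storage. It assumes the delta-spark package is available on the cluster; the paths and key column are hypothetical.

```python
# A minimal Delta Lake sketch; assumes delta-spark is installed. Paths are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("delta-upsert-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Land raw events as a Delta table instead of plain Parquet to get ACID guarantees.
events = spark.read.json("s3://example-bucket/raw/events/")
events.write.format("delta").mode("overwrite").save(
    "s3://example-bucket/lake/events_delta/")

# Later loads can be merged in transactionally rather than rewritten wholesale.
target = DeltaTable.forPath(spark, "s3://example-bucket/lake/events_delta/")
updates = spark.read.json("s3://example-bucket/raw/events_incremental/")

(target.alias("t")
 .merge(updates.alias("u"), "t.event_id = u.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```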

Utpal Bhatt (43:24):

Got it. Got it. So on that note, John, we have one more poll coming up. We just want to see, in terms of our audience, where are they with regard to their data platform journey? So, Ashley, do we have a poll question coming up? Okay, let’s close in three seconds. 3, 2, 1. All right, let’s see what the results of the poll are. Okay, a similar sentiment. A majority of the audience is still looking at adding, or has already added, both into the mix.

Utpal Bhatt (44:25):

So, this has been a fantastic conversation. I think we’d be remiss if we didn’t ask you to look into your crystal ball, and predict where you think these markets are going to end up. We have, as you point out, innovation happening on the data lake side, innovation happening on the data warehouse side, and at least one could argue right now, it’s still… There is a clear kind of demarcation in terms of use cases, and the first use case for a particular piece of technology. But looking into the future, what are your thoughts? And also, as a practitioner, where would you like the market to go?

John Riewerts (45:12):

Yeah. You subtly mentioned a comment at the beginning that I think is so true, and it’s this idea of past, present, and future. Personally, I think we are at a point now… 10, 15 years ago, we were at a point of, how could we take data, and potentially reach a customer in the next month, or the next week, or the next day, based off of that data?

John Riewerts (45:39):

In the last decade or so, we’ve gotten to, how can we reach a customer in the next 10 minutes? In the next five minutes? With predictive modeling that’s taking place, what I often see is that it’s almost like how can we future predict what you’re about to do? And so, in order to do that, you ultimately have to understand the past, and understand behaviors of how journeys have happened, to be able to predict that future.

John Riewerts (46:20):

So, what I see in the future is even more data, no shortage of that. But that data is driven by more, and more, and more event-based approaches. So, as opposed to us deciding, “Hey, I want to target this large-scale audience.” It is continually, and this is not necessarily something brand new, because it’s been happening for at least half a decade. But, how do I take events that are happening, that I can glean from, and predictively model, so we could potentially help consumers make decisions?

John Riewerts (47:07):

And this, I mentioned a lot of marketing use cases. But I think that this is across all sorts of industries. I mean, hey, look at SpaceX with the rockets that have been launched. I mean, the amount of modeling that has to take place, to be able to predict just in time events as they’re happening, as a rocket is being launched, or as it’s coming down, being able to predict that just in time, requires that historical knowledge, to be able to drive a future action, or direction, whether it be for a rocket, whether it be in the automotive industry, healthcare industry. I mean, look at what we’re dealing with COVID. I mean, we are constantly trying to take in data in real-time.

John Riewerts (47:59):

It started out by going, “Okay, let’s all start aggregating daily.” Now it’s, let’s start aggregating monthly. This is a microcosm of exactly the evolution. Now it’s like, “Okay, now can we potentially predict…” Germany just released an app to say, “Hey, have you been exposed? Or do you know anybody that’s exposed, so that we can kind of predict, okay, geographically? How can we build models that show where potential exposure happens?” These are all great examples of where more, and more, and more, I see it evolving into predictive analytics, to help shape a future, as opposed to, I don’t want to say manually reacting to a past, but yeah, manually reacting to a past.

John Riewerts (48:45):

Now, machine learning algorithms require some sort of understanding of the past. So, you have to have that data set that represents the past, to be able to predict the future in real-time. So, it’s a nice symbiosis between batch and real-time that’s happening.

Utpal Bhatt (49:07):

That’s fantastic. Yeah, I think that we figured out through BI platforms, that industry has matured significantly, to use data to understand the past. I think we are at a frontier now where real-time, or streaming analytics, and of course, machine learning, is becoming more and more prevalent. And it’s definitely going to be something that we’ll see at the same level of maturity as BI platforms going forward. So, using the data to understand the present, and using the data to predict the future. Great.

Utpal Bhatt (49:43):

John, this has been a fantastic talk. There’s one question from the audience here that we haven’t answered yet. So, the question is, what are your thoughts on how do you join data between a data lake and a data warehouse? As opposed to just simply copying it there.

John Riewerts (50:04):

Yeah, great question. So, at the end of the day, what Acxiom, what we leverage currently is Spark. We don’t think of Spark like a traditional Hadoop; we look to providers like Qubole to really treat Spark as a service. It’s a compute service. And so, first, there’s a physics problem. If you have to do joins across two different locations of data, whether it be a data lake, data warehouse, or two different data stores, one might be a document store, one might be a relational database, and one might be a key-value store.

John Riewerts (50:53):

What we have found is that ultimately, we see a lot of value in leveraging Spark, which represents a distributed in-memory way of computing, to be able to read that data. First off, if there are any kinds of pre-aggregations that we can do within the original platforms, let’s take advantage of that compute, bring that into a data frame, and then ultimately leverage the joins in memory, in a distributed fashion, leveraging Spark. The value that we ultimately resolve to, is seeing Spark as that compute platform, almost like… I promise you, they didn’t ask me to say this, but this is honestly the way we view Qubole, it’s kind of like a Lambda for Spark. It’s a serverless, general approach to Spark, Spark as a managed service. And so, that way we can increase, and decrease the compute need in a just in time fashion, to ultimately be able to do these joins.

John Riewerts (52:04):

The three slowest things in any system are humans, network, and I/O, I/O to disk. Those are the three slowest things. I try to work fast, but I’ll never work as fast as a computer. The other two are physics problems. And so, by being able to have a platform that has a high degree of flexibility, in terms of being able to read from different locations, disparate locations, including transactional systems, pub/sub systems, and warehouse systems, and be able to do that type of join appropriately, is why we’ve landed ultimately on Spark to handle the majority of the load. That’s not to say all the time, but the vast majority, we use Spark to ultimately leverage that join, within the data frame.
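
For readers who want to see what that pattern looks like in practice, here is a minimal PySpark sketch: push a pre-aggregation down to the warehouse, read the behavioral data straight from the lake, and do the join in Spark’s distributed memory rather than copying either dataset. The JDBC URL, credentials, table, and column names are hypothetical assumptions, not Acxiom’s actual systems.

```python
# A minimal sketch of joining warehouse data with lake data in Spark without copying either.
# JDBC URL, credentials, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-warehouse-join-sketch").getOrCreate()

# Push a pre-aggregation down to the warehouse so only the summary crosses the network.
warehouse_agg = (spark.read
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://warehouse.example.com:5432/analytics")
                 .option("dbtable", "(SELECT customer_id, SUM(spend) AS lifetime_spend "
                                    "FROM transactions GROUP BY customer_id) t")
                 .option("user", "svc_readonly")
                 .option("password", "REDACTED")
                 .load())

# Read the behavioral data directly from the lake.
clickstream = spark.read.parquet("s3://example-bucket/lake/raw/clickstream/")

# Join in memory, distributed across the cluster; neither store ingests the other's data.
joined = clickstream.join(warehouse_agg, "customer_id")
joined.write.mode("overwrite").parquet(
    "s3://example-bucket/lake/marts/spend_by_behavior/")
```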

Utpal Bhatt (53:08):

That’s great. I like the use of physics in this explanation, and I think it’s very enlightening because, at the end of the day, you’re absolutely right. If there is data movement involved, then there is latency, there’s a performance hit, there’s cost, all sorts of things. So, minimizing it, not going against the physics of data movement: if you can achieve that sort of performance, keeping the data in place is definitely the way to go about it. I’m going to remember that. So, I don’t see any other questions. So John, any last comments for our listeners?

John Riewerts (53:56):

We live in a fun time. As a data geek, there are a lot of different ways to solve problems. I think one of the… As we were prepping questions, one of the last ones was, suggestions on how we might approach, and lessons learned that I’ve had. And it really does come back to a couple of key things for me, that I’ve learned is, one, don’t over architect. You can over data model. You can bring in too many technologies, and you can try to come up with the most optimized way to solve a particular minute, micro use case, but it may ultimately result in a higher TCO when combined… The total cost of ownership, when combined with other use cases.

John Riewerts (54:58):

So, if it’s 95% good enough, and you spend $600,000 in R&D to try to squeak out $500 a month of usage, that may not be worth it. So, over architecting sometimes is a reverse problem of having so many nice tools to play with. So, sometimes good enough is literally good enough. I think also, for me, I’ve been hit with this personally. Choosing the right tool for the right job requires you to know the job first. Jumping into saying, “Hey, I’m going to do a data warehouse.” Or, “I’m going to do a data lake.” Or, “I’m going to do a document store.” Or whatever, without understanding what the use cases are first, often got me into trouble. I often would do that at times earlier on, and learned the hard way that really, it’s more important to understand the true use case, and the acceptance criteria behind that use case, and then start evaluating the best tools.

John Riewerts (56:21):

I think also, the cost is obviously a key component to this. One of the things that Acxiom learned early on, especially as we were going into the public cloud, is that a straight lift and shift from a traditional data center to a public cloud is… I don’t want to say always a recipe for disaster. I’m sure there are a couple of use cases where it’s good. But, more often than not, it does require some level of re-architecture, and you’ve got to weigh that against the don’t over architect. But, one of the foundational things that we’ve come across, is the separation of compute and storage. Because what we have found is that we can scale both of those independently, very, very well. And we’re actually seeing this in the industry, where there are so many different toolsets, whether it be SQL based toolsets, data lake based toolsets, key-value store based toolsets, that are taking this notion on, of separation of compute and storage, that allows us to independently scale each of those.

Utpal Bhatt (57:39):

Great, great, again, that’s a great summary. So keep it simple. Don’t over architect or over-engineer it, use the right tool for the right job. Let the use case determine what you should be using. And the cloud does come to the rescue when it comes to scaling. So, great points.

John Riewerts (57:58):

There’s one more I want to squeeze in there if I can.

Utpal Bhatt (58:01):

Sure thing.

John Riewerts (58:03):

Separate data from context. Data coming in doesn’t necessarily have the context of what you want to use it for. So, separate the idea of getting data to a location, prior to figuring out what you want to do with it. Because in all reality, you’re going to have multiple uses of that data. And so, you never know what you may use that data for. So, if you start with just getting the data in first, and then figuring out what you want to do with it, it generally leads to a more positive outcome with that data.

Utpal Bhatt (58:38):

Great. Excellent. Well, John, thank you so much. We’re out of time. It was a real pleasure having you on this topic. And yeah, thank you so much. This was extremely informative. I learned a lot. I’m sure our listeners did, too. Once again, thank you, and this concludes our panel today.

John Riewerts (58:59):

Thank you, really appreciate the time. Thanks, everybody for joining.

Utpal Bhatt (59:03):

Thanks, everyone.
