Qubole & Snowflake partnership briefing at Big Data World London, 2018


Richard Lawrence: Good morning everyone. My name is Richard Lawrence from Qubole. Thank you for joining us today in the Machine Learning & AI Theatre. We’ve got about 20 minutes where we’re going to talk about the partnership that was recently announced with Snowflake. I have a co-presenter here: Peter is going to talk to you about the value that Snowflake brings to this, and I’m going to talk about the Qubole piece. Thank you for your time.

We have quite a tight amount of time, but please feel free to ask questions as we go. As you may have noticed, strategically, the Snowflake booth is just outside, and the Qubole booth is just outside as well. If we don’t get time to answer all of your questions in the session, please feel free to come and talk to us.

You may have seen that Matt is giving out our data platform book. It’s a really good read, actually. It was written by our two founders, the gentlemen who built the data platform at Facebook, and it tells a really good story of the challenges they found at Facebook back in 2007: how they built the data platform that allowed all of the employees at Facebook to use data to make decisions, which back in 2007 was a pretty tough thing to do. Along the way, they did a lot of work, and they co-authored Apache Hive. They started to create the tools to make data, analytics, and data science accessible to more business users.

We do have a PDF copy as well if you don’t want to carry a book around with you for the rest of the conference. Just to explain my background, I’ve been in data pretty much my whole career. I started in ETL and business analytics, and I’ve also done some time in the database space and the NoSQL space. Data is obviously critical to what we’re doing today, which is why we have such a big conference. I’ll let Peter introduce himself.

Peter Jansens: Absolutely. Thank you, Richard. My name is Peter Jansens, from the Netherlands, currently supporting Snowflake. I just learned that my background is very similar to Richard’s: data integration, the ETL space, business intelligence, way back when it was called decision support systems. You know, grey-haired guy talking. Like I said, the move to bringing data into the cloud arena was almost natural to me, and it has led me to Snowflake, where we are today. Of course, it’s very good to have a database that can contain all your data, but the key thing is the combination of products that can help you bring value from that data.

Richard: It’s probably worth just setting the scene in terms of where the partnership came from. We’re going to start off by talking a little bit about what Qubole does, then we’re going to talk about what Snowflake does. Then, we’re going to talk you through three of the use cases that we see customers using in this area. The partnership came from a customer demand. A number of existing customers of Qubole and a number of existing customers of Snowflake realized there was value in terms of bringing the two technologies together. Like I said, we’ll talk you through some examples.

I guess, overall, what we’re trying to do is make data more valuable to your organization, make sure that your business analysts and your data scientists can actually get access to more data but using that data more simply and actually producing more results with the data. We’ll just talk you through some customer examples and explain the partnership. What we’re now doing is working very closely on the ground. Qubole has an office here in London; Snowflake has an office here in London, so we’re engaging directly with customers where they see value in us working together.

Just in terms of explaining the problem that Qubole solves, as I’ve said, we’re seeing more and more people that are realizing how critical data is to their business decision making process. Many companies today are trying to drive every business decision with data. It makes sense. It seems like a way to get more value. If you’re a retailer, for example, you want to have personalization, you want to have a 360 view of your customer. Data is becoming more and more important.

What you can see along the top here is that more and more users want access to data, and more and more users want some type of self-service. There are more use cases; there are more data scientists; there are more data analysts; there are more tools out there, the Tableaus and the MicroStrategys of this world. There is a lot of demand at the top from growing data requirements.

The problem is the teams down here towards the bottom: the data engineering and DataOps teams are under more and more pressure to deliver to the users. We start to see a problem here in terms of the ratio, because in a good company one DataOps, data admin, or data infrastructure person may be able to support eight to ten users, eight to ten data scientists or data analysts. That might work when you’ve only got a handful of users at the top, but as the number of users grows, you can’t keep manually throwing people at this problem, and you can’t rely on manual processes. You need to automate that as much as possible.

It’s a pretty obvious area where you’re going to see a bottleneck developing. What a lot of our customers are trying to do is encourage more data science and data analytics use cases at the top, allowing the people at the top to really focus on their data and not worry about the infrastructure underneath. If you’re going to manage that infrastructure, particularly in the cloud where you’ve got the option to have a lot more capacity and to scale up and scale down, clearly you need to put an automation layer in there because, as we’ve seen, these data engineering people down here are much in demand, and there just aren’t enough of them to go around.

What Qubole offers is a platform that both provides capabilities to the top line, to the data scientists and data analysts, and helps to manage the data infrastructure, making the data engineering and DataOps guys much more productive. Peter?

Peter: Of course, the amount of data that needs to be managed is growing and growing and growing; I see some nodding faces here. The amount of data that needs to be managed is growing, and that makes life very easy, right? All you need to do is hire a little army of DBAs and data engineers to support all that and actually make it happen. Of course, I see smiling faces; everybody knows that’s not the case. From a data management perspective, from the underlying data infrastructure perspective, Snowflake had a very specific–

Richard: Just press it one more time.

Peter: Let’s try that button again. Snowflake had a very specific philosophy towards the architecture that we wanted to implement, because nothing seems as easy as taking your data warehouse database, putting it in the cloud environment, and off you go. Not so. That just brings all the limitations of your existing database to the cloud as well. I mean, the cloud, if anything, is unlimited. Therefore, we started to build a data warehouse specifically for the cloud, and what we wanted is for you to be able to use it as if it’s just another database, like you’ve been doing for decades now.

This is a marketing slide. I always say that because there are probably marketing people in the room. I’m a techie. Minimal management. I see people smiling and nodding, going, “Yes. They all say minimal management.” Minimal management. For the people who are used to using databases, it’s always interesting to know: “Am I going to put this data in separate tables, or am I going to place it together? Am I going to flatten this structure, or should it be a star schema?”

That’s the type of thing you would be doing, and that’s the kind of decision you can instantly start to make. You do not start making decisions like, “Should I buy an IBM box or a Dell box or an HP box for this?” because it’s in the cloud. It’s already there. All you need is a URL, a link, to get to that database. Another interesting thing: when you think about relational databases, when you think about SQL, you’re thinking structured data. Rows and columns.

We said, “Well, you know what, lately big data is about much more than just the basic rows and columns.” From the ingestion perspective, when we get data into the system, we want that data to be, the term that is often used is unstructured. I hate that term because it doesn’t make sense; all data has structure. Let’s call it semi-structured. Think XML, think JSON, think documents, these types of things. You want the capability to ingest that data into the system as well. That also means that you want to be able to support all of your users.
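The semi-structured ingestion Peter is describing can be sketched in Snowflake SQL using the VARIANT type, which stores JSON documents as-is and lets you query nested fields directly. This is only an illustration; the table, stage, and field names here are made up:

```sql
-- Hypothetical example: a VARIANT column holds each JSON document whole
CREATE TABLE click_events (raw VARIANT);

-- Load JSON files from a named stage (stage name is illustrative)
COPY INTO click_events
  FROM @my_json_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Query nested fields with path notation, casting as needed
SELECT raw:user.id::STRING   AS user_id,
       raw:page::STRING      AS page,
       raw:ts::TIMESTAMP_NTZ AS event_time
FROM click_events;
```

The point is that the schema of the JSON does not have to be declared up front; the structure is discovered at query time, so SQL users work with semi-structured data the same way they work with rows and columns.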

It’s like, “All of your users. Yes, duh. Of course, you do.” But think about it. Think about the environment where you would have your data warehouse, where you would have all your data available both to the data scientist, who will be running broad, exploratory queries looking for trends and outliers, and to your business intelligence users, your marketing analysts, saying, “Hey, I have a very specific question: what result did this marketing action have today?” for example. What you want is to have everybody work on the very same data using the tools they are used to using, for example, SQL.

Of course, the one on my far left, on the far right for you: we’re in the cloud. If we’re not using the system, we don’t want to pay for the system; and if we use the system a lot, then we’re prepared to pay a lot for that system. Spin up a data warehouse; think about it. That’s really what Snowflake is all about. The data is there. All you need to do is spin up your warehouse and get access to the data, and not just access but unlimited concurrent access, because if I want to connect a hundred users from the marketing department to that database in the cloud, I would spin up enough warehouses that all use that same single source of truth.
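The spin-up-and-pay-for-use model Peter describes maps onto Snowflake's virtual warehouse DDL. As a sketch (warehouse names and sizes are invented for illustration), separate teams get separate compute over the same single copy of the data:

```sql
-- Illustrative only: a small warehouse for the marketing analysts
CREATE WAREHOUSE marketing_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60     -- suspend after 60 seconds idle, so you stop paying
  AUTO_RESUME    = TRUE;  -- wake up automatically on the next query

-- An independent, larger warehouse for the data science team;
-- both read the same underlying data, so there is no copying
CREATE WAREHOUSE datascience_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND   = 300
  AUTO_RESUME    = TRUE;
```

Because each warehouse is its own compute cluster, the marketing queries and the data science queries do not contend with each other, which is the concurrency story in the next paragraph.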

From a concurrency perspective, you’re really only limited by what the cloud provider can supply you with, and I’m not sure anyone’s going to hit that limitation, not in my lifetime anyway. Those are really the key things this is all about. I was given two minutes, and if you let me, I’ll talk for two days. These are the key philosophies in the architecture of the underlying data management structure.

Richard: We have about five more minutes left, so we’ll move on to the three use cases that our customers are asking us to cover. I’ll let Peter explain the first two, and then I’ll do the third if that’s okay.

Peter: Absolutely. I’ll tell you about this first one because it’s around advanced data preparation. You are, basically, let’s say, the data scientist making sure that the data brings you value. Where does that data go? In this situation, you would have Qubole running in the cloud, where all the artificial intelligence, the AI processing, and the algorithms are running. You would have Snowflake, the virtual warehouse, running in the cloud, accessing its data, which in Amazon would live in S3 because that’s the cheapest place to store your data.

Qubole has a direct interface, built through the cooperation between Qubole and Snowflake: getting data from the system, from the database; doing all kinds of data preparation based on, again, trends and outliers, regressions, what have you; then pushing the results of those analyses, that cleansed data, only the features that are relevant to a specific set of users, back to that same virtual warehouse. That interface is a two-way interface.

Richard: Sorry to interrupt. For example, this could be clickstream data where you want to use Apache Spark, as we have here, to aggregate the data down and find the more useful information in it. Then in this first example, you’re preparing it and pushing it back to Snowflake, where you build your data warehouse. [unintelligible 00:14:58] comes next.
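At production scale the aggregation step Richard mentions would run in Spark on Qubole before the result is written back to Snowflake. The core of that step can be sketched in plain Python; the event fields here are hypothetical:

```python
from collections import Counter


def aggregate_clicks(events):
    """Reduce raw clickstream events to per-page view counts.

    Each event is a dict like {"user": ..., "page": ...}. This is the
    kind of roll-up that, in the use case described above, Spark would
    perform over S3 data before the summary lands in the warehouse.
    """
    counts = Counter(e["page"] for e in events)
    return dict(counts)


events = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/pricing"},
]
print(aggregate_clicks(events))  # {'/home': 2, '/pricing': 1}
```

The warehouse then stores the small aggregated table rather than the raw click firehose, which is why the prepared data is cheap to query from BI tools.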

Use case number two. The first two use cases are around data engineers, and then the third use case is going to be more for data science. The second use case is basically advanced data preparation on Snowflake and other data sources. What’s different in this example is that Qubole is taking the data from S3, assuming we’re using Amazon as our example. We’re pulling the data into Qubole, doing some kind of first-pass aggregation, maybe some modeling, on the data. Then we’re passing the data back to Snowflake, where you do more of the data warehousing type of work.

The third one. I’ll just introduce the final one, then I’ll hand over to Peter. The third one, like I said, is more about the data scientist persona. What you’ve probably noticed here is that we’re trying to produce value for different personas, whether it be a data engineer or a data scientist. This one is more around the data scientist. What we’re doing here, again, is loading the data into Snowflake. This is for customers who want to do more advanced analytics on their data.

As Peter said, Snowflake is very good for data warehousing, business-intelligence types of analytics. This is much more for a data scientist. They want to build models and test them out. They maybe want to write a much more complex personalization engine or data aggregation engine, and they can do that because they’re taking the data from Snowflake. Peter, do you want to finish off?
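To make the data scientist persona concrete: once the data is pulled out of Snowflake, the modeling work might start with something as simple as fitting a trend. Here is a toy least-squares fit in plain Python; in practice this would be done with a proper ML library on Qubole, and the numbers below are made up:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b, as a minimal trend model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance over variance; intercept follows from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# e.g. daily spend vs. conversions pulled from the warehouse (invented data)
a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

The point of the use case is not the particular model but the separation of roles: Snowflake serves the clean, governed data, and the heavier experimentation happens on the compute platform next to it.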

Great, everyone. Thank you for your time. We do have time for questions, or anyone can come up at the end. Can I just do a quick show of hands in terms of who uses the cloud here? A few people. Great. Okay. We’re just trying to get a gauge because, obviously, both companies are all-in on the cloud, and I think it’s important to see how people are adopting the cloud.

Hope that was useful. Our booth is just outside here, and the Snowflake booth is right next to us. We have some of our technical guys here who can give you a demo of Qubole, and Snowflake is showing their product as well. If there aren’t any questions for now, please feel free to come up and ask us. Does anyone have any questions before we finish? Great. Okay, everyone, thank you for your time. Hope that was useful.